TuneOut — Membership Inference Attacks on Large Language Models — 19p — Bergen Marshall, Carl D. Torbjornsson, Rohith Kandikatla, Linus von Ekensteen Lofgren
Commercial machine learning models are often trained on undisclosed data, the origins of which are frequently suspected to be questionable. Models may also be trained on personal or private information, whether deliberately or by accident. The ability to infer training data therefore opens a new attack surface on LLMs. A high-precision attack could not only confirm questionable data acquisition methods, but also expose personal and private information, prove copyright infringement, and reveal LLM benchmark doping. Membership inference is not unique to large language models; successful attacks have already been developed against traditional deep learning models. Yet despite their potentially large impact, membership inference attacks on large language models are still in their infancy. We present a new attack, TuneOut, and show how it could be used to extract training data from widely used large language models.
LLMs are known to suffer from severe overfitting, which impacts the classification performance of many current membership inference attacks. TuneOut addresses this problem and produces a robust classifier by filtering erroneous outliers. Removing these overtrained outliers allows TuneOut to use a simpler classification algorithm while still achieving results comparable to the current state of the art. We also analyze current validation methods and uncover serious flaws, revealing a discrepancy between the reported accuracy of state-of-the-art methods and their real-world results.
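To make the filtering idea concrete, the Python sketch below scores candidate samples by their loss under a target model, removes outliers with an interquartile-range rule, and then applies a simple loss threshold to classify members. This is a minimal illustration only: the IQR criterion, the threshold value, and all function names are our assumptions, not TuneOut's actual filtering rule, which the abstract does not specify.

    import numpy as np

    def filter_outliers(scores: np.ndarray, k: float = 1.5) -> np.ndarray:
        # Keep samples whose loss lies within k * IQR of the quartiles.
        # Both overtrained outliers (abnormally low loss) and erroneous
        # outliers (abnormally high loss) are dropped before classification.
        q1, q3 = np.percentile(scores, [25, 75])
        iqr = q3 - q1
        return (scores >= q1 - k * iqr) & (scores <= q3 + k * iqr)

    def infer_membership(scores: np.ndarray, threshold: float) -> np.ndarray:
        # Simple threshold classifier on the filtered set:
        # low loss suggests the sample was seen during training.
        mask = filter_outliers(scores)
        labels = np.full(scores.shape, -1)          # -1: filtered out / abstain
        labels[mask] = (scores[mask] < threshold).astype(int)  # 1: member, 0: non-member
        return labels

    # Example with synthetic per-sample losses (hypothetical values):
    rng = np.random.default_rng(0)
    scores = np.concatenate([
        rng.normal(2.0, 0.3, 50),    # members: lower loss on average
        rng.normal(3.0, 0.3, 50),    # non-members: higher loss
        np.array([0.01, 9.5]),       # overtrained / erroneous outliers
    ])
    print(infer_membership(scores, threshold=2.5))

In this sketch the two injected outliers are excluded rather than classified, which is the property the abstract attributes to filtering: the remaining distribution is clean enough for a plain threshold to separate members from non-members.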
New membership inference attacks are being developed rapidly and, if successful, will have a large impact on the field. Robust evaluation metrics and benchmarks are paramount for constructing accurate risk assessments, a necessity for AI safety. Our contribution is twofold: the TuneOut attack, which demonstrates the importance of filtering outliers, and a more accurate evaluation metric that reflects the real-world risk of membership inference attacks.
Dakota State University
Mark Spanier