Safety and Security of Large-Scale Machine Learning Systems

With the increasingly widespread deployment of large-scale machine learning systems, especially large language model (LLM) systems such as ChatGPT, there is a growing risk not only of security breaches in the deployment of these applications, but also to the safety of users and vulnerable subpopulations. Our research focuses mainly on near-term safety risks.
The distinction between safety and security is not always cleanly delineated, but we attempt a modern classification, developed with a large group of stakeholders, in [].
Examples of our recent work in safety include our proposed watermarking algorithms for LLMs [], a new research direction that aims to mark the outputs of generative machine learning models (a minimal sketch of this idea is given at the end of this paragraph). Closely related is our research on detectors of machine-generated text with low false-positive rates []. But over the last year, the group has also investigated security problems, such as membership inference through compromised models [], data poisoning attacks [], and jailbreaking attacks against LLMs []. We consider adversarial attacks against LLM systems, such as jailbreaks [], critical for future applications, and the evaluation of attacks, risk assessment, and defenses are the most pressing current topics of our research in Safety and Security of Large-Scale Machine Learning Systems.
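To illustrate the watermarking and detection direction, the following is a minimal, hypothetical sketch of one common "green list" style scheme: the vocabulary is pseudorandomly partitioned at each position using a secret key, generation is biased toward the green partition, and detection applies a one-sided z-test to the observed green-token fraction. The parameter names, seeding scheme, and threshold here are assumptions chosen for illustration, not the specific algorithms proposed in the cited works.

import hashlib
import numpy as np

# Illustrative parameters (assumptions, not settings from the cited work).
GAMMA = 0.5        # fraction of the vocabulary marked "green" at each position
VOCAB_SIZE = 50_000

def green_set_seed(prev_token: int, key: int = 1234) -> int:
    """Derive a per-position seed from the previous token and a secret key."""
    digest = hashlib.sha256(f"{key}:{prev_token}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def is_green(token: int, prev_token: int) -> bool:
    """Pseudorandomly assign `token` to the green list for this position."""
    rng = np.random.default_rng(green_set_seed(prev_token))
    green = rng.permutation(VOCAB_SIZE)[: int(GAMMA * VOCAB_SIZE)]
    return token in set(green.tolist())

def detect_watermark(tokens: list[int], z_threshold: float = 4.0) -> tuple[float, bool]:
    """One-sided z-test: is the green-token fraction higher than chance (GAMMA)?

    A high threshold keeps the false-positive rate on unwatermarked text low.
    """
    hits = sum(is_green(t, p) for p, t in zip(tokens[:-1], tokens[1:]))
    n = len(tokens) - 1
    z = (hits - GAMMA * n) / np.sqrt(GAMMA * (1 - GAMMA) * n)
    return z, z > z_threshold

The property this sketch is meant to convey is that detection requires only the secret key and the text, not access to the model, and that the z-threshold directly controls the false-positive rate on unwatermarked text.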
Another looming question in safety that we consider important is model oversight. Our current work researches ways to quantify differences between highly capable models and to estimate the biases that arise when AI systems provide oversight over, or are used to improve, other AI systems. For example, language models used as judges of the output of other models, or used to annotate question-answer pairs for model training, may embed their own preferences into these tasks, leading to downstream effects on model oversight (a minimal bias-audit sketch follows below).
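As a concrete example of the kind of bias measurement involved, the sketch below audits a hypothetical LLM judge for order bias by swapping the two candidate answers and counting inconsistent verdicts. The `judge` callable and its "A"/"B" return convention are assumptions for illustration and do not correspond to a specific system from our work.

from collections import Counter

def judge_bias_audit(judge, pairs):
    """Estimate order bias of an LLM judge via a swap test.

    `judge(prompt, answer_a, answer_b)` is assumed to return "A" or "B";
    a consistent judge should flip its verdict when the answers are swapped.
    """
    counts = Counter()
    for prompt, ans_1, ans_2 in pairs:
        first = judge(prompt, ans_1, ans_2)
        second = judge(prompt, ans_2, ans_1)
        if first == "A" and second == "B":
            counts["prefers_1"] += 1
        elif first == "B" and second == "A":
            counts["prefers_2"] += 1
        else:
            counts["inconsistent"] += 1  # verdict depends on presentation order
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

A high fraction of inconsistent verdicts indicates that the judge's preferences depend on presentation rather than content, which is exactly the kind of bias that can propagate into models trained or evaluated with such oversight.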