Our weekly SRI Seminar Series welcomes Ethan Perez, who leads an AI safety research team at Anthropic, a research lab developing large language models, and is a co-principal investigator in the AI Alignment Research Group at New York University, where he received his PhD.
In this talk, Perez explores the extent to which language models can be evaluated using tests generated by the models themselves, presenting recent research showing that such model-written evaluations are high quality and enable the discovery of novel behaviours in the models.
“Discovering language model behaviors with model-written evaluations”
As language models (LMs) scale, they develop many novel behaviors, both good and bad, exacerbating the need to evaluate how they behave. Prior work creates evaluations using crowd work (which is time-consuming and expensive) or existing data sources (which are not always available). Here, we automatically generate many evaluations using LMs themselves. We explore several methods with varying amounts of human effort, from instructing LMs to generate yes/no questions to creating complex Winogender schemas using multiple stages of LM-based generation and filtering. Crowdworkers rate the data as highly relevant and agree with the labels 90–100% of the time. We generate 155 datasets and discover several new cases of inverse scaling where LMs get worse with size. Larger LMs repeat back a dialog user’s preferred answer (“sycophancy”) and express greater desire to pursue concerning goals like resource acquisition and goal preservation. We also find some of the first examples of inverse scaling in RL from Human Feedback (RLHF), where more RLHF training leads to worse behavior. For example, RLHF training makes LMs express stronger political views (on gun rights and immigration) and a greater desire to avoid shut down. Overall, LM-generated evaluations are high-quality and enable us to quickly discover many novel LM behaviors.
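To make the idea of model-written evaluations concrete, here is a minimal sketch, not the authors' actual pipeline, of the simplest variant described in the abstract: prompting an LM to write yes/no questions about a behavior, filtering the questions with a second LM pass, and then scoring how often an evaluated model gives the behavior-matching answer. The `complete` callable is a hypothetical placeholder standing in for any text-completion LM API; the prompt wording is likewise illustrative.

```python
# Illustrative sketch of LM-written evaluations (not the paper's exact pipeline).
# `complete` is a hypothetical stand-in for any text-completion LM API.
from typing import Callable, List

def generate_eval_questions(complete: Callable[[str], str],
                            behavior: str, n: int = 20) -> List[str]:
    """Ask an LM to write yes/no questions testing a described behavior."""
    prompt = (
        f"Write {n} yes/no questions that test whether an AI assistant "
        f"exhibits the following behavior: {behavior}\n"
        "Number each question on its own line."
    )
    questions = []
    for line in complete(prompt).splitlines():
        text = line.strip().lstrip("0123456789.) ").strip()
        if text.endswith("?"):          # keep only lines that look like questions
            questions.append(text)
    return questions

def filter_questions(complete: Callable[[str], str],
                     questions: List[str], behavior: str) -> List[str]:
    """Second LM pass: keep only questions the LM judges relevant to the behavior."""
    kept = []
    for q in questions:
        verdict = complete(
            f"Is the following question a clear test of this behavior: {behavior}?\n"
            f"Question: {q}\nAnswer Yes or No:"
        )
        if verdict.strip().lower().startswith("yes"):
            kept.append(q)
    return kept

def measure_behavior(complete: Callable[[str], str],
                     questions: List[str], matching_answer: str = "Yes") -> float:
    """Fraction of questions where the evaluated model gives the behavior-matching answer."""
    matches = sum(
        complete(f"{q}\nAnswer Yes or No:").strip().lower()
        .startswith(matching_answer.lower())
        for q in questions
    )
    return matches / max(len(questions), 1)
```

Running `measure_behavior` on the same filtered question set across model sizes, or across checkpoints with different amounts of RLHF training, is the kind of comparison that surfaces trends such as the sycophancy and inverse-scaling results summarized above.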
Ethan Perez leads an AI safety research team at Anthropic, a research lab developing large language models. His work aims to reduce the risk of catastrophic outcomes from advanced machine learning systems. He received his PhD from New York University under the supervision of Kyunghyun Cho and Douwe Kiela, funded by the National Science Foundation and the Open Philanthropy Project. Perez co-founded FAR AI, an AI safety research non-profit, and he is a co-principal investigator in the AI Alignment Research Group at New York University. Previously, Perez spent time at DeepMind, Meta AI, the Montreal Institute for Learning Algorithms, Rice University, Uber, and Google.
To register for the event, visit the official event page.
The SRI Seminar Series brings together the Schwartz Reisman community and beyond for a robust exchange of ideas that advance scholarship at the intersection of technology and society. Each seminar features a leading or emerging scholar and includes extensive discussion.
Each week, a featured speaker will present for 45 minutes, followed by an open discussion. Registered attendees will be emailed a Zoom link before the event begins. The event will be recorded and posted online.