SRI Seminar Series: Ethan Perez, “Discovering language model behaviors with model-written evaluations”
In this talk, Perez explores the extent to which language models can be evaluated using evaluations generated by the models themselves. He presents recent research that reports successful results with this approach and shows that it enables the discovery of novel behaviours within the models.