LLM: Model Interpretability

Overview

This theme investigates how to understand, interpret, and control the internal mechanisms of large language models. Two ongoing directions anchor the current work: characterizing and mitigating persona drift when multiple LLMs interact over multi-turn conversations, and establishing safety guarantees for coding world models.

Motivation

As large language models are increasingly deployed in multi-turn and multi-agent settings, understanding their internal behavior becomes essential for reliability and safety. This theme focuses on interpreting how LLMs represent and maintain their behavior over the course of an interaction, and on turning that understanding into guarantees about model behavior.

Ongoing Projects

Persona drift in multi-turn, multi-agent conversation. Studying how the persona or behavior of each LLM shifts when multiple LLMs interact over extended multi-turn conversations, with the aim of characterizing and mitigating that drift.
Safety guarantees for coding world models. Investigating how to establish safety guarantees for coding-oriented world models.

Related Publications

Publication records will be added here as project outputs are released.

Impact Holders

Impact holders and user communities will be added as the project scope becomes clearer.