Ideas

AI Safety & Mechanistic Interpretability

Research notes and projects on AI safety, evaluations, goal drift, and mechanistic interpretability of model behavior.