Exploring the Robustness of Model-Graded Evaluations and Automated Interpretability

26 November 2023

Papers citing "Exploring the Robustness of Model-Graded Evaluations and Automated Interpretability"


Deceptive Automated Interpretability: Language Models Coordinating to Fool Oversight Systems. Simon Lermen, Mateusz Dziemian, Natalia Pérez-Campanero Antolín. 10 Apr 2025.
Harmonic LLMs are Trustworthy. Nicholas S. Kersting, Mohammad Rahman, Suchismitha Vedala, Yang Wang. 30 Apr 2024.