2312.03721
Cited By
Exploring the Robustness of Model-Graded Evaluations and Automated Interpretability
26 November 2023
Simon Lermen
Ondřej Kvapil
ELM
AAML
Papers citing
"Exploring the Robustness of Model-Graded Evaluations and Automated Interpretability"
2 / 2 papers shown
Title
Deceptive Automated Interpretability: Language Models Coordinating to Fool Oversight Systems
Simon Lermen
Mateusz Dziemian
Natalia Pérez-Campanero Antolín
38
0
0
10 Apr 2025
Harmonic LLMs are Trustworthy
Nicholas S. Kersting
Mohammad Rahman
Suchismitha Vedala
Yang Wang
45
0
0
30 Apr 2024