2410.13787
Looking Inward: Language Models Can Learn About Themselves by Introspection
17 October 2024
Felix J Binder
James Chua
Tomek Korbak
Henry Sleight
John Hughes
Robert Long
Ethan Perez
Miles Turpin
Owain Evans
KELM
AIFin
LRM
Papers citing "Looking Inward: Language Models Can Learn About Themselves by Introspection"
12 / 12 papers shown
Title
Self-Interpretability: LLMs Can Describe Complex Internal Processes that Drive Their Decisions, and Improve with Training
Dillon Plunkett
Adam Morris
Keerthi Reddy
Jorge Morales
MILM
35
0
0
21 May 2025
AI Sandbagging: Language Models can Strategically Underperform on Evaluations
Teun van der Weij
Felix Hofstätter
Ollie Jaffe
Samuel F. Brown
Francis Rhys Ward
ELM
71
27
0
11 Jun 2024
Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models
Samuel Marks
Can Rager
Eric J. Michaud
Yonatan Belinkov
David Bau
Aaron Mueller
111
146
0
28 Mar 2024
Bias-Augmented Consistency Training Reduces Biased Reasoning in Chain-of-Thought
James Chua
Edward Rees
Hunar Batra
Samuel R. Bowman
Julian Michael
Ethan Perez
Miles Turpin
LRM
99
13
0
08 Mar 2024
Do Large Language Models Latently Perform Multi-Hop Reasoning?
Sohee Yang
E. Gribovskaya
Nora Kassner
Mor Geva
Sebastian Riedel
ReLM
LRM
91
102
0
26 Feb 2024
Tell, don't show: Declarative facts influence how LLMs generalize
Alexander Meinke
Owain Evans
42
7
0
12 Dec 2023
Language Models (Mostly) Know What They Know
Saurav Kadavath
Tom Conerly
Amanda Askell
T. Henighan
Dawn Drain
...
Nicholas Joseph
Benjamin Mann
Sam McCandlish
C. Olah
Jared Kaplan
ELM
101
809
0
11 Jul 2022
Locating and Editing Factual Associations in GPT
Kevin Meng
David Bau
A. Andonian
Yonatan Belinkov
KELM
215
1,344
0
10 Feb 2022
A General Language Assistant as a Laboratory for Alignment
Amanda Askell
Yuntao Bai
Anna Chen
Dawn Drain
Deep Ganguli
...
Tom B. Brown
Jack Clark
Sam McCandlish
C. Olah
Jared Kaplan
ALM
114
775
0
01 Dec 2021
Truthful AI: Developing and governing AI that does not lie
Owain Evans
Owen Cotton-Barratt
Lukas Finnveden
Adam Bales
Avital Balwit
Peter Wills
Luca Righetti
William Saunders
HILM
283
116
0
13 Oct 2021
LoRA: Low-Rank Adaptation of Large Language Models
J. E. Hu
Yelong Shen
Phillip Wallis
Zeyuan Allen-Zhu
Yuanzhi Li
Shean Wang
Lu Wang
Weizhu Chen
OffRL
AI4TS
AI4CE
ALM
AIMat
371
10,273
0
17 Jun 2021
Measuring and Improving Consistency in Pretrained Language Models
Yanai Elazar
Nora Kassner
Shauli Ravfogel
Abhilasha Ravichander
Eduard H. Hovy
Hinrich Schütze
Yoav Goldberg
HILM
314
366
0
01 Feb 2021