Looking Inward: Language Models Can Learn About Themselves by Introspection

17 October 2024
Felix J Binder, James Chua, Tomek Korbak, Henry Sleight, John Hughes, Robert Long, Ethan Perez, Miles Turpin, Owain Evans
Communities: KELM · AIFin · LRM
arXiv:2410.13787

Papers citing "Looking Inward: Language Models Can Learn About Themselves by Introspection"

12 / 12 papers shown

Self-Interpretability: LLMs Can Describe Complex Internal Processes that Drive Their Decisions, and Improve with Training
Dillon Plunkett, Adam Morris, Keerthi Reddy, Jorge Morales
MILM · 35 / 0 / 0 · 21 May 2025

AI Sandbagging: Language Models can Strategically Underperform on Evaluations
Teun van der Weij, Felix Hofstätter, Ollie Jaffe, Samuel F. Brown, Francis Rhys Ward
ELM · 71 / 27 / 0 · 11 Jun 2024

Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models
Samuel Marks, Can Rager, Eric J. Michaud, Yonatan Belinkov, David Bau, Aaron Mueller
111 / 146 / 0 · 28 Mar 2024

Bias-Augmented Consistency Training Reduces Biased Reasoning in Chain-of-Thought
James Chua, Edward Rees, Hunar Batra, Samuel R. Bowman, Julian Michael, Ethan Perez, Miles Turpin
LRM · 99 / 13 / 0 · 08 Mar 2024

Do Large Language Models Latently Perform Multi-Hop Reasoning?
Sohee Yang, E. Gribovskaya, Nora Kassner, Mor Geva, Sebastian Riedel
ReLM, LRM · 91 / 102 / 0 · 26 Feb 2024

Tell, don't show: Declarative facts influence how LLMs generalize
Alexander Meinke, Owain Evans
42 / 7 / 0 · 12 Dec 2023

Language Models (Mostly) Know What They Know
Saurav Kadavath, Tom Conerly, Amanda Askell, T. Henighan, Dawn Drain, ..., Nicholas Joseph, Benjamin Mann, Sam McCandlish, C. Olah, Jared Kaplan
ELM · 101 / 809 / 0 · 11 Jul 2022

Locating and Editing Factual Associations in GPT
Kevin Meng, David Bau, A. Andonian, Yonatan Belinkov
KELM · 215 / 1,344 / 0 · 10 Feb 2022

A General Language Assistant as a Laboratory for Alignment
Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, ..., Tom B. Brown, Jack Clark, Sam McCandlish, C. Olah, Jared Kaplan
ALM · 114 / 775 / 0 · 01 Dec 2021

Truthful AI: Developing and governing AI that does not lie
Owain Evans, Owen Cotton-Barratt, Lukas Finnveden, Adam Bales, Avital Balwit, Peter Wills, Luca Righetti, William Saunders
HILM · 283 / 116 / 0 · 13 Oct 2021

LoRA: Low-Rank Adaptation of Large Language Models
J. E. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen
OffRL, AI4TS, AI4CE, ALM, AIMat · 371 / 10,273 / 0 · 17 Jun 2021

Measuring and Improving Consistency in Pretrained Language Models
Yanai Elazar, Nora Kassner, Shauli Ravfogel, Abhilasha Ravichander, Eduard H. Hovy, Hinrich Schütze, Yoav Goldberg
HILM · 314 / 366 / 0 · 01 Feb 2021