Taken out of context: On measuring situational awareness in LLMs
Lukas Berglund, Asa Cooper Stickland, Mikita Balesni, Max Kaufmann, Meg Tong, Tomasz Korbak, Daniel Kokotajlo, Owain Evans
1 September 2023 · arXiv:2309.00667 · LLMAG, LRM

Papers citing "Taken out of context: On measuring situational awareness in LLMs"

47 / 47 papers shown

On the generalization of language models from in-context learning and finetuning: a controlled study
Andrew Kyle Lampinen, Arslan Chaudhry, Stephanie Chan, Cody Wild, Diane Wan, Alex Ku, Jorg Bornschein, Razvan Pascanu, Murray Shanahan, James L. McClelland
01 May 2025 · 0 citations
AI Awareness
Xianrui Li, Haoyuan Shi, Rongwu Xu, Wei Xu
25 Apr 2025 · 0 citations
How to evaluate control measures for LLM agents? A trajectory from today to superintelligence
Tomek Korbak, Mikita Balesni, Buck Shlegeris, Geoffrey Irving
07 Apr 2025 · ELM · 1 citation
From Style to Facts: Mapping the Boundaries of Knowledge Injection with Finetuning
Eric Zhao, Pranjal Awasthi, Nika Haghtalab
07 Mar 2025 · 0 citations
Order Doesn't Matter, But Reasoning Does: Training LLMs with Order-Centric Augmentation
Qianxi He, Qianyu He, Jiaqing Liang, Yanghua Xiao, Weikang Zhou, Zeye Sun, Fei Yu
27 Feb 2025 · LRM · 0 citations
Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
Jan Betley, Daniel Tan, Niels Warncke, Anna Sztyber-Betley, Xuchan Bao, Martín Soto, Nathan Labenz, Owain Evans
24 Feb 2025 · AAML · 12 citations
A sketch of an AI control safety case
Tomek Korbak, Joshua Clymer, Benjamin Hilton, Buck Shlegeris, Geoffrey Irving
28 Jan 2025 · 5 citations
Do Large Language Models Perform Latent Multi-Hop Reasoning without Exploiting Shortcuts?
Sohee Yang, Nora Kassner, E. Gribovskaya, Sebastian Riedel, Mor Geva
25 Nov 2024 · KELM, LRM, ReLM · 5 citations
The Two-Hop Curse: LLMs trained on A→B, B→C fail to learn A→C
Mikita Balesni, Tomek Korbak, Owain Evans
25 Nov 2024 · ReLM, LRM · 0 citations
Towards evaluations-based safety cases for AI scheming
Mikita Balesni, Marius Hobbhahn, David Lindner, Alexander Meinke, Tomek Korbak, ..., Dan Braun, Bilal Chughtai, Owain Evans, Daniel Kokotajlo, Lucius Bushnaq
29 Oct 2024 · ELM · 9 citations
From Imitation to Introspection: Probing Self-Consciousness in Language Models
Sirui Chen, Shu Yu, Shengjie Zhao, Chaochao Lu
24 Oct 2024 · MILM, LRM · 1 citation
Looking Inward: Language Models Can Learn About Themselves by Introspection
Felix J Binder, James Chua, Tomek Korbak, Henry Sleight, John Hughes, Robert Long, Ethan Perez, Miles Turpin, Owain Evans
17 Oct 2024 · KELM, AIFin, LRM · 13 citations
Moral Alignment for LLM Agents
Elizaveta Tennant, Stephen Hailes, Mirco Musolesi
02 Oct 2024 · 1 citation
On the Generalization of Preference Learning with DPO
Shawn Im, Yixuan Li
06 Aug 2024 · 1 citation
Future Events as Backdoor Triggers: Investigating Temporal Vulnerabilities in LLMs
Sara Price, Arjun Panickssery, Sam Bowman, Asa Cooper Stickland
04 Jul 2024 · LLMSV · 3 citations
Self-Cognition in Large Language Models: An Exploratory Study
Dongping Chen, Jiawen Shi, Yao Wan, Pan Zhou, Neil Zhenqiang Gong, Lichao Sun
01 Jul 2024 · LRM, LLMAG · 3 citations
LLMs Are Prone to Fallacies in Causal Inference
Nitish Joshi, Abulhair Saparov, Yixin Wang, He He
18 Jun 2024 · 10 citations
Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models
Carson E. Denison, M. MacDiarmid, Fazl Barez, David Duvenaud, Shauna Kravec, ..., Jared Kaplan, Buck Shlegeris, Samuel R. Bowman, Ethan Perez, Evan Hubinger
14 Jun 2024 · 36 citations
Limited Out-of-Context Knowledge Reasoning in Large Language Models
Peng Hu, Changjiang Gao, Ruiqi Gao, Jiajun Chen, Shujian Huang
11 Jun 2024 · LRM · 3 citations
LLM Evaluators Recognize and Favor Their Own Generations
Arjun Panickssery, Samuel R. Bowman, Shi Feng
15 Apr 2024 · 159 citations
Understanding the Learning Dynamics of Alignment with Human Feedback
Shawn Im, Yixuan Li
27 Mar 2024 · ALM · 11 citations
Reverse Training to Nurse the Reversal Curse
O. Yu. Golovneva, Zeyuan Allen-Zhu, Jason Weston, Sainbayar Sukhbaatar
20 Mar 2024 · 33 citations
Do Large Language Models Latently Perform Multi-Hop Reasoning?
Sohee Yang, E. Gribovskaya, Nora Kassner, Mor Geva, Sebastian Riedel
26 Feb 2024 · ReLM, LRM · 81 citations
Secret Collusion among Generative AI Agents: Multi-Agent Deception via Steganography
S. Motwani, Mikhail Baranchuk, Martin Strohmeier, Vijay Bolina, Philip Torr, Lewis Hammond, Christian Schroeder de Witt
12 Feb 2024 · 4 citations
I Think, Therefore I am: Benchmarking Awareness of Large Language Models Using AwareBench
Yuan Li, Yue Huang, Yuli Lin, Siyuan Wu, Yao Wan, Lichao Sun
31 Jan 2024 · LLMAG, ELM · 4 citations
Tell, don't show: Declarative facts influence how LLMs generalize
Alexander Meinke, Owain Evans
12 Dec 2023 · 7 citations
Exploring the Reversal Curse and Other Deductive Logical Reasoning in BERT and GPT-Based Large Language Models
Da Wu, Jing Yang, Kai Wang
06 Dec 2023 · LRM · 5 citations
Honesty Is the Best Policy: Defining and Mitigating AI Deception
Francis Rhys Ward, Francesco Belardinelli, Francesca Toni, Tom Everitt
03 Dec 2023 · 27 citations
Predictive Minds: LLMs As Atypical Active Inference Agents
Jan Kulveit, Clem von Stengel, Roman Leventov
16 Nov 2023 · LLMAG, KELM, LRM · 1 citation
Towards Evaluating AI Systems for Moral Status Using Self-Reports
Ethan Perez, Robert Long
14 Nov 2023 · ELM · 8 citations
Generalization Analogies: A Testbed for Generalizing AI Oversight to Hard-To-Measure Domains
Joshua Clymer, Garrett Baker, Rohan Subramani, Sam Wang
13 Nov 2023 · 6 citations
A Review of the Evidence for Existential Risk from AI via Misaligned Power-Seeking
Rose Hadshar
27 Oct 2023 · 6 citations
In-Context Learning Dynamics with Random Binary Sequences
Eric J. Bigelow, Ekdeep Singh Lubana, Robert P. Dick, Hidenori Tanaka, T. Ullman
26 Oct 2023 · 4 citations
Implicit meta-learning may lead language models to trust more reliable sources
Dmitrii Krasheninnikov, Egor Krasheninnikov, Bruno Mlodozeniec, Tegan Maharaj, David M. Krueger
23 Oct 2023 · 3 citations
Compositional preference models for aligning LMs
Dongyoung Go, Tomasz Korbak, Germán Kruszewski, Jos Rozen, Marc Dymetman
17 Oct 2023 · 15 citations
Welfare Diplomacy: Benchmarking Language Model Cooperation
Gabriel Mukobi, Hannah Erlebach, Niklas Lauffer, Lewis Hammond, Alan Chan, Jesse Clifton
13 Oct 2023 · LM&Ro · 13 citations
Conceptual Framework for Autonomous Cognitive Entities
David Shapiro, Wangfan Li, Manuel Delaflor, Carlos Toxtli
03 Oct 2023 · 1 citation
Physics of Language Models: Part 3.1, Knowledge Storage and Extraction
Zeyuan Allen-Zhu, Yuanzhi Li
25 Sep 2023 · KELM · 129 citations
The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A"
Lukas Berglund, Meg Tong, Max Kaufmann, Mikita Balesni, Asa Cooper Stickland, Tomasz Korbak, Owain Evans
21 Sep 2023 · LRM · 244 citations
Cognitive Mirage: A Review of Hallucinations in Large Language Models
Hongbin Ye, Tong Liu, Aijia Zhang, Wei Hua, Weiqiang Jia
13 Sep 2023 · HILM · 77 citations
Training language models to follow instructions with human feedback
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, ..., Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan J. Lowe
04 Mar 2022 · OSLM, ALM · 12,003 citations
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Jason W. Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, F. Xia, Ed H. Chi, Quoc Le, Denny Zhou
28 Jan 2022 · LM&Ro, LRM, AI4CE, ReLM · 8,650 citations
Fast Model Editing at Scale
E. Mitchell, Charles Lin, Antoine Bosselut, Chelsea Finn, Christopher D. Manning
21 Oct 2021 · KELM · 343 citations
Truthful AI: Developing and governing AI that does not lie
Owain Evans, Owen Cotton-Barratt, Lukas Finnveden, Adam Bales, Avital Balwit, Peter Wills, Luca Righetti, William Saunders
13 Oct 2021 · HILM · 111 citations
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, ..., Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, Connor Leahy
31 Dec 2020 · AIMat · 2,000 citations
Scaling Laws for Neural Language Models
Jared Kaplan, Sam McCandlish, T. Henighan, Tom B. Brown, B. Chess, R. Child, Scott Gray, Alec Radford, Jeff Wu, Dario Amodei
23 Jan 2020 · 4,505 citations
Language Models as Knowledge Bases?
Fabio Petroni, Tim Rocktaschel, Patrick Lewis, A. Bakhtin, Yuxiang Wu, Alexander H. Miller, Sebastian Riedel
03 Sep 2019 · KELM, AI4MH · 2,589 citations