Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2210.13382
Cited By
Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task
24 October 2022
Kenneth Li
Aspen K. Hopkins
David Bau
Fernanda Viégas
Hanspeter Pfister
Martin Wattenberg
MILM
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task"
50 / 56 papers shown
Title
Questioning Representational Optimism in Deep Learning: The Fractured Entangled Representation Hypothesis
Akarsh Kumar
Jeff Clune
Joel Lehman
Kenneth O. Stanley
OOD
21
0
0
16 May 2025
The MASK Benchmark: Disentangling Honesty From Accuracy in AI Systems
Richard Ren
Arunim Agarwal
Mantas Mazeika
Cristina Menghini
Robert Vacareanu
...
Matias Geralnik
Adam Khoja
Dean Lee
Summer Yue
Dan Hendrycks
HILM
ALM
90
0
0
05 Mar 2025
(How) Do Language Models Track State?
Belinda Z. Li
Zifan Carl Guo
Jacob Andreas
LRM
49
0
0
04 Mar 2025
Representation Engineering for Large-Language Models: Survey and Research Challenges
Lukasz Bartoszcze
Sarthak Munshi
Bryan Sukidi
Jennifer Yen
Zejia Yang
David Williams-King
Linh Le
Kosi Asuzu
Carsten Maple
102
0
0
24 Feb 2025
The Representation and Recall of Interwoven Structured Knowledge in LLMs: A Geometric and Layered Analysis
Ge Lei
Samuel J. Cooper
KELM
49
0
0
15 Feb 2025
Revisiting Rogers' Paradox in the Context of Human-AI Interaction
K. M. Collins
Umang Bhatt
Ilia Sucholutsky
49
1
0
16 Jan 2025
ICLR: In-Context Learning of Representations
Core Francisco Park
Andrew Lee
Ekdeep Singh Lubana
Yongyi Yang
Maya Okawa
Kento Nishi
Martin Wattenberg
Hidenori Tanaka
AIFin
120
3
0
29 Dec 2024
Transformers Use Causal World Models in Maze-Solving Tasks
Alex F Spies
William Edwards
Michael Ivanitskiy
Adrians Skapars
Tilman Rauker
Katsumi Inoue
A. Russo
Murray Shanahan
152
1
0
16 Dec 2024
Towards Unifying Interpretability and Control: Evaluation via Intervention
Usha Bhalla
Suraj Srinivas
Asma Ghandeharioun
Himabindu Lakkaraju
42
5
0
07 Nov 2024
All or None: Identifiable Linear Properties of Next-token Predictors in Language Modeling
Emanuele Marconato
Sébastien Lachapelle
Sebastian Weichwald
Luigi Gresele
69
3
0
30 Oct 2024
Do LLMs "know" internally when they follow instructions?
Juyeon Heo
Christina Heinze-Deml
Oussama Elachqar
Shirley Ren
Udhay Nallasamy
Andy Miller
Kwan Ho Ryan Chan
Jaya Narain
51
5
0
18 Oct 2024
Systems with Switching Causal Relations: A Meta-Causal Perspective
Moritz Willig
Tim Nelson Tobiasch
Florian Peter Busch
Jonas Seng
Devendra Singh Dhami
Kristian Kersting
CML
46
0
0
16 Oct 2024
Analyzing (In)Abilities of SAEs via Formal Languages
Abhinav Menon
Manish Shrivastava
David M. Krueger
Ekdeep Singh Lubana
50
7
0
15 Oct 2024
The Geometry of Concepts: Sparse Autoencoder Feature Structure
Yuxiao Li
Eric J. Michaud
David D. Baek
Joshua Engels
Xiaoqing Sun
Max Tegmark
55
7
0
10 Oct 2024
Organizing Unstructured Image Collections using Natural Language
Mingxuan Liu
Zhun Zhong
Jun Li
Gianni Franchi
Subhankar Roy
Elisa Ricci
VLM
39
3
0
07 Oct 2024
Interpreting and Editing Vision-Language Representations to Mitigate Hallucinations
Nick Jiang
Anish Kachinthaya
Suzie Petryk
Yossi Gandelsman
VLM
34
16
0
03 Oct 2024
Beyond Single Concept Vector: Modeling Concept Subspace in LLMs with Gaussian Distribution
Haiyan Zhao
Heng Zhao
Bo Shen
Ali Payani
Fan Yang
Mengnan Du
59
2
0
30 Sep 2024
Counterfactual Token Generation in Large Language Models
Ivi Chatzi
N. C. Benz
Eleni Straitouri
Stratis Tsirtsis
Manuel Gomez Rodriguez
LRM
34
3
0
25 Sep 2024
Can Transformers Do Enumerative Geometry?
Baran Hashemi
Roderic G. Corominas
Alessandro Giacchetto
44
2
0
27 Aug 2024
Understanding Generative AI Content with Embedding Models
Max Vargas
Reilly Cannon
A. Engel
Anand D. Sarwate
Tony Chiang
54
3
0
19 Aug 2024
Probabilistic Parameter Estimators and Calibration Metrics for Pose Estimation from Image Features
Romeo Valentin
Sydney M. Katz
Joonghyun Lee
Don Walker
Matthew Sorgenfrei
Mykel J. Kochenderfer
36
0
0
23 Jul 2024
Transformer Circuit Faithfulness Metrics are not Robust
Joseph Miller
Bilal Chughtai
William Saunders
53
7
0
11 Jul 2024
A Text-to-Game Engine for UGC-Based Role-Playing Games
Lei Zhang
Xuezheng Peng
Shuyi Yang
Feiyang Wang
37
1
0
11 Jul 2024
Monitoring Latent World States in Language Models with Propositional Probes
Jiahai Feng
Stuart Russell
Jacob Steinhardt
HILM
46
8
0
27 Jun 2024
Does ChatGPT Have a Mind?
Simon Goldstein
B. Levinstein
AI4MH
LRM
42
5
0
27 Jun 2024
Brittle Minds, Fixable Activations: Understanding Belief Representations in Language Models
Matteo Bortoletto
Constantin Ruhdorfer
Lei Shi
Andreas Bulling
AI4MH
LRM
48
4
0
25 Jun 2024
Discovering Bias in Latent Space: An Unsupervised Debiasing Approach
Dyah Adila
Shuai Zhang
Boran Han
Yuyang Wang
AAML
LLMSV
34
6
0
05 Jun 2024
The Geometry of Categorical and Hierarchical Concepts in Large Language Models
Kiho Park
Yo Joong Choe
Yibo Jiang
Victor Veitch
50
27
0
03 Jun 2024
Standards for Belief Representations in LLMs
Daniel A. Herrmann
B. Levinstein
42
7
0
31 May 2024
Vista: A Generalizable Driving World Model with High Fidelity and Versatile Controllability
Shenyuan Gao
Jiazhi Yang
Li Chen
Kashyap Chitta
Yihang Qiu
Andreas Geiger
Jun Zhang
Hongyang Li
71
75
0
27 May 2024
What is it for a Machine Learning Model to Have a Capability?
Jacqueline Harding
Nathaniel Sharadin
ELM
40
3
0
14 May 2024
Test-Time Model Adaptation with Only Forward Passes
Shuaicheng Niu
Chunyan Miao
Guohao Chen
Pengcheng Wu
Peilin Zhao
TTA
43
19
0
02 Apr 2024
A Survey on Large Language Model-Based Game Agents
Sihao Hu
Tiansheng Huang
Gaowen Liu
Ramana Rao Kompella
Gaowen Liu
Selim Furkan Tekin
Yichang Xu
Zachary Yahn
Ling Liu
LLMAG
LM&Ro
AI4CE
LM&MA
71
51
0
02 Apr 2024
Language Models Represent Beliefs of Self and Others
Wentao Zhu
Zhining Zhang
Yizhou Wang
MILM
LRM
50
7
0
28 Feb 2024
Foot In The Door: Understanding Large Language Model Jailbreaking via Cognitive Psychology
Zhenhua Wang
Wei Xie
Baosheng Wang
Enze Wang
Zhiwen Gui
Shuoyoucheng Ma
Kai Chen
36
14
0
24 Feb 2024
Opening the AI black box: program synthesis via mechanistic interpretability
Eric J. Michaud
Isaac Liao
Vedang Lad
Ziming Liu
Anish Mudide
Chloe Loughridge
Zifan Carl Guo
Tara Rezaei Kheirkhah
Mateja Vukelić
Max Tegmark
23
12
0
07 Feb 2024
Learning Universal Predictors
Jordi Grau-Moya
Tim Genewein
Marcus Hutter
Laurent Orseau
Grégoire Delétang
...
Anian Ruoss
Wenliang Kevin Li
Christopher Mattern
Matthew Aitchison
J. Veness
27
11
0
26 Jan 2024
Labeling Neural Representations with Inverse Recognition
Kirill Bykov
Laura Kopf
Shinichi Nakajima
Marius Kloft
Marina M.-C. Höhne
BDL
29
15
0
22 Nov 2023
Compositional Capabilities of Autoregressive Transformers: A Study on Synthetic, Interpretable Tasks
Rahul Ramesh
Ekdeep Singh Lubana
Mikail Khona
Robert P. Dick
Hidenori Tanaka
CoGe
39
7
0
21 Nov 2023
Predictive Minds: LLMs As Atypical Active Inference Agents
Jan Kulveit
Clem von Stengel
Roman Leventov
LLMAG
KELM
LRM
44
1
0
16 Nov 2023
Divergences between Language Models and Human Brains
Yuchen Zhou
Emmy Liu
Graham Neubig
Michael J. Tarr
Leila Wehbe
35
1
0
15 Nov 2023
Uncovering Intermediate Variables in Transformers using Circuit Probing
Michael A. Lepori
Thomas Serre
Ellie Pavlick
75
7
0
07 Nov 2023
Language Models Represent Space and Time
Wes Gurnee
Max Tegmark
47
142
0
03 Oct 2023
Towards Best Practices of Activation Patching in Language Models: Metrics and Methods
Fred Zhang
Neel Nanda
LLMSV
36
100
0
27 Sep 2023
Separate the Wheat from the Chaff: Model Deficiency Unlearning via Parameter-Efficient Module Operation
Xinshuo Hu
Dongfang Li
Baotian Hu
Zihao Zheng
Zhenyu Liu
Hao Fei
KELM
MU
33
26
0
16 Aug 2023
Domain-specific ChatBots for Science using Embeddings
Kevin G. Yager
32
8
0
15 Jun 2023
Passive learning of active causal strategies in agents and language models
Andrew Kyle Lampinen
Stephanie C. Y. Chan
Ishita Dasgupta
A. Nam
Jane X. Wang
29
15
0
25 May 2023
The Vector Grounding Problem
Dimitri Coelho Mollo
Raphael Milliere
44
26
0
04 Apr 2023
Eliciting Latent Predictions from Transformers with the Tuned Lens
Nora Belrose
Zach Furman
Logan Smith
Danny Halawi
Igor V. Ostrovsky
Lev McKinney
Stella Biderman
Jacob Steinhardt
22
194
0
14 Mar 2023
A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations
Bilal Chughtai
Lawrence Chan
Neel Nanda
21
96
0
06 Feb 2023
1
2
Next