Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task

24 October 2022

Papers citing "Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task"

50 / 56 papers shown

Questioning Representational Optimism in Deep Learning: The Fractured Entangled Representation Hypothesis Akarsh Kumar Jeff Clune Joel Lehman Kenneth O. Stanley OOD 21 0 0 16 May 2025
The MASK Benchmark: Disentangling Honesty From Accuracy in AI Systems Richard Ren Arunim Agarwal Mantas Mazeika Cristina Menghini Robert Vacareanu ... Matias Geralnik Adam Khoja Dean Lee Summer Yue Dan Hendrycks HILM ALM 90 0 0 05 Mar 2025
(How) Do Language Models Track State? Belinda Z. Li Zifan Carl Guo Jacob Andreas LRM 49 0 0 04 Mar 2025
Representation Engineering for Large-Language Models: Survey and Research Challenges Lukasz Bartoszcze Sarthak Munshi Bryan Sukidi Jennifer Yen Zejia Yang David Williams-King Linh Le Kosi Asuzu Carsten Maple 102 0 0 24 Feb 2025
The Representation and Recall of Interwoven Structured Knowledge in LLMs: A Geometric and Layered Analysis Ge Lei Samuel J. Cooper KELM 49 0 0 15 Feb 2025
Revisiting Rogers' Paradox in the Context of Human-AI Interaction K. M. Collins Umang Bhatt Ilia Sucholutsky 49 1 0 16 Jan 2025
ICLR: In-Context Learning of Representations Core Francisco Park Andrew Lee Ekdeep Singh Lubana Yongyi Yang Maya Okawa Kento Nishi Martin Wattenberg Hidenori Tanaka AIFin 120 3 0 29 Dec 2024
Transformers Use Causal World Models in Maze-Solving Tasks Alex F Spies William Edwards Michael Ivanitskiy Adrians Skapars Tilman Rauker Katsumi Inoue A. Russo Murray Shanahan 152 1 0 16 Dec 2024
Towards Unifying Interpretability and Control: Evaluation via Intervention Usha Bhalla Suraj Srinivas Asma Ghandeharioun Himabindu Lakkaraju 42 5 0 07 Nov 2024
All or None: Identifiable Linear Properties of Next-token Predictors in Language Modeling Emanuele Marconato Sébastien Lachapelle Sebastian Weichwald Luigi Gresele 69 3 0 30 Oct 2024
Do LLMs "know" internally when they follow instructions? Juyeon Heo Christina Heinze-Deml Oussama Elachqar Shirley Ren Udhay Nallasamy Andy Miller Kwan Ho Ryan Chan Jaya Narain 51 5 0 18 Oct 2024
Systems with Switching Causal Relations: A Meta-Causal Perspective Moritz Willig Tim Nelson Tobiasch Florian Peter Busch Jonas Seng Devendra Singh Dhami Kristian Kersting CML 46 0 0 16 Oct 2024
Analyzing (In)Abilities of SAEs via Formal Languages Abhinav Menon Manish Shrivastava David M. Krueger Ekdeep Singh Lubana 50 7 0 15 Oct 2024
The Geometry of Concepts: Sparse Autoencoder Feature Structure Yuxiao Li Eric J. Michaud David D. Baek Joshua Engels Xiaoqing Sun Max Tegmark 55 7 0 10 Oct 2024
Organizing Unstructured Image Collections using Natural Language Mingxuan Liu Zhun Zhong Jun Li Gianni Franchi Subhankar Roy Elisa Ricci VLM 39 3 0 07 Oct 2024
Interpreting and Editing Vision-Language Representations to Mitigate Hallucinations Nick Jiang Anish Kachinthaya Suzie Petryk Yossi Gandelsman VLM 34 16 0 03 Oct 2024
Beyond Single Concept Vector: Modeling Concept Subspace in LLMs with Gaussian Distribution Haiyan Zhao Heng Zhao Bo Shen Ali Payani Fan Yang Mengnan Du 59 2 0 30 Sep 2024
Counterfactual Token Generation in Large Language Models Ivi Chatzi N. C. Benz Eleni Straitouri Stratis Tsirtsis Manuel Gomez Rodriguez LRM 34 3 0 25 Sep 2024
Can Transformers Do Enumerative Geometry? Baran Hashemi Roderic G. Corominas Alessandro Giacchetto 44 2 0 27 Aug 2024
Understanding Generative AI Content with Embedding Models Max Vargas Reilly Cannon A. Engel Anand D. Sarwate Tony Chiang 54 3 0 19 Aug 2024
Probabilistic Parameter Estimators and Calibration Metrics for Pose Estimation from Image Features Romeo Valentin Sydney M. Katz Joonghyun Lee Don Walker Matthew Sorgenfrei Mykel J. Kochenderfer 36 0 0 23 Jul 2024
Transformer Circuit Faithfulness Metrics are not Robust Joseph Miller Bilal Chughtai William Saunders 53 7 0 11 Jul 2024
A Text-to-Game Engine for UGC-Based Role-Playing Games Lei Zhang Xuezheng Peng Shuyi Yang Feiyang Wang 37 1 0 11 Jul 2024
Monitoring Latent World States in Language Models with Propositional Probes Jiahai Feng Stuart Russell Jacob Steinhardt HILM 46 8 0 27 Jun 2024
Does ChatGPT Have a Mind? Simon Goldstein B. Levinstein AI4MH LRM 42 5 0 27 Jun 2024
Brittle Minds, Fixable Activations: Understanding Belief Representations in Language Models Matteo Bortoletto Constantin Ruhdorfer Lei Shi Andreas Bulling AI4MH LRM 48 4 0 25 Jun 2024
Discovering Bias in Latent Space: An Unsupervised Debiasing Approach Dyah Adila Shuai Zhang Boran Han Yuyang Wang AAML LLMSV 34 6 0 05 Jun 2024
The Geometry of Categorical and Hierarchical Concepts in Large Language Models Kiho Park Yo Joong Choe Yibo Jiang Victor Veitch 50 27 0 03 Jun 2024
Standards for Belief Representations in LLMs Daniel A. Herrmann B. Levinstein 42 7 0 31 May 2024
Vista: A Generalizable Driving World Model with High Fidelity and Versatile Controllability Shenyuan Gao Jiazhi Yang Li Chen Kashyap Chitta Yihang Qiu Andreas Geiger Jun Zhang Hongyang Li 71 75 0 27 May 2024
What is it for a Machine Learning Model to Have a Capability? Jacqueline Harding Nathaniel Sharadin ELM 40 3 0 14 May 2024
Test-Time Model Adaptation with Only Forward Passes Shuaicheng Niu Chunyan Miao Guohao Chen Pengcheng Wu Peilin Zhao TTA 43 19 0 02 Apr 2024
A Survey on Large Language Model-Based Game Agents Sihao Hu Tiansheng Huang Gaowen Liu Ramana Rao Kompella Selim Furkan Tekin Yichang Xu Zachary Yahn Ling Liu LLMAG LM&Ro AI4CE LM&MA 71 51 0 02 Apr 2024
Language Models Represent Beliefs of Self and Others Wentao Zhu Zhining Zhang Yizhou Wang MILM LRM 50 7 0 28 Feb 2024
Foot In The Door: Understanding Large Language Model Jailbreaking via Cognitive Psychology Zhenhua Wang Wei Xie Baosheng Wang Enze Wang Zhiwen Gui Shuoyoucheng Ma Kai Chen 36 14 0 24 Feb 2024
Opening the AI black box: program synthesis via mechanistic interpretability Eric J. Michaud Isaac Liao Vedang Lad Ziming Liu Anish Mudide Chloe Loughridge Zifan Carl Guo Tara Rezaei Kheirkhah Mateja Vukelić Max Tegmark 23 12 0 07 Feb 2024
Learning Universal Predictors Jordi Grau-Moya Tim Genewein Marcus Hutter Laurent Orseau Grégoire Delétang ... Anian Ruoss Wenliang Kevin Li Christopher Mattern Matthew Aitchison J. Veness 27 11 0 26 Jan 2024
Labeling Neural Representations with Inverse Recognition Kirill Bykov Laura Kopf Shinichi Nakajima Marius Kloft Marina M.-C. Höhne BDL 29 15 0 22 Nov 2023
Compositional Capabilities of Autoregressive Transformers: A Study on Synthetic, Interpretable Tasks Rahul Ramesh Ekdeep Singh Lubana Mikail Khona Robert P. Dick Hidenori Tanaka CoGe 39 7 0 21 Nov 2023
Predictive Minds: LLMs As Atypical Active Inference Agents Jan Kulveit Clem von Stengel Roman Leventov LLMAG KELM LRM 44 1 0 16 Nov 2023
Divergences between Language Models and Human Brains Yuchen Zhou Emmy Liu Graham Neubig Michael J. Tarr Leila Wehbe 35 1 0 15 Nov 2023
Uncovering Intermediate Variables in Transformers using Circuit Probing Michael A. Lepori Thomas Serre Ellie Pavlick 75 7 0 07 Nov 2023
Language Models Represent Space and Time Wes Gurnee Max Tegmark 47 142 0 03 Oct 2023
Towards Best Practices of Activation Patching in Language Models: Metrics and Methods Fred Zhang Neel Nanda LLMSV 36 100 0 27 Sep 2023
Separate the Wheat from the Chaff: Model Deficiency Unlearning via Parameter-Efficient Module Operation Xinshuo Hu Dongfang Li Baotian Hu Zihao Zheng Zhenyu Liu Hao Fei KELM MU 33 26 0 16 Aug 2023
Domain-specific ChatBots for Science using Embeddings Kevin G. Yager 32 8 0 15 Jun 2023
Passive learning of active causal strategies in agents and language models Andrew Kyle Lampinen Stephanie C. Y. Chan Ishita Dasgupta A. Nam Jane X. Wang 29 15 0 25 May 2023
The Vector Grounding Problem Dimitri Coelho Mollo Raphael Milliere 44 26 0 04 Apr 2023
Eliciting Latent Predictions from Transformers with the Tuned Lens Nora Belrose Zach Furman Logan Smith Danny Halawi Igor V. Ostrovsky Lev McKinney Stella Biderman Jacob Steinhardt 22 194 0 14 Mar 2023
A Toy Model of Universality: Reverse Engineering How Networks Learn Group Operations Bilal Chughtai Lawrence Chan Neel Nanda 21 96 0 06 Feb 2023