ResearchTrend.AI
Papers / 2212.03827 / Cited By
Discovering Latent Knowledge in Language Models Without Supervision

7 December 2022
Collin Burns
Haotian Ye
Dan Klein
Jacob Steinhardt

Papers citing "Discovering Latent Knowledge in Language Models Without Supervision"

50 / 269 papers shown
Mechanistically Interpreting a Transformer-based 2-SAT Solver: An Axiomatic Approach
Nils Palumbo
Ravi Mangal
Zifan Wang
Saranya Vijayakumar
Corina S. Pasareanu
Somesh Jha
41
1
0
18 Jul 2024
Analyzing the Generalization and Reliability of Steering Vectors
Daniel Tan
David Chanin
Aengus Lynch
Dimitrios Kanoulas
Brooks Paige
Adrià Garriga-Alonso
Robert Kirk
LLMSV
84
16
0
17 Jul 2024
States Hidden in Hidden States: LLMs Emerge Discrete State Representations Implicitly
Junhao Chen
Shengding Hu
Zhiyuan Liu
Maosong Sun
LRM
32
5
0
16 Jul 2024
Lookback Lens: Detecting and Mitigating Contextual Hallucinations in Large Language Models Using Only Attention Maps
Yung-Sung Chuang
Linlu Qiu
Cheng-Yu Hsieh
Ranjay Krishna
Yoon Kim
James R. Glass
HILM
18
33
0
09 Jul 2024
Truth is Universal: Robust Detection of Lies in LLMs
Lennart Bürger
Fred Hamprecht
B. Nadler
HILM
43
8
0
03 Jul 2024
Belief Revision: The Adaptability of Large Language Models Reasoning
Bryan Wilie
Samuel Cahyawijaya
Etsuko Ishii
Junxian He
Pascale Fung
KELM
LRM
39
1
0
28 Jun 2024
Monitoring Latent World States in Language Models with Propositional Probes
Jiahai Feng
Stuart Russell
Jacob Steinhardt
HILM
46
6
0
27 Jun 2024
Confidence Regulation Neurons in Language Models
Alessandro Stolfo
Ben Wu
Wes Gurnee
Yonatan Belinkov
Xingyi Song
Mrinmaya Sachan
Neel Nanda
29
12
0
24 Jun 2024
Semantic Entropy Probes: Robust and Cheap Hallucination Detection in LLMs
Jannik Kossen
Jiatong Han
Muhammed Razzak
Lisa Schut
Shreshth A. Malik
Yarin Gal
HILM
58
33
0
22 Jun 2024
Understanding Finetuning for Factual Knowledge Extraction
Gaurav R. Ghosal
Tatsunori Hashimoto
Aditi Raghunathan
44
12
0
20 Jun 2024
Factual Confidence of LLMs: on Reliability and Robustness of Current Estimators
Matéo Mahaut
Laura Aina
Paula Czarnowska
Momchil Hardalov
Thomas Müller
Lluís Marquez
HILM
32
11
0
19 Jun 2024
BeHonest: Benchmarking Honesty in Large Language Models
Steffi Chern
Zhulin Hu
Yuqing Yang
Ethan Chern
Yuan Guo
Jiahe Jin
Binjie Wang
Pengfei Liu
HILM
ALM
86
3
0
19 Jun 2024
Enhancing Language Model Factuality via Activation-Based Confidence Calibration and Guided Decoding
Xin Liu
Farima Fatahi Bayat
Lu Wang
31
4
0
19 Jun 2024
Locating and Extracting Relational Concepts in Large Language Models
Zijian Wang
Britney White
Chang Xu
KELM
43
0
0
19 Jun 2024
A Hopfieldian View-based Interpretation for Chain-of-Thought Reasoning
Lijie Hu
Liang Liu
Shu Yang
Xin Chen
Hongru Xiao
Mengdi Li
Pan Zhou
Muhammad Asif Ali
Di Wang
LRM
37
5
0
18 Jun 2024
Who's asking? User personas and the mechanics of latent misalignment
Asma Ghandeharioun
Ann Yuan
Marius Guerard
Emily Reif
Michael A. Lepori
Lucas Dixon
LLMSV
41
7
0
17 Jun 2024
InternalInspector $I^2$: Robust Confidence Estimation in LLMs through Internal States
Mohammad Beigi
Ying Shen
Runing Yang
Zihao Lin
Qifan Wang
Ankith Mohan
Jianfeng He
Ming Jin
Chang-Tien Lu
Lifu Huang
HILM
36
4
0
17 Jun 2024
Refusal in Language Models Is Mediated by a Single Direction
Andy Arditi
Oscar Obeso
Aaquib Syed
Daniel Paleka
Nina Panickssery
Wes Gurnee
Neel Nanda
50
135
0
17 Jun 2024
Large Language Models Must Be Taught to Know What They Don't Know
Sanyam Kapoor
Nate Gruver
Manley Roberts
Katherine Collins
Arka Pal
Umang Bhatt
Adrian Weller
Samuel Dooley
Micah Goldblum
Andrew Gordon Wilson
36
15
0
12 Jun 2024
Legend: Leveraging Representation Engineering to Annotate Safety Margin for Preference Datasets
Duanyu Feng
Bowen Qin
Chen Huang
Youcheng Huang
Zheng-Wei Zhang
Wenqiang Lei
44
2
0
12 Jun 2024
REAL Sampling: Boosting Factuality and Diversity of Open-Ended Generation via Asymptotic Entropy
Haw-Shiuan Chang
Nanyun Peng
Mohit Bansal
Anil Ramakrishna
Tagyoung Chung
HILM
42
2
0
11 Jun 2024
Estimating the Hallucination Rate of Generative AI
Andrew Jesson
Nicolas Beltran-Velez
Quentin Chu
Sweta Karlekar
Jannik Kossen
Yarin Gal
John P. Cunningham
David M. Blei
51
6
0
11 Jun 2024
PaCE: Parsimonious Concept Engineering for Large Language Models
Jinqi Luo
Tianjiao Ding
Kwan Ho Ryan Chan
D. Thaker
Aditya Chattopadhyay
Chris Callison-Burch
René Vidal
CVBM
35
7
0
06 Jun 2024
Contrastive Sparse Autoencoders for Interpreting Planning of Chess-Playing Agents
Yoann Poupart
34
0
0
06 Jun 2024
Discovering Bias in Latent Space: An Unsupervised Debiasing Approach
Dyah Adila
Shuai Zhang
Boran Han
Yuyang Wang
AAML
LLMSV
34
6
0
05 Jun 2024
Cycles of Thought: Measuring LLM Confidence through Stable Explanations
Evan Becker
Stefano Soatto
42
6
0
05 Jun 2024
To Believe or Not to Believe Your LLM
Yasin Abbasi-Yadkori
Ilja Kuzborskij
András György
Csaba Szepesvári
UQCV
59
40
0
04 Jun 2024
On the Intrinsic Self-Correction Capability of LLMs: Uncertainty and Latent Concept
Guangliang Liu
Haitao Mao
Bochuan Cao
Zhiyu Xue
K. Johnson
Jiliang Tang
Rongrong Wang
LRM
34
9
0
04 Jun 2024
Position: An Inner Interpretability Framework for AI Inspired by Lessons from Cognitive Neuroscience
Martina G. Vilas
Federico Adolfi
David Poeppel
Gemma Roig
45
5
0
03 Jun 2024
Unelicitable Backdoors in Language Models via Cryptographic Transformer Circuits
Andis Draguns
Andrew Gritsevskiy
S. Motwani
Charlie Rogers-Smith
Jeffrey Ladish
Christian Schroeder de Witt
40
2
0
03 Jun 2024
The Geometry of Categorical and Hierarchical Concepts in Large Language Models
Kiho Park
Yo Joong Choe
Yibo Jiang
Victor Veitch
50
26
0
03 Jun 2024
Standards for Belief Representations in LLMs
Daniel A. Herrmann
B. Levinstein
39
7
0
31 May 2024
Reverse Image Retrieval Cues Parametric Memory in Multimodal LLMs
Jialiang Xu
Michael Moor
J. Leskovec
32
2
0
29 May 2024
CtrlA: Adaptive Retrieval-Augmented Generation via Probe-Guided Control
Huanshuo Liu
Hao Zhang
Zhijiang Guo
Kuicai Dong
Xiangyang Li
Yi Quan Lee
Cong Zhang
Yong-jin Liu
3DV
33
1
0
29 May 2024
Efficient Model-agnostic Alignment via Bayesian Persuasion
Fengshuo Bai
Mingzhi Wang
Zhaowei Zhang
Boyuan Chen
Yinda Xu
Ying Wen
Yaodong Yang
50
3
0
29 May 2024
Calibrating Reasoning in Language Models with Internal Consistency
Zhihui Xie
Jizhou Guo
Tong Yu
Shuai Li
LRM
43
8
0
29 May 2024
Can Large Language Models Faithfully Express Their Intrinsic Uncertainty in Words?
G. Yona
Roee Aharoni
Mor Geva
HILM
44
17
0
27 May 2024
Adaptive Activation Steering: A Tuning-Free LLM Truthfulness Improvement Method for Diverse Hallucinations Categories
Tianlong Wang
Xianfeng Jiao
Yifan He
Zhongzhi Chen
Yinghao Zhu
Xu Chu
Junyi Gao
Yasha Wang
Liantao Ma
LLMSV
61
7
0
26 May 2024
Tiny Refinements Elicit Resilience: Toward Efficient Prefix-Model Against LLM Red-Teaming
Jiaxu Liu
Xiangyu Yin
Sihao Wu
Jianhong Wang
Meng Fang
Xinping Yi
Xiaowei Huang
34
4
0
21 May 2024
Spectral Editing of Activations for Large Language Model Alignment
Yifu Qiu
Zheng Zhao
Yftah Ziser
Anna Korhonen
E. Ponti
Shay B. Cohen
KELM
LLMSV
28
15
0
15 May 2024
Can Language Models Explain Their Own Classification Behavior?
Dane Sherburn
Bilal Chughtai
Owain Evans
42
1
0
13 May 2024
An Assessment of Model-On-Model Deception
Julius Heitkoetter
Michael Gerovitch
Laker Newhouse
39
3
0
10 May 2024
Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations?
Zorik Gekhman
G. Yona
Roee Aharoni
Matan Eyal
Amir Feder
Roi Reichart
Jonathan Herzig
52
103
0
09 May 2024
Binary Hypothesis Testing for Softmax Models and Leverage Score Models
Yeqi Gao
Yuzhou Gu
Zhao-quan Song
33
0
0
09 May 2024
Poser: Unmasking Alignment Faking LLMs by Manipulating Their Internals
Joshua Clymer
Caden Juang
Severin Field
CVBM
34
1
0
08 May 2024
A Causal Explainable Guardrails for Large Language Models
Zhixuan Chu
Yan Wang
Longfei Li
Zhibo Wang
Zhan Qin
Kui Ren
LLMSV
49
7
0
07 May 2024
Enhanced Language Model Truthfulness with Learnable Intervention and Uncertainty Expression
Farima Fatahi Bayat
Xin Liu
H. V. Jagadish
Lu Wang
HILM
KELM
27
2
0
01 May 2024
Truth-value judgment in language models: belief directions are context sensitive
Stefan F. Schouten
Peter Bloem
Ilia Markov
Piek Vossen
KELM
71
0
0
29 Apr 2024
Uncertainty Estimation and Quantification for LLMs: A Simple Supervised Approach
Linyu Liu
Yu Pan
Xiaocheng Li
Guanting Chen
30
24
0
24 Apr 2024
Mechanistic Interpretability for AI Safety -- A Review
Leonard Bereska
E. Gavves
AI4CE
40
112
0
22 Apr 2024