ResearchTrend.AI
© 2025 ResearchTrend.AI, All rights reserved.

arXiv:2312.01037 (Cited By)
Eliciting Latent Knowledge from Quirky Language Models
v3 (latest)
2 December 2023
Alex Troy Mallen, Madeline Brumley, Julia Kharchenko, Nora Belrose
HILM, RALM, KELM

Papers citing "Eliciting Latent Knowledge from Quirky Language Models"

27 citing papers shown
RL-Obfuscation: Can Language Models Learn to Evade Latent-Space Monitors?
Rohan Gupta, Erik Jenner
17 Jun 2025
Fine-Grained control over Music Generation with Activation Steering
Dipanshu Panda, Jayden Koshy Joe, Harshith M R, Swathi Narashiman, Pranay Mathur, Anish Veerakumar, Aniruddh Krishna, Keerthiharan A
LLMSV
11 Jun 2025
Probing the Geometry of Truth: Consistency and Generalization of Truth Directions in LLMs Across Logical Transformations and Question Answering Tasks
Yuntai Bao, Xuhong Zhang, Tianyu Du, Xinkui Zhao, Zhengwen Feng, Hao Peng, Jianwei Yin
HILM
01 Jun 2025
SafeSteer: Interpretable Safety Steering with Refusal-Evasion in LLMs
Shaona Ghosh, Amrita Bhattacharjee, Yftah Ziser, Christopher Parisien
LLMSV
01 Jun 2025
Disentangled Safety Adapters Enable Efficient Guardrails and Flexible Inference-Time Alignment
Kundan Krishna, Joseph Y Cheng, Charles Maalouf, Leon A Gatys
30 May 2025
JailBound: Jailbreaking Internal Safety Boundaries of Vision-Language Models
Jiaxin Song, Yixu Wang, Jie Li, Rui Yu, Yan Teng, Xingjun Ma, Yingchun Wang
AAML
26 May 2025
LLaMAs Have Feelings Too: Unveiling Sentiment and Emotion Representations in LLaMA Models Through Probing
Dario Di Palma, Alessandro De Bellis, Giovanni Servedio, Vito Walter Anelli, Fedelucio Narducci, Tommaso Di Noia
MILM
22 May 2025
Exploring How LLMs Capture and Represent Domain-Specific Knowledge
Mirian Hipolito Garcia, Camille Couturier, Daniel Madrigal Diaz, Ankur Mallick, Anastasios Kyrillidis, Robert Sim, Victor Rühle, Saravan Rajmohan
23 Apr 2025
Mechanistic Anomaly Detection for "Quirky" Language Models
David Johnston, Arkajyoti Chakraborty, Nora Belrose
09 Apr 2025
Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation
Bowen Baker, Joost Huizinga, Leo Gao, Zehao Dou, M. Guan, Aleksander Mądry, Wojciech Zaremba, J. Pachocki, David Farhi
LRM
14 Mar 2025
Research on Superalignment Should Advance Now with Parallel Optimization of Competence and Conformity
HyunJin Kim, Xiaoyuan Yi, Jing Yao, Muhua Huang, Jinyeong Bak, James Evans, Xing Xie
08 Mar 2025
Representation Engineering for Large-Language Models: Survey and Research Challenges
Lukasz Bartoszcze, Sarthak Munshi, Bryan Sukidi, Jennifer Yen, Zejia Yang, David Williams-King, Linh Le, Kosi Asuzu, Carsten Maple
24 Feb 2025
Does Representation Matter? Exploring Intermediate Layers in Large Language Models
Oscar Skean, Md Rifat Arefin, Yann LeCun, Ravid Shwartz-Ziv
12 Dec 2024
Steering Language Model Refusal with Sparse Autoencoders
Kyle O'Brien, David Majercak, Xavier Fernandes, Richard Edgar, Blake Bullwinkel, Jingya Chen, Harsha Nori, Dean Carignan, Eric Horvitz, Forough Poursabzi-Sangdeh
LLMSV
18 Nov 2024
Balancing Label Quantity and Quality for Scalable Elicitation
Alex Troy Mallen, Nora Belrose
17 Oct 2024
Mechanistic Unlearning: Robust Knowledge Unlearning and Editing via Mechanistic Localization
Phillip Guo, Aaquib Syed, Abhay Sheshadri, Aidan Ewart, Gintare Karolina Dziugaite
KELM, MU
16 Oct 2024
Towards Inference-time Category-wise Safety Steering for Large Language Models
Amrita Bhattacharjee, Shaona Ghosh, Traian Rebedea, Christopher Parisien
LLMSV
02 Oct 2024
Cluster-norm for Unsupervised Probing of Knowledge
Walter Laurito, Sharan Maiya, Grégoire Dhimoïla, Owen Yeung, Kaarel Hänni
26 Jul 2024
Investigating the Indirect Object Identification circuit in Mamba
Danielle Ensign, Adrià Garriga-Alonso
Mamba
19 Jul 2024
Analyzing the Generalization and Reliability of Steering Vectors
Daniel Tan, David Chanin, Aengus Lynch, Dimitrios Kanoulas, Brooks Paige, Adrià Garriga-Alonso, Robert Kirk
LLMSV
17 Jul 2024
Knowledge Overshadowing Causes Amalgamated Hallucination in Large Language Models
Yuji Zhang, Sha Li, Jiateng Liu, Pengfei Yu, Yi R. Fung, Jing Li, Manling Li, Heng Ji
10 Jul 2024
Token Erasure as a Footprint of Implicit Vocabulary Items in LLMs
Sheridan Feucht, David Atkinson, Byron C. Wallace, David Bau
28 Jun 2024
Monitoring Latent World States in Language Models with Propositional Probes
Jiahai Feng, Stuart Russell, Jacob Steinhardt
HILM
27 Jun 2024
AI Sandbagging: Language Models can Strategically Underperform on Evaluations
Teun van der Weij, Felix Hofstätter, Ollie Jaffe, Samuel F. Brown, Francis Rhys Ward
ELM
11 Jun 2024
Mechanistic Interpretability for AI Safety -- A Review
Leonard Bereska, E. Gavves
AI4CE
22 Apr 2024
Does Transformer Interpretability Transfer to RNNs?
Gonçalo Paulo, Thomas Marshall, Nora Belrose
09 Apr 2024
A Language Model's Guide Through Latent Space
Dimitri von Rütte, Sotiris Anagnostidis, Gregor Bachmann, Thomas Hofmann
22 Feb 2024