2312.06681
Steering Llama 2 via Contrastive Activation Addition
9 December 2023
Nina Rimsky
Nick Gabrieli
Julian Schulz
Meg Tong
Evan Hubinger
Alexander Matt Turner
LLMSV
Papers citing
"Steering Llama 2 via Contrastive Activation Addition"
50 / 130 papers shown
Title
Textual Steering Vectors Can Improve Visual Understanding in Multimodal Large Language Models
Woody Haosheng Gan
Deqing Fu
Julian Asilis
Ollie Liu
Dani Yogatama
Vatsal Sharan
Robin Jia
Willie Neiswanger
LLMSV
17
0
0
20 May 2025
Temporal Alignment of Time Sensitive Facts with Activation Engineering
Sanjay Govindan
Maurice Pagnucco
Yang Song
KELM
LLMSV
AI4CE
26
0
0
20 May 2025
Language Models use Lookbacks to Track Beliefs
Nikhil Prakash
Natalie Shapira
Arnab Sen Sharma
Christoph Riedl
Yonatan Belinkov
Tamar Rott Shaham
David Bau
Atticus Geiger
KELM
12
0
0
20 May 2025
Improving Multilingual Language Models by Aligning Representations through Steering
Omar Mahmoud
B. L. Semage
Thommen George Karimpanal
Santu Rana
LLMSV
9
0
0
19 May 2025
Understanding Cross-Lingual Inconsistency in Large Language Models
Zheng Wei Lim
Alham Fikri Aji
Trevor Cohn
LRM
7
0
0
19 May 2025
Contrastive Prompting Enhances Sentence Embeddings in LLMs through Inference-Time Steering
Zifeng Cheng
Zhonghui Wang
Yuchen Fu
Zhiwei Jiang
Yafeng Yin
Cong Wang
Qing Gu
17
0
0
19 May 2025
Truth Neurons
Haohang Li
Yupeng Cao
Yangyang Yu
Jordan W. Suchow
Zining Zhu
HILM
MILM
KELM
8
0
0
18 May 2025
ExpertSteer: Intervening in LLMs through Expert Knowledge
Weixuan Wang
Minghao Wu
Barry Haddow
Alexandra Birch
LLMSV
7
0
0
18 May 2025
Probing the Vulnerability of Large Language Models to Polysemantic Interventions
Bofan Gong
Shiyang Lai
Dawn Song
AAML
MILM
6
0
0
16 May 2025
Unveiling Language-Specific Features in Large Language Models via Sparse Autoencoders
Boyi Deng
Yidan Zhang
Baosong Yang
Fuli Feng
43
0
0
08 May 2025
Steerable Chatbots: Personalizing LLMs with Preference-Based Activation Steering
Jessica Y. Bo
Tianyu Xu
Ishan Chatterjee
Katrina Passarella-Ward
Achin Kulshrestha
D Shin
LLMSV
87
0
0
07 May 2025
Patterns and Mechanisms of Contrastive Activation Engineering
Yixiong Hao
Ayush Panda
Stepan Shabalin
Sheikh Abdur Raheem Ali
LLMSV
67
0
0
06 May 2025
On the Limitations of Steering in Language Model Alignment
Chebrolu Niranjan
Kokil Jaidka
G. Yeo
LLMSV
43
0
0
02 May 2025
Adaptive Helpfulness-Harmlessness Alignment with Preference Vectors
Ren-Wei Liang
Chin-Ting Hsu
Chan-Hung Yu
Saransh Agrawal
Shih-Cheng Huang
Shang-Tse Chen
Kuan-Hao Huang
Shao-Hua Sun
81
0
0
27 Apr 2025
Steering the CensorShip: Uncovering Representation Vectors for LLM "Thought" Control
Hannah Cyberey
David Evans
LLMSV
78
1
0
23 Apr 2025
Guillotine: Hypervisors for Isolating Malicious AIs
James Mickens
Sarah Radway
Ravi Netravali
30
0
0
22 Apr 2025
EasyEdit2: An Easy-to-use Steering Framework for Editing Large Language Models
Ziwen Xu
Shuxun Wang
Kewei Xu
Haoming Xu
Mengru Wang
Xinle Deng
Yunzhi Yao
Guozhou Zheng
H. Chen
Ningyu Zhang
KELM
LLMSV
214
0
0
21 Apr 2025
FairSteer: Inference Time Debiasing for LLMs with Dynamic Activation Steering
Heng Chang
Zhiting Fan
Ruizhe Chen
Xiaotang Gai
Luqi Gong
Yan Zhang
Zuozhu Liu
LLMSV
40
1
0
20 Apr 2025
The Geometry of Self-Verification in a Task-Specific Reasoning Model
Andrew Lee
Lihao Sun
Chris Wendler
Fernanda Viégas
Martin Wattenberg
LRM
34
0
0
19 Apr 2025
Identifying and Mitigating the Influence of the Prior Distribution in Large Language Models
Liyi Zhang
Veniamin Veselovsky
R. Thomas McCoy
Thomas Griffiths
61
0
0
17 Apr 2025
On Linear Representations and Pretraining Data Frequency in Language Models
Jack Merullo
Noah A. Smith
Sarah Wiegreffe
Yanai Elazar
44
0
0
16 Apr 2025
CLASH: Evaluating Language Models on Judging High-Stakes Dilemmas from Multiple Perspectives
Ayoung Lee
Ryan Sungmo Kwon
Peter Railton
Lu Wang
ELM
51
0
0
15 Apr 2025
Localized Cultural Knowledge is Conserved and Controllable in Large Language Models
V. Veselovsky
Berke Argin
Benedikt Stroebl
Chris Wendler
Robert West
James Evans
Thomas L. Griffiths
Arvind Narayanan
60
0
0
14 Apr 2025
AdaSteer: Your Aligned LLM is Inherently an Adaptive Jailbreak Defender
Weixiang Zhao
Jiahe Guo
Yulin Hu
Yang Deng
An Zhang
...
Xinyang Han
Yanyan Zhao
Bing Qin
Tat-Seng Chua
Ting Liu
AAML
LLMSV
43
1
0
13 Apr 2025
Activation Patching for Interpretable Steering in Music Generation
Simone Facchiano
Giorgio Strano
Donato Crisostomi
Irene Tallini
Tommaso Mencattini
Fabio Galasso
Emanuele Rodolà
LLMSV
29
0
0
06 Apr 2025
Steering off Course: Reliability Challenges in Steering Language Models
Patrick Queiroz Da Silva
Hari Sethuraman
Dheeraj Rajagopal
Hannaneh Hajishirzi
Sachin Kumar
LLMSV
37
1
0
06 Apr 2025
Towards Understanding and Improving Refusal in Compressed Models via Mechanistic Interpretability
Vishnu Kabir Chhabra
Mohammad Mahdi Khalili
AI4CE
33
0
0
05 Apr 2025
Understanding Aha Moments: from External Observations to Internal Mechanisms
Shu Yang
Junchao Wu
Xin Chen
Yunze Xiao
Xinyi Yang
Derek F. Wong
Di Wang
LRM
35
2
0
03 Apr 2025
LLM Social Simulations Are a Promising Research Method
Jacy Reese Anthis
Ryan Liu
Sean M. Richardson
Austin C. Kozlowski
Bernard Koch
James A. Evans
Erik Brynjolfsson
Michael S. Bernstein
ALM
59
9
0
03 Apr 2025
How Post-Training Reshapes LLMs: A Mechanistic View on Knowledge, Truthfulness, Refusal, and Confidence
Hongzhe Du
Weikai Li
Min Cai
Karim Saraipour
Zimin Zhang
Himabindu Lakkaraju
Yizhou Sun
Shichang Zhang
KELM
56
0
0
03 Apr 2025
Representation Bending for Large Language Model Safety
Ashkan Yousefpour
Taeheon Kim
Ryan S. Kwon
Seungbeen Lee
Wonje Jeung
Seungju Han
Alvin Wan
Harrison Ngan
Youngjae Yu
Jonghyun Choi
AAML
ALM
KELM
57
1
0
02 Apr 2025
Misaligned Roles, Misplaced Images: Structural Input Perturbations Expose Multimodal Alignment Blind Spots
Erfan Shayegani
G M Shahariar
Sara Abdali
Lei Yu
Nael B. Abu-Ghazaleh
Yue Dong
AAML
78
0
0
01 Apr 2025
The Reasoning-Memorization Interplay in Language Models Is Mediated by a Single Direction
Yihuai Hong
Dian Zhou
Meng Cao
Lei Yu
Zhijing Jin
LRM
48
0
0
29 Mar 2025
Shared Global and Local Geometry of Language Model Embeddings
Andrew Lee
Melanie Weber
F. Viégas
Martin Wattenberg
FedML
79
3
0
27 Mar 2025
Improving Preference Extraction In LLMs By Identifying Latent Knowledge Through Classifying Probes
Sharan Maiya
Yinhong Liu
Ramit Debnath
Anna Korhonen
46
0
0
22 Mar 2025
Towards LLM Guardrails via Sparse Representation Steering
Zeqing He
Peng Kuang
Huiyu Xu
Kui Ren
LLMSV
52
1
0
21 Mar 2025
Calibrating Verbal Uncertainty as a Linear Feature to Reduce Hallucinations
Ziwei Ji
L. Yu
Yeskendir Koishekenov
Yejin Bang
Anthony Hartshorn
Alan Schelten
Cheng Zhang
Pascale Fung
Nicola Cancedda
55
1
0
18 Mar 2025
Inference-Time Intervention in Large Language Models for Reliable Requirement Verification
Paul Darm
James Xie
A. Riccardi
46
0
0
18 Mar 2025
Mitigating Memorization in LLMs using Activation Steering
Manan Suri
Nishit Anand
Amisha Bhaskar
LLMSV
57
2
0
08 Mar 2025
Using Mechanistic Interpretability to Craft Adversarial Attacks against Large Language Models
Thomas Winninger
Boussad Addad
Katarzyna Kapusta
AAML
68
0
0
08 Mar 2025
Personalized Text Generation with Contrastive Activation Steering
Jinghao Zhang
Yi Liu
Wei Wang
Qiang Liu
Shu Wu
Liang Wang
Tat-Seng Chua
LLMSV
43
0
0
07 Mar 2025
Shifting Perspectives: Steering Vector Ensembles for Robust Bias Mitigation in LLMs
Zara Siddique
Irtaza Khalid
Liam D. Turner
Luis Espinosa-Anke
LLMSV
63
1
0
07 Mar 2025
Effectively Steer LLM To Follow Preference via Building Confident Directions
Bingqing Song
Boran Han
Shuai Zhang
Hao Wang
Haoyang Fang
Bonan Min
Yuyang Wang
Mingyi Hong
LLMSV
54
0
0
04 Mar 2025
Sensing and Steering Stereotypes: Extracting and Applying Gender Representation Vectors in LLMs
Hannah Cyberey
Yangfeng Ji
David Evans
LLMSV
82
1
0
27 Feb 2025
Investigating Generalization of One-shot LLM Steering Vectors
Jacob Dunefsky
Arman Cohan
LLMSV
39
0
0
26 Feb 2025
The Geometry of Refusal in Large Language Models: Concept Cones and Representational Independence
Tom Wollschläger
Jannes Elstner
Simon Geisler
Vincent Cohen-Addad
Stephan Günnemann
Johannes Gasteiger
LLMSV
64
0
0
24 Feb 2025
Steering into New Embedding Spaces: Analyzing Cross-Lingual Alignment Induced by Model Interventions in Multilingual Language Models
Anirudh Sundar
Sinead Williamson
Katherine Metcalf
B. Theobald
Skyler Seto
Masha Fedzechkina
LLMSV
80
0
0
24 Feb 2025
Is Free Self-Alignment Possible?
Dyah Adila
Changho Shin
Yijing Zhang
Frederic Sala
MoMe
118
2
0
24 Feb 2025
SafeInt: Shielding Large Language Models from Jailbreak Attacks via Safety-Aware Representation Intervention
Jiaqi Wu
Chen Chen
Chunyan Hou
Xiaojie Yuan
AAML
59
0
0
24 Feb 2025
Representation Engineering for Large-Language Models: Survey and Research Challenges
Lukasz Bartoszcze
Sarthak Munshi
Bryan Sukidi
Jennifer Yen
Zejia Yang
David Williams-King
Linh Le
Kosi Asuzu
Carsten Maple
102
0
0
24 Feb 2025