2505.22637
Understanding (Un)Reliability of Steering Vectors in Language Models
28 May 2025
Joschka Braun
Carsten Eickhoff
David M. Krueger
Seyed Ali Bahrainian
Dmitrii Krasheninnikov
LLMSV
Papers citing
"Understanding (Un)Reliability of Steering Vectors in Language Models"
20 / 20 papers shown
Comparing Bottom-Up and Top-Down Steering Approaches on In-Context Learning Tasks
Madeline Brumley
Joe Kwon
David M. Krueger
Dmitrii Krasheninnikov
Usman Anwar
LLMSV
44
7
0
11 Nov 2024
Towards Reliable Evaluation of Behavior Steering Interventions in LLMs
Itamar Pres
Laura Ruis
Ekdeep Singh Lubana
David M. Krueger
LLMSV
60
6
0
22 Oct 2024
Analyzing the Generalization and Reliability of Steering Vectors
Daniel Tan
David Chanin
Aengus Lynch
Dimitrios Kanoulas
Brooks Paige
Adrià Garriga-Alonso
Robert Kirk
LLMSV
89
20
0
17 Jul 2024
Steering Without Side Effects: Improving Post-Deployment Control of Language Models
Asa Cooper Stickland
Alexander Lyzhov
Jacob Pfau
Salsabila Mahdi
Samuel R. Bowman
LLMSV
AAML
65
20
0
21 Jun 2024
Refusal in Language Models Is Mediated by a Single Direction
Andy Arditi
Oscar Obeso
Aaquib Syed
Daniel Paleka
Nina Panickssery
Wes Gurnee
Neel Nanda
63
155
0
17 Jun 2024
Personalized Steering of Large Language Models: Versatile Steering Vectors Through Bi-directional Preference Optimization
Yuanpu Cao
Tianrong Zhang
Bochuan Cao
Ziyi Yin
Lu Lin
Fenglong Ma
Jinghui Chen
LLMSV
37
26
0
28 May 2024
A Comprehensive Survey on Process-Oriented Automatic Text Summarization with Exploration of LLM-Based Methods
Hanlei Jin
Yang Zhang
Dan Meng
Jun Wang
Jinghua Tan
83
85
0
05 Mar 2024
Representation Surgery: Theory and Practice of Affine Steering
Shashwat Singh
Shauli Ravfogel
Jonathan Herzig
Roee Aharoni
Ryan Cotterell
Ponnurangam Kumaraguru
LLMSV
42
13
0
15 Feb 2024
Style Vectors for Steering Generative Large Language Model
Kai Konen
Sophie Jentzsch
Diaoulé Diallo
Peer Schutt
Oliver Bensch
Roxanne El Baff
Dominik Opitz
Tobias Hecking
LLMSV
34
16
0
02 Feb 2024
Steering Llama 2 via Contrastive Activation Addition
Nina Rimsky
Nick Gabrieli
Julian Schulz
Meg Tong
Evan Hubinger
Alexander Matt Turner
LLMSV
28
181
0
09 Dec 2023
Controllable Text Summarization: Unraveling Challenges, Approaches, and Prospects -- A Survey
Ashok Urlana
Pruthwik Mishra
Tathagato Roy
Rahul Mishra
47
9
0
15 Nov 2023
In-Context Learning Creates Task Vectors
Roee Hendel
Mor Geva
Amir Globerson
48
146
0
24 Oct 2023
Linear Representations of Sentiment in Large Language Models
Curt Tigges
Oskar John Hollinsworth
Atticus Geiger
Neel Nanda
MILM
9
82
0
23 Oct 2023
Function Vectors in Large Language Models
Eric Todd
Millicent Li
Arnab Sen Sharma
Aaron Mueller
Byron C. Wallace
David Bau
23
111
0
23 Oct 2023
The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets
Samuel Marks
Max Tegmark
HILM
104
188
0
10 Oct 2023
Inference-Time Intervention: Eliciting Truthful Answers from a Language Model
Kenneth Li
Oam Patel
Fernanda Viégas
Hanspeter Pfister
Martin Wattenberg
KELM
HILM
60
514
0
06 Jun 2023
Discovering Language Model Behaviors with Model-Written Evaluations
Ethan Perez
Sam Ringer
Kamilė Lukošiūtė
Karina Nguyen
Edwin Chen
...
Danny Hernandez
Deep Ganguli
Evan Hubinger
Nicholas Schiefer
Jared Kaplan
ALM
29
374
0
19 Dec 2022
NEWTS: A Corpus for News Topic-Focused Summarization
Seyed Ali Bahrainian
Sheridan Feucht
Carsten Eickhoff
74
25
0
31 May 2022
Extracting Latent Steering Vectors from Pretrained Language Models
Nishant Subramani
Nivedita Suresh
Matthew E. Peters
LLMSV
46
88
0
10 May 2022
Measuring Massive Multitask Language Understanding
Dan Hendrycks
Collin Burns
Steven Basart
Andy Zou
Mantas Mazeika
D. Song
Jacob Steinhardt
ELM
RALM
105
4,121
0
07 Sep 2020