Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2312.03813
Cited By
Improving Activation Steering in Language Models with Mean-Centring
6 December 2023
Ole Jorgensen
Dylan R. Cope
Nandi Schoots
Murray Shanahan
LLMSV
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Improving Activation Steering in Language Models with Mean-Centring"
8 / 8 papers shown
Title
Towards Understanding Distilled Reasoning Models: A Representational Approach
David D. Baek
Max Tegmark
LRM
80
3
0
05 Mar 2025
Representation Engineering for Large-Language Models: Survey and Research Challenges
Lukasz Bartoszcze
Sarthak Munshi
Bryan Sukidi
Jennifer Yen
Zejia Yang
David Williams-King
Linh Le
Kosi Asuzu
Carsten Maple
102
0
0
24 Feb 2025
SAIF: A Sparse Autoencoder Framework for Interpreting and Steering Instruction Following of Language Models
Z. He
Haiyan Zhao
Yiran Qiao
Fan Yang
Ali Payani
Jing Ma
Mengnan Du
LLMSV
74
5
0
17 Feb 2025
Enhancing Semantic Consistency of Large Language Models through Model Editing: An Interpretability-Oriented Approach
J. Yang
Dapeng Chen
Yajing Sun
Rongjun Li
Zhiyong Feng
Wei Peng
51
5
0
19 Jan 2025
Steering Large Language Models using Conceptors: Improving Addition-Based Activation Engineering
Joris Postmus
Steven Abreu
LLMSV
153
1
0
09 Oct 2024
Beyond Single Concept Vector: Modeling Concept Subspace in LLMs with Gaussian Distribution
Haiyan Zhao
Heng Zhao
Bo Shen
Ali Payani
Fan Yang
Mengnan Du
62
3
0
30 Sep 2024
Programming Refusal with Conditional Activation Steering
Bruce W. Lee
Inkit Padhi
Karthikeyan N. Ramamurthy
Erik Miehling
Pierre Dognin
Manish Nagireddy
Amit Dhurandhar
LLMSV
108
15
0
06 Sep 2024
Adaptive Activation Steering: A Tuning-Free LLM Truthfulness Improvement Method for Diverse Hallucinations Categories
Tianlong Wang
Xianfeng Jiao
Yifan He
Zhongzhi Chen
Yinghao Zhu
Xu Chu
Junyi Gao
Yasha Wang
Liantao Ma
LLMSV
71
8
0
26 May 2024
1