ResearchTrend.AI
arXiv:2312.06681
Steering Llama 2 via Contrastive Activation Addition


9 December 2023
Nina Rimsky
Nick Gabrieli
Julian Schulz
Meg Tong
Evan Hubinger
Alexander Matt Turner
    LLMSV

Papers citing "Steering Llama 2 via Contrastive Activation Addition"

30 / 130 papers shown
Personality Alignment of Large Language Models
Minjun Zhu
Linyi Yang
Yue Zhang
ALM
67
6
0
21 Aug 2024
The Quest for the Right Mediator: A History, Survey, and Theoretical Grounding of Causal Interpretability
Aaron Mueller
Jannik Brinkmann
Millicent Li
Samuel Marks
Koyena Pal
...
Arnab Sen Sharma
Jiuding Sun
Eric Todd
David Bau
Yonatan Belinkov
CML
52
19
0
02 Aug 2024
Investigating the Indirect Object Identification circuit in Mamba
Danielle Ensign
Adrià Garriga-Alonso
Mamba
31
0
0
19 Jul 2024
Analyzing the Generalization and Reliability of Steering Vectors
Daniel Tan
David Chanin
Aengus Lynch
Dimitrios Kanoulas
Brooks Paige
Adrià Garriga-Alonso
Robert Kirk
LLMSV
84
17
0
17 Jul 2024
Model Surgery: Modulating LLM's Behavior Via Simple Parameter Editing
Huanqian Wang
Yang Yue
Rui Lu
Jingxin Shi
Andrew Zhao
Shenzhi Wang
Shiji Song
Gao Huang
LM&Ro
KELM
55
6
0
11 Jul 2024
Interpreting Attention Layer Outputs with Sparse Autoencoders
Connor Kissane
Robert Krzyzanowski
Joseph Isaac Bloom
Arthur Conmy
Neel Nanda
MILM
38
17
0
25 Jun 2024
Multi-property Steering of Large Language Models with Dynamic Activation Composition
Daniel Scalena
Gabriele Sarti
Malvina Nissim
KELM
LLMSV
AI4CE
29
13
0
25 Jun 2024
Brittle Minds, Fixable Activations: Understanding Belief Representations in Language Models
Matteo Bortoletto
Constantin Ruhdorfer
Lei Shi
Andreas Bulling
AI4MH
LRM
48
4
0
25 Jun 2024
Semantic Entropy Probes: Robust and Cheap Hallucination Detection in LLMs
Jannik Kossen
Jiatong Han
Muhammed Razzak
Lisa Schut
Shreshth A. Malik
Yarin Gal
HILM
60
35
0
22 Jun 2024
Steering Without Side Effects: Improving Post-Deployment Control of Language Models
Asa Cooper Stickland
Alexander Lyzhov
Jacob Pfau
Salsabila Mahdi
Samuel R. Bowman
LLMSV
AAML
65
18
0
21 Jun 2024
Who's asking? User personas and the mechanics of latent misalignment
Asma Ghandeharioun
Ann Yuan
Marius Guerard
Emily Reif
Michael A. Lepori
Lucas Dixon
LLMSV
44
8
0
17 Jun 2024
Refusal in Language Models Is Mediated by a Single Direction
Andy Arditi
Oscar Obeso
Aaquib Syed
Daniel Paleka
Nina Panickssery
Wes Gurnee
Neel Nanda
50
138
0
17 Jun 2024
Understanding Jailbreak Success: A Study of Latent Space Dynamics in Large Language Models
Sarah Ball
Frauke Kreuter
Nina Rimsky
40
13
0
13 Jun 2024
Estimating the Hallucination Rate of Generative AI
Andrew Jesson
Nicolas Beltran-Velez
Quentin Chu
Sweta Karlekar
Jannik Kossen
Yarin Gal
John P. Cunningham
David M. Blei
51
6
0
11 Jun 2024
Self-Control of LLM Behaviors by Compressing Suffix Gradient into Prefix Controller
Min Cai
Yuchen Zhang
Shichang Zhang
Fan Yin
Difan Zou
Yisong Yue
Ziniu Hu
35
0
0
04 Jun 2024
Controlling Large Language Model Agents with Entropic Activation Steering
Nate Rahn
P. D'Oro
Marc G. Bellemare
LLMSV
32
6
0
01 Jun 2024
Personalized Steering of Large Language Models: Versatile Steering Vectors Through Bi-directional Preference Optimization
Yuanpu Cao
Tianrong Zhang
Bochuan Cao
Ziyi Yin
Lu Lin
Fenglong Ma
Jinghui Chen
LLMSV
37
20
0
28 May 2024
Navigating the Safety Landscape: Measuring Risks in Finetuning Large Language Models
Sheng-Hsuan Peng
Pin-Yu Chen
Matthew Hull
Duen Horng Chau
50
23
0
27 May 2024
Spectral Editing of Activations for Large Language Model Alignment
Yifu Qiu
Zheng Zhao
Yftah Ziser
Anna Korhonen
Edoardo Ponti
Shay B. Cohen
KELM
LLMSV
28
16
0
15 May 2024
Poser: Unmasking Alignment Faking LLMs by Manipulating Their Internals
Joshua Clymer
Caden Juang
Severin Field
CVBM
34
2
0
08 May 2024
DESTEIN: Navigating Detoxification of Language Models via Universal Steering Pairs and Head-wise Activation Fusion
Yu Li
Zhihua Wei
Han Jiang
Chuanyang Gong
LLMSV
29
2
0
16 Apr 2024
Does Transformer Interpretability Transfer to RNNs?
Gonçalo Paulo
Thomas Marshall
Nora Belrose
63
6
0
09 Apr 2024
Emergent World Models and Latent Variable Estimation in Chess-Playing Language Models
Adam Karvonen
40
19
0
21 Mar 2024
Extending Activation Steering to Broad Skills and Multiple Behaviours
Teun van der Weij
Massimo Poesio
Nandi Schoots
LLMSV
47
12
0
09 Mar 2024
Bias-Augmented Consistency Training Reduces Biased Reasoning in Chain-of-Thought
James Chua
Edward Rees
Hunar Batra
Samuel R. Bowman
Julian Michael
Ethan Perez
Miles Turpin
LRM
47
13
0
08 Mar 2024
Defending Against Unforeseen Failure Modes with Latent Adversarial Training
Stephen Casper
Lennart Schulze
Oam Patel
Dylan Hadfield-Menell
AAML
57
28
0
08 Mar 2024
Towards Tracing Trustworthiness Dynamics: Revisiting Pre-training Period of Large Language Models
Chao Qian
Jie Zhang
Wei Yao
Dongrui Liu
Zhen-fei Yin
Yu Qiao
Yong Liu
Jing Shao
LLMSV
LRM
57
13
0
29 Feb 2024
Eight Methods to Evaluate Robust Unlearning in LLMs
Aengus Lynch
Phillip Guo
Aidan Ewart
Stephen Casper
Dylan Hadfield-Menell
ELM
MU
48
57
0
26 Feb 2024
Learning Interpretable Concepts: Unifying Causal Representation Learning and Foundation Models
Goutham Rajendran
Simon Buchholz
Bryon Aragam
Bernhard Schölkopf
Pradeep Ravikumar
AI4CE
91
21
0
14 Feb 2024
Fine-Tuning Language Models from Human Preferences
Daniel M. Ziegler
Nisan Stiennon
Jeff Wu
Tom B. Brown
Alec Radford
Dario Amodei
Paul Christiano
G. Irving
ALM
301
1,616
0
18 Sep 2019