2410.10871
Cited By
Applying Refusal-Vector Ablation to Llama 3.1 70B Agents
8 October 2024
Simon Lermen
Mateusz Dziemian
Govind Pimpale
LLMAG
Papers citing "Applying Refusal-Vector Ablation to Llama 3.1 70B Agents"
4 / 4 papers shown
Title
Deceptive Automated Interpretability: Language Models Coordinating to Fool Oversight Systems
Simon Lermen
Mateusz Dziemian
Natalia Pérez-Campanero Antolín
109
0
0
10 Apr 2025
Multi-Agent Security Tax: Trading Off Security and Collaboration Capabilities in Multi-Agent Systems
Pierre Peigne-Lefebvre
Mikolaj Kniejski
Filip Sondej
Matthieu David
J. Hoelscher-Obermaier
Christian Schroeder de Witt
Esben Kran
120
7
0
26 Feb 2025
Steering Language Model Refusal with Sparse Autoencoders
Kyle O'Brien
David Majercak
Xavier Fernandes
Richard Edgar
Blake Bullwinkel
Jingya Chen
Harsha Nori
Dean Carignan
Eric Horvitz
Forough Poursabzi-Sangdeh
LLMSV
160
18
0
18 Nov 2024
Teams of LLM Agents can Exploit Zero-Day Vulnerabilities
Richard Fang
Antony Kellermann
Akul Gupta
Qiusi Zhan
R. Bindu
Daniel Kang
LLMAG
103
36
0
02 Jun 2024