Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2502.05209
Cited By
v1
v2 (latest)
Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities
3 February 2025
Zora Che
Stephen Casper
Robert Kirk
Anirudh Satheesh
Stewart Slocum
Lev E McKinney
Rohit Gandikota
Aidan Ewart
Domenic Rosati
Zichu Wu
Zikui Cai
Bilal Chughtai
Y. Gal
Furong Huang
Dylan Hadfield-Menell
MU
AAML
ELM
Re-assign community
ArXiv (abs)
PDF
HTML
Papers citing
"Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities"
13 / 63 papers shown
Title
Can Sensitive Information Be Deleted From LLMs? Objectives for Defending Against Extraction Attacks
Vaidehi Patil
Peter Hase
Joey Tianyi Zhou
KELM
AAML
119
108
0
29 Sep 2023
Universal and Transferable Adversarial Attacks on Aligned Language Models
Andy Zou
Zifan Wang
Nicholas Carlini
Milad Nasr
J. Zico Kolter
Matt Fredrikson
293
1,508
0
27 Jul 2023
Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron
Louis Martin
Kevin R. Stone
Peter Albert
Amjad Almahairi
...
Sharan Narang
Aurelien Rodriguez
Robert Stojnic
Sergey Edunov
Thomas Scialom
AI4MH
ALM
396
12,044
0
18 Jul 2023
Are aligned neural networks adversarially aligned?
Nicholas Carlini
Milad Nasr
Christopher A. Choquette-Choo
Matthew Jagielski
Irena Gao
...
Pang Wei Koh
Daphne Ippolito
Katherine Lee
Florian Tramèr
Ludwig Schmidt
AAML
67
250
0
26 Jun 2023
A Simple and Effective Pruning Approach for Large Language Models
Mingjie Sun
Zhuang Liu
Anna Bair
J. Zico Kolter
145
437
0
20 Jun 2023
Model evaluation for extreme risks
Toby Shevlane
Sebastian Farquhar
Ben Garfinkel
Mary Phuong
Jess Whittlestone
...
Vijay Bolina
Jack Clark
Yoshua Bengio
Paul Christiano
Allan Dafoe
ELM
110
164
0
24 May 2023
AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models
Wanjun Zhong
Ruixiang Cui
Yiduo Guo
Yaobo Liang
Shuai Lu
Yanlin Wang
Amin Saied
Weizhu Chen
Nan Duan
ALM
ELM
104
548
0
13 Apr 2023
Continual Learning and Private Unlearning
B. Liu
Qian Liu
Peter Stone
CLL
MU
78
62
0
24 Mar 2022
LoRA: Low-Rank Adaptation of Large Language Models
J. E. Hu
Yelong Shen
Phillip Wallis
Zeyuan Allen-Zhu
Yuanzhi Li
Shean Wang
Lu Wang
Weizhu Chen
OffRL
AI4TS
AI4CE
ALM
AIMat
490
10,496
0
17 Jun 2021
Measuring Massive Multitask Language Understanding
Dan Hendrycks
Collin Burns
Steven Basart
Andy Zou
Mantas Mazeika
Basel Alomair
Jacob Steinhardt
ELM
RALM
184
4,553
0
07 Sep 2020
Harnessing the Vulnerability of Latent Layers in Adversarially Trained Models
M. Singh
Abhishek Sinha
Nupur Kumari
Harshitha Machiraju
Balaji Krishnamurthy
V. Balasubramanian
AAML
46
61
0
13 May 2019
Regularizing deep networks using efficient layerwise adversarial training
S. Sankaranarayanan
Arpit Jain
Rama Chellappa
Ser Nam Lim
AAML
59
97
0
22 May 2017
Pointer Sentinel Mixture Models
Stephen Merity
Caiming Xiong
James Bradbury
R. Socher
RALM
338
2,898
0
26 Sep 2016
Previous
1
2