Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities
3 February 2025 · arXiv: 2502.05209
Zora Che, Stephen Casper, Robert Kirk, Anirudh Satheesh, Stewart Slocum, Lev E McKinney, Rohit Gandikota, Aidan Ewart, Domenic Rosati, Zichu Wu, Zikui Cai, Bilal Chughtai, Y. Gal, Furong Huang, Dylan Hadfield-Menell
Tags: MU, AAML, ELM

Papers citing "Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities"

Showing 13 of 63 citing papers.
Can Sensitive Information Be Deleted From LLMs? Objectives for Defending Against Extraction Attacks
Vaidehi Patil, Peter Hase, Joey Tianyi Zhou
Tags: KELM, AAML
119 · 108 · 0 · 29 Sep 2023
Universal and Transferable Adversarial Attacks on Aligned Language Models
Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, Matt Fredrikson
293 · 1,508 · 0 · 27 Jul 2023
Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron, Louis Martin, Kevin R. Stone, Peter Albert, Amjad Almahairi, ..., Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, Thomas Scialom
Tags: AI4MH, ALM
396 · 12,044 · 0 · 18 Jul 2023
Are aligned neural networks adversarially aligned?
Nicholas Carlini, Milad Nasr, Christopher A. Choquette-Choo, Matthew Jagielski, Irena Gao, ..., Pang Wei Koh, Daphne Ippolito, Katherine Lee, Florian Tramèr, Ludwig Schmidt
Tags: AAML
67 · 250 · 0 · 26 Jun 2023
A Simple and Effective Pruning Approach for Large Language Models
Mingjie Sun, Zhuang Liu, Anna Bair, J. Zico Kolter
145 · 437 · 0 · 20 Jun 2023
Model evaluation for extreme risks
Toby Shevlane, Sebastian Farquhar, Ben Garfinkel, Mary Phuong, Jess Whittlestone, ..., Vijay Bolina, Jack Clark, Yoshua Bengio, Paul Christiano, Allan Dafoe
Tags: ELM
110 · 164 · 0 · 24 May 2023
AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models
Wanjun Zhong, Ruixiang Cui, Yiduo Guo, Yaobo Liang, Shuai Lu, Yanlin Wang, Amin Saied, Weizhu Chen, Nan Duan
Tags: ALM, ELM
104 · 548 · 0 · 13 Apr 2023
Continual Learning and Private Unlearning
B. Liu, Qian Liu, Peter Stone
Tags: CLL, MU
78 · 62 · 0 · 24 Mar 2022
LoRA: Low-Rank Adaptation of Large Language Models
J. E. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen
Tags: OffRL, AI4TS, AI4CE, ALM, AIMat
490 · 10,496 · 0 · 17 Jun 2021
Measuring Massive Multitask Language Understanding
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Basel Alomair, Jacob Steinhardt
Tags: ELM, RALM
184 · 4,553 · 0 · 07 Sep 2020
Harnessing the Vulnerability of Latent Layers in Adversarially Trained Models
M. Singh, Abhishek Sinha, Nupur Kumari, Harshitha Machiraju, Balaji Krishnamurthy, V. Balasubramanian
Tags: AAML
46 · 61 · 0 · 13 May 2019
Regularizing deep networks using efficient layerwise adversarial training
S. Sankaranarayanan, Arpit Jain, Rama Chellappa, Ser Nam Lim
Tags: AAML
59 · 97 · 0 · 22 May 2017
Pointer Sentinel Mixture Models
Stephen Merity, Caiming Xiong, James Bradbury, R. Socher
Tags: RALM
338 · 2,898 · 0 · 26 Sep 2016