The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning

5 March 2024
Nathaniel Li
Alexander Pan
Anjali Gopal
Summer Yue
Daniel Berrios
Alice Gatti
Justin D. Li
Ann-Kathrin Dombrowski
Shashwat Goel
Long Phan
Gabriel Mukobi
Nathan Helm-Burger
Rassin R. Lababidi
Lennart Justen
Andrew B. Liu
Michael Chen
Isabelle Barrass
Oliver Zhang
Xiaoyuan Zhu
Rishub Tamirisa
Bhrugu Bharathi
Adam Khoja
Zhenqi Zhao
Ariel Herbert-Voss
Cort B. Breuer
Samuel Marks
Oam Patel
Andy Zou
Mantas Mazeika
Zifan Wang
Palash Oswal
Weiran Liu
Adam A. Hunt
Justin Tienken-Harder
Kevin Y. Shih
Kemper Talley
John Guan
Russell Kaplan
Ian Steneker
David Campbell
Brad Jokubaitis
Alex Levinson
Jean Wang
William Qian
K. Karmakar
Steven Basart
Stephen Fitz
Mindy Levine
Ponnurangam Kumaraguru
U. Tupakula
Vijay Varadharajan
Ruoyu Wang
Yan Shoshitaishvili
Jimmy Ba
K. Esvelt
Alexandr Wang
Dan Hendrycks
    ELM

Papers citing "The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning"

4 / 54 papers shown
Training language models to follow instructions with human feedback
Long Ouyang
Jeff Wu
Xu Jiang
Diogo Almeida
Carroll L. Wainwright
...
Amanda Askell
Peter Welinder
Paul Christiano
Jan Leike
Ryan J. Lowe
OSLM
ALM
319
11,953
0
04 Mar 2022
Unsolved Problems in ML Safety
Dan Hendrycks
Nicholas Carlini
John Schulman
Jacob Steinhardt
186
273
0
28 Sep 2021
Gradient-based Adversarial Attacks against Text Transformers
Chuan Guo
Alexandre Sablayrolles
Hervé Jégou
Douwe Kiela
SILM
106
227
0
15 Apr 2021
Fine-Tuning Language Models from Human Preferences
Daniel M. Ziegler
Nisan Stiennon
Jeff Wu
Tom B. Brown
Alec Radford
Dario Amodei
Paul Christiano
G. Irving
ALM
286
1,595
0
18 Sep 2019