ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2405.18540
  4. Cited By
Learning diverse attacks on large language models for robust red-teaming and safety tuning

Learning diverse attacks on large language models for robust red-teaming and safety tuning

28 May 2024
Seanie Lee
Minsu Kim
Lynn Cherif
David Dobre
Juho Lee
Sung Ju Hwang
Kenji Kawaguchi
Gauthier Gidel
Yoshua Bengio
Nikolay Malkin
Moksh Jain
    AAML
ArXivPDFHTML

Papers citing "Learning diverse attacks on large language models for robust red-teaming and safety tuning"

17 / 17 papers shown
Title
Process-Supervised LLM Recommenders via Flow-guided Tuning
Process-Supervised LLM Recommenders via Flow-guided Tuning
Chongming Gao
Mengyao Gao
Chenxiao Fan
Shuai Yuan
Wentao Shi
Xiangnan He
76
2
0
10 Mar 2025
Building Safe GenAI Applications: An End-to-End Overview of Red Teaming for Large Language Models
Alberto Purpura
Sahil Wadhwa
Jesse Zymet
Akshay Gupta
Andy Luo
Melissa Kazemi Rad
Swapnil Shinde
Mohammad Sorower
AAML
161
0
0
03 Mar 2025
Learning to Sample Effective and Diverse Prompts for Text-to-Image Generation
Learning to Sample Effective and Diverse Prompts for Text-to-Image Generation
Taeyoung Yun
Dinghuai Zhang
Jinkyoo Park
Ling Pan
DiffM
84
2
0
17 Feb 2025
Can Safety Fine-Tuning Be More Principled? Lessons Learned from Cybersecurity
Can Safety Fine-Tuning Be More Principled? Lessons Learned from Cybersecurity
David Williams-King
Linh Le
Adam Oberman
Yoshua Bengio
AAML
54
0
0
19 Jan 2025
Fast Convergence of $Φ$-Divergence Along the Unadjusted Langevin Algorithm and Proximal Sampler
Fast Convergence of ΦΦΦ-Divergence Along the Unadjusted Langevin Algorithm and Proximal Sampler
Siddharth Mitra
Andre Wibisono
52
23
0
14 Oct 2024
Recent advancements in LLM Red-Teaming: Techniques, Defenses, and
  Ethical Considerations
Recent advancements in LLM Red-Teaming: Techniques, Defenses, and Ethical Considerations
Tarun Raheja
Nilay Pochhi
AAML
46
1
0
09 Oct 2024
Adaptive teachers for amortized samplers
Adaptive teachers for amortized samplers
Minsu Kim
Sanghyeok Choi
Taeyoung Yun
Emmanuel Bengio
Leo Feng
Jarrid Rector-Brooks
Sungsoo Ahn
Jinkyoo Park
Nikolay Malkin
Yoshua Bengio
140
2
0
02 Oct 2024
HarmAug: Effective Data Augmentation for Knowledge Distillation of Safety Guard Models
HarmAug: Effective Data Augmentation for Knowledge Distillation of Safety Guard Models
Seanie Lee
Haebin Seong
Dong Bok Lee
Minki Kang
Xiaoyin Chen
Dominik Wagner
Yoshua Bengio
Juho Lee
Sung Ju Hwang
67
2
0
02 Oct 2024
Exploring Straightforward Conversational Red-Teaming
Exploring Straightforward Conversational Red-Teaming
George Kour
Naama Zwerdling
Marcel Zalmanovici
Ateret Anaby-Tavor
Ora Nova Fandina
E. Farchi
AAML
122
1
0
07 Sep 2024
Probing the Safety Response Boundary of Large Language Models via Unsafe
  Decoding Path Generation
Probing the Safety Response Boundary of Large Language Models via Unsafe Decoding Path Generation
Haoyu Wang
Bingzhe Wu
Yatao Bian
Yongzhe Chang
Xueqian Wang
Peilin Zhao
66
2
0
20 Aug 2024
The Art of Saying No: Contextual Noncompliance in Language Models
The Art of Saying No: Contextual Noncompliance in Language Models
Faeze Brahman
Sachin Kumar
Vidhisha Balachandran
Pradeep Dasigi
Valentina Pyatkin
...
Jack Hessel
Yulia Tsvetkov
Noah A. Smith
Yejin Choi
Hannaneh Hajishirzi
67
20
0
02 Jul 2024
Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts
Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts
Mikayel Samvelyan
Sharath Chandra Raparthy
Andrei Lupu
Eric Hambro
Aram H. Markosyan
...
Minqi Jiang
Jack Parker-Holder
Jakob Foerster
Tim Rocktaschel
Roberta Raileanu
SyDa
70
62
0
26 Feb 2024
QGFN: Controllable Greediness with Action Values
QGFN: Controllable Greediness with Action Values
Elaine Lau
Stephen Zhewen Lu
Ling Pan
Doina Precup
Emmanuel Bengio
111
12
0
07 Feb 2024
Trajectory balance: Improved credit assignment in GFlowNets
Trajectory balance: Improved credit assignment in GFlowNets
Nikolay Malkin
Moksh Jain
Emmanuel Bengio
Chen Sun
Yoshua Bengio
145
166
0
31 Jan 2022
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Jason W. Wei
Xuezhi Wang
Dale Schuurmans
Maarten Bosma
Brian Ichter
F. Xia
Ed H. Chi
Quoc Le
Denny Zhou
LM&Ro
LRM
AI4CE
ReLM
355
8,457
0
28 Jan 2022
Analyzing Dynamic Adversarial Training Data in the Limit
Analyzing Dynamic Adversarial Training Data in the Limit
Eric Wallace
Adina Williams
Robin Jia
Douwe Kiela
186
30
0
16 Oct 2021
Zero-Shot Text-to-Image Generation
Zero-Shot Text-to-Image Generation
Aditya A. Ramesh
Mikhail Pavlov
Gabriel Goh
Scott Gray
Chelsea Voss
Alec Radford
Mark Chen
Ilya Sutskever
VLM
255
4,774
0
24 Feb 2021
1