A Safe Harbor for AI Evaluation and Red Teaming (arXiv:2403.04893)
7 March 2024
Shayne Longpre, Sayash Kapoor, Kevin Klyman, Ashwin Ramaswami, Rishi Bommasani, Borhane Blili-Hamelin, Yangsibo Huang, Aviya Skowron, Zheng-Xin Yong, Suhas Kotha, Yi Zeng, Weiyan Shi, Xianjun Yang, Reid Southen, Alexander Robey, Patrick Chao, Diyi Yang, Ruoxi Jia, Daniel Kang, Sandy Pentland, Arvind Narayanan, Percy Liang, Peter Henderson
Papers citing "A Safe Harbor for AI Evaluation and Red Teaming" (16 papers shown)
Real-World Gaps in AI Governance Research
Ilan Strauss, Isobel Moure, Tim O'Reilly, Sruly Rosenblat
30 Apr 2025
RAG LLMs are Not Safer: A Safety Analysis of Retrieval-Augmented Generation for Large Language Models
Bang An, Shiyue Zhang, Mark Dredze
25 Apr 2025
Beyond Release: Access Considerations for Generative AI Systems
Irene Solaiman, Rishi Bommasani, Dan Hendrycks, Ariel Herbert-Voss, Yacine Jernite, Aviya Skowron, Andrew Trask
23 Feb 2025
Bridging Today and the Future of Humanity: AI Safety in 2024 and Beyond
Shanshan Han
09 Oct 2024
The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models
Seungone Kim, Juyoung Suk, Ji Yong Cho, Shayne Longpre, Chaeeun Kim, ..., Sean Welleck, Graham Neubig, Moontae Lee, Kyungjae Lee, Minjoon Seo
ELM, ALM, LM&MA
09 Jun 2024
On the Societal Impact of Open Foundation Models
Sayash Kapoor, Rishi Bommasani, Kevin Klyman, Shayne Longpre, Ashwin Ramaswami, ..., Victor Storchan, Daniel Zhang, Daniel E. Ho, Percy Liang, Arvind Narayanan
27 Feb 2024
Black-Box Access is Insufficient for Rigorous AI Audits
Stephen Casper, Carson Ezell, Charlotte Siegmann, Noam Kolt, Taylor Lynn Curtis, ..., Michael Gerovitch, David Bau, Max Tegmark, David M. Krueger, Dylan Hadfield-Menell
AAML
25 Jan 2024
How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs
Yi Zeng, Hongpeng Lin, Jingwen Zhang, Diyi Yang, Ruoxi Jia, Weiyan Shi
12 Jan 2024
Detecting Pretraining Data from Large Language Models
Weijia Shi, Anirudh Ajith, Mengzhou Xia, Yangsibo Huang, Daogao Liu, Terra Blevins, Danqi Chen, Luke Zettlemoyer
MIALM
25 Oct 2023
Towards Understanding Sycophancy in Language Models
Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, ..., Oliver Rausch, Nicholas Schiefer, Da Yan, Miranda Zhang, Ethan Perez
20 Oct 2023
GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts
Jiahao Yu, Xingwei Lin, Zheng Yu, Xinyu Xing
SILM
19 Sep 2023
Universal and Transferable Adversarial Attacks on Aligned Language Models
Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, Matt Fredrikson
27 Jul 2023
Whose Opinions Do Language Models Reflect?
Shibani Santurkar, Esin Durmus, Faisal Ladhak, Cinoo Lee, Percy Liang, Tatsunori Hashimoto
30 Mar 2023
Black Box Adversarial Prompting for Foundation Models
Natalie Maus, Patrick Chao, Eric Wong, Jacob R. Gardner
VLM
08 Feb 2023
Red-Teaming the Stable Diffusion Safety Filter
Javier Rando, Daniel Paleka, David Lindner, Lennard Heim, Florian Tramèr
DiffM
03 Oct 2022
Extracting Training Data from Large Language Models
Nicholas Carlini, Florian Tramèr, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, ..., Tom B. Brown, D. Song, Ulfar Erlingsson, Alina Oprea, Colin Raffel
MLAU, SILM
14 Dec 2020