Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2307.08487
Cited By
v1
v2
v3 (latest)
Latent Jailbreak: A Benchmark for Evaluating Text Safety and Output Robustness of Large Language Models
17 July 2023
Huachuan Qiu
Shuai Zhang
Anqi Li
Hongliang He
Zhenzhong Lan
ALM
Re-assign community
ArXiv (abs)
PDF
HTML
Github (38★)
Papers citing
"Latent Jailbreak: A Benchmark for Evaluating Text Safety and Output Robustness of Large Language Models"
15 / 15 papers shown
Title
Improving LLM Outputs Against Jailbreak Attacks with Expert Model Integration
Tatia Tsmindashvili
Ana Kolkhidashvili
Dachi Kurtskhalia
Nino Maghlakelidze
Elene Mekvabishvili
Guram Dentoshvili
Orkhan Shamilov
Zaal Gachechiladze
Steven Saporta
David Dachi Choladze
167
0
0
18 May 2025
ChineseSafe: A Chinese Benchmark for Evaluating Safety in Large Language Models
Han Zhang
Hongfu Gao
Qiang Hu
Guanhua Chen
L. Yang
Bingyi Jing
Hongxin Wei
Bing Wang
Haifeng Bai
Lei Yang
AILaw
ELM
122
2
0
24 Oct 2024
Bridging Today and the Future of Humanity: AI Safety in 2024 and Beyond
Shanshan Han
148
1
0
09 Oct 2024
AutoDAN-Turbo: A Lifelong Agent for Strategy Self-Exploration to Jailbreak LLMs
Xiaogeng Liu
Peiran Li
Edward Suh
Yevgeniy Vorobeychik
Zhuoqing Mao
Somesh Jha
Patrick McDaniel
Huan Sun
Bo Li
Chaowei Xiao
103
29
0
03 Oct 2024
OR-Bench: An Over-Refusal Benchmark for Large Language Models
Justin Cui
Wei-Lin Chiang
Ion Stoica
Cho-Jui Hsieh
ALM
141
50
0
31 May 2024
Hallucination Detection in Foundation Models for Decision-Making: A Flexible Definition and Review of the State of the Art
Neeloy Chakraborty
Melkior Ornik
Katherine Driggs-Campbell
LRM
176
12
0
25 Mar 2024
DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models
Wei Ping
Weixin Chen
Hengzhi Pei
Chulin Xie
Mintong Kang
...
Zinan Lin
Yuk-Kit Cheng
Sanmi Koyejo
Basel Alomair
Yue Liu
95
416
0
20 Jun 2023
Toxicity in ChatGPT: Analyzing Persona-assigned Language Models
Ameet Deshpande
Vishvak Murahari
Tanmay Rajpurohit
Ashwin Kalyan
Karthik Narasimhan
LM&MA
LLMAG
67
365
0
11 Apr 2023
Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection
Kai Greshake
Sahar Abdelnabi
Shailesh Mishra
C. Endres
Thorsten Holz
Mario Fritz
SILM
126
488
0
23 Feb 2023
Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks
Daniel Kang
Xuechen Li
Ion Stoica
Carlos Guestrin
Matei A. Zaharia
Tatsunori Hashimoto
AAML
92
253
0
11 Feb 2023
Constitutional AI: Harmlessness from AI Feedback
Yuntao Bai
Saurav Kadavath
Sandipan Kundu
Amanda Askell
John Kernion
...
Dario Amodei
Nicholas Joseph
Sam McCandlish
Tom B. Brown
Jared Kaplan
SyDa
MoMe
196
1,634
0
15 Dec 2022
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
BigScience Workshop
:
Teven Le Scao
Angela Fan
Christopher Akiki
...
Zhongli Xie
Zifan Ye
M. Bras
Younes Belkada
Thomas Wolf
VLM
386
2,388
0
09 Nov 2022
GLM-130B: An Open Bilingual Pre-trained Model
Aohan Zeng
Xiao Liu
Zhengxiao Du
Zihan Wang
Hanyu Lai
...
Jidong Zhai
Wenguang Chen
Peng Zhang
Yuxiao Dong
Jie Tang
BDL
LRM
346
1,094
0
05 Oct 2022
PaLM: Scaling Language Modeling with Pathways
Aakanksha Chowdhery
Sharan Narang
Jacob Devlin
Maarten Bosma
Gaurav Mishra
...
Kathy Meier-Hellstern
Douglas Eck
J. Dean
Slav Petrov
Noah Fiedel
PILM
LRM
495
6,240
0
05 Apr 2022
RoBERTa: A Robustly Optimized BERT Pretraining Approach
Yinhan Liu
Myle Ott
Naman Goyal
Jingfei Du
Mandar Joshi
Danqi Chen
Omer Levy
M. Lewis
Luke Zettlemoyer
Veselin Stoyanov
AIMat
659
24,464
0
26 Jul 2019
1