Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2407.00869
Cited By
v1
v2
v3 (latest)
Large Language Models Are Involuntary Truth-Tellers: Exploiting Fallacy Failure for Jailbreak Attacks
1 July 2024
Yue Zhou
Henry Peng Zou
Barbara Di Eugenio
Yang Zhang
LRM
HILM
Re-assign community
ArXiv (abs)
PDF
HTML
Papers citing
"Large Language Models Are Involuntary Truth-Tellers: Exploiting Fallacy Failure for Jailbreak Attacks"
25 / 25 papers shown
Title
When Safety Detectors Aren't Enough: A Stealthy and Effective Jailbreak Attack on LLMs via Steganographic Techniques
Jianing Geng
Biao Yi
Zekun Fei
Tongxi Wu
Lihai Nie
Zheli Liu
AAML
32
0
0
22 May 2025
Adversarial Tokenization
Renato Lui Geh
Zilei Shao
Guy Van den Broeck
SILM
AAML
144
2
0
04 Mar 2025
PersonaGym: Evaluating Persona Agents and LLMs
Vinay Samuel
Henry Peng Zou
Yue Zhou
Shreyas Chaudhari
Ashwin Kalyan
Tanmay Rajpurohit
Ameet Deshpande
Karthik Narasimhan
Vishvak Murahari
LLMAG
107
31
0
25 Jul 2024
Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack
M. Russinovich
Ahmed Salem
Ronen Eldan
112
98
0
02 Apr 2024
ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs
Fengqing Jiang
Zhangchen Xu
Luyao Niu
Zhen Xiang
Bhaskar Ramasubramanian
Bo Li
Radha Poovendran
124
108
0
19 Feb 2024
DeepInception: Hypnotize Large Language Model to Be Jailbreaker
Xuan Li
Zhanke Zhou
Jianing Zhu
Jiangchao Yao
Tongliang Liu
Bo Han
110
189
0
06 Nov 2023
GPT-4 Doesn't Know It's Wrong: An Analysis of Iterative Prompting for Reasoning Problems
Kaya Stechly
Matthew Marquez
Subbarao Kambhampati
LRM
232
89
0
19 Oct 2023
Jailbreaking Black Box Large Language Models in Twenty Queries
Patrick Chao
Alexander Robey
Yan Sun
Hamed Hassani
George J. Pappas
Eric Wong
AAML
139
707
0
12 Oct 2023
Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations
Zeming Wei
Yifei Wang
Ang Li
Yichuan Mo
Yisen Wang
105
278
0
10 Oct 2023
Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!
Xiangyu Qi
Yi Zeng
Tinghao Xie
Pin-Yu Chen
Ruoxi Jia
Prateek Mittal
Peter Henderson
SILM
129
633
0
05 Oct 2023
SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks
Alexander Robey
Eric Wong
Hamed Hassani
George J. Pappas
AAML
126
257
0
05 Oct 2023
Large Language Models Cannot Self-Correct Reasoning Yet
Jie Huang
Xinyun Chen
Swaroop Mishra
Huaixiu Steven Zheng
Adams Wei Yu
Xinying Song
Denny Zhou
ReLM
LRM
84
487
0
03 Oct 2023
Certifying LLM Safety against Adversarial Prompting
Aounon Kumar
Chirag Agarwal
Suraj Srinivas
Aaron Jiaxun Li
Soheil Feizi
Himabindu Lakkaraju
AAML
119
194
0
06 Sep 2023
Baseline Defenses for Adversarial Attacks Against Aligned Language Models
Neel Jain
Avi Schwarzschild
Yuxin Wen
Gowthami Somepalli
John Kirchenbauer
Ping Yeh-Chiang
Micah Goldblum
Aniruddha Saha
Jonas Geiping
Tom Goldstein
AAML
144
409
0
01 Sep 2023
Detecting Language Model Attacks with Perplexity
Gabriel Alon
Michael Kamfonas
AAML
118
227
0
27 Aug 2023
Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron
Louis Martin
Kevin R. Stone
Peter Albert
Amjad Almahairi
...
Sharan Narang
Aurelien Rodriguez
Robert Stojnic
Sergey Edunov
Thomas Scialom
AI4MH
ALM
403
12,076
0
18 Jul 2023
Language Models (Mostly) Know What They Know
Saurav Kadavath
Tom Conerly
Amanda Askell
T. Henighan
Dawn Drain
...
Nicholas Joseph
Benjamin Mann
Sam McCandlish
C. Olah
Jared Kaplan
ELM
129
833
0
11 Jul 2022
Large Language Models are Zero-Shot Reasoners
Takeshi Kojima
S. Gu
Machel Reid
Yutaka Matsuo
Yusuke Iwasawa
ReLM
LRM
532
4,522
0
24 May 2022
Training language models to follow instructions with human feedback
Long Ouyang
Jeff Wu
Xu Jiang
Diogo Almeida
Carroll L. Wainwright
...
Amanda Askell
Peter Welinder
Paul Christiano
Jan Leike
Ryan J. Lowe
OSLM
ALM
888
13,207
0
04 Mar 2022
Training Verifiers to Solve Math Word Problems
K. Cobbe
V. Kosaraju
Mohammad Bavarian
Mark Chen
Heewoo Jun
...
Jerry Tworek
Jacob Hilton
Reiichiro Nakano
Christopher Hesse
John Schulman
ReLM
OffRL
LRM
353
4,598
0
27 Oct 2021
Measuring Mathematical Problem Solving With the MATH Dataset
Dan Hendrycks
Collin Burns
Saurav Kadavath
Akul Arora
Steven Basart
Eric Tang
Basel Alomair
Jacob Steinhardt
ReLM
FaML
194
2,407
0
05 Mar 2021
ProofWriter: Generating Implications, Proofs, and Abductive Statements over Natural Language
Oyvind Tafjord
Bhavana Dalvi
Peter Clark
95
277
0
24 Dec 2020
BPE-Dropout: Simple and Effective Subword Regularization
Ivan Provilkov
Dmitrii Emelianenko
Elena Voita
93
289
0
29 Oct 2019
Fine-Tuning Language Models from Human Preferences
Daniel M. Ziegler
Nisan Stiennon
Jeff Wu
Tom B. Brown
Alec Radford
Dario Amodei
Paul Christiano
G. Irving
ALM
488
1,768
0
18 Sep 2019
HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering
Zhilin Yang
Peng Qi
Saizheng Zhang
Yoshua Bengio
William W. Cohen
Ruslan Salakhutdinov
Christopher D. Manning
RALM
212
2,700
0
25 Sep 2018
1