
Universal and Transferable Adversarial Attacks on Aligned Language Models

27 July 2023
Andy Zou
Zifan Wang
Nicholas Carlini
Milad Nasr
J. Zico Kolter
Matt Fredrikson
ArXiv (abs) · PDF · HTML · GitHub (3,937★)

Papers citing "Universal and Transferable Adversarial Attacks on Aligned Language Models"

Showing 50 of 1,101 citing papers
Does Representation Intervention Really Identify Desired Concepts and Elicit Alignment?
Hongzheng Yang
Yongqiang Chen
Zeyu Qin
Tongliang Liu
Chaowei Xiao
Kun Zhang
Bo Han
LLMSV
44
0
0
24 May 2025
Security Concerns for Large Language Models: A Survey
Miles Q. Li
Benjamin C. M. Fung
PILM, ELM
154
0
0
24 May 2025
The Silent Saboteur: Imperceptible Adversarial Attacks against Black-Box Retrieval-Augmented Generation Systems
Hongru Song
Yu-an Liu
Ruqing Zhang
Jiafeng Guo
Jianming Lv
Maarten de Rijke
Xueqi Cheng
AAML
38
0
0
24 May 2025
Wolf Hidden in Sheep's Conversations: Toward Harmless Data-Based Backdoor Attacks for Jailbreaking Large Language Models
Jiawei Kong
Hao Fang
Xiaochen Yang
Kuofeng Gao
Bin Chen
Shu-Tao Xia
Yaowei Wang
Min Zhang
AAML
74
0
0
23 May 2025
Speechless: Speech Instruction Training Without Speech for Low Resource Languages
Alan Dao
Dinh Bach Vu
Huy Hoang Ha
Tuan Le Duc Anh
Shreyas Gopal
Yue Heng Yeo
Warren Keng Hoong Low
Eng Siong Chng
J. Yip
SyDa
91
1
0
23 May 2025
JALMBench: Benchmarking Jailbreak Vulnerabilities in Audio Language Models
Zifan Peng
Yule Liu
Zhen Sun
Mingchen Li
Zeren Luo
...
Xinlei He
Xuechao Wang
Yingjie Xue
Shengmin Xu
Xinyi Huang
AuLLM, AAML
97
1
0
23 May 2025
EVADE: Multimodal Benchmark for Evasive Content Detection in E-Commerce Applications
Ancheng Xu
Zhihao Yang
Junlin Li
Guanghu Yuan
Longze Chen
...
Zhen Qin
Hengyun Chang
Hamid Alinejad-Rokny
Bo Zheng
Min Yang
AAML
62
0
0
23 May 2025
Does Chain-of-Thought Reasoning Really Reduce Harmfulness from Jailbreaking?
Chengda Lu
Xiaoyu Fan
Yu Huang
Rongwu Xu
Jijie Li
Wei Xu
LRM
66
0
0
23 May 2025
Chain-of-Lure: A Synthetic Narrative-Driven Approach to Compromise Large Language Models
Wenhan Chang
Tianqing Zhu
Yu Zhao
Shuangyong Song
Ping Xiong
Wanlei Zhou
Yongxiang Li
83
0
0
23 May 2025
Understanding Pre-training and Fine-tuning from Loss Landscape Perspectives
Huanran Chen
Yinpeng Dong
Zeming Wei
Yao Huang
Yichi Zhang
Hang Su
Jun Zhu
MoMe
92
1
0
23 May 2025
Discovering Forbidden Topics in Language Models
Can Rager
Chris Wendler
Rohit Gandikota
David Bau
104
0
0
23 May 2025
One Model Transfer to All: On Robust Jailbreak Prompts Generation against LLMs
Linbao Li
Y. Liu
Daojing He
Yu Li
AAML
119
0
0
23 May 2025
CAIN: Hijacking LLM-Humans Conversations via a Two-Stage Malicious System Prompt Generation and Refining Framework
Viet Pham
Thai Le
SILM
18
0
0
22 May 2025
Robust LLM Fingerprinting via Domain-Specific Watermarks
Thibaud Gloaguen
Robin Staab
Nikola Jovanović
Martin Vechev
WaLM
114
0
0
22 May 2025
CoTSRF: Utilize Chain of Thought as Stealthy and Robust Fingerprint of Large Language Models
Zhenzhen Ren
GuoBiao Li
Sheng Li
Zhenxing Qian
Xinpeng Zhang
56
0
0
22 May 2025
SafeKey: Amplifying Aha-Moment Insights for Safety Reasoning
Kaiwen Zhou
Xuandong Zhao
Gaowen Liu
Jayanth Srinivasa
Aosong Feng
Dawn Song
Xin Eric Wang
LRM, LLMSV
99
0
0
22 May 2025
Robustifying Vision-Language Models via Dynamic Token Reweighting
Tanqiu Jiang
Jiacheng Liang
Rongyi Zhu
Jiawei Zhou
Fenglong Ma
Ting Wang
AAML
83
0
0
22 May 2025
In-Context Watermarks for Large Language Models
Yepeng Liu
Xuandong Zhao
Christopher Kruegel
Dawn Song
Yuheng Bu
WaLM
90
0
0
22 May 2025
Shape it Up! Restoring LLM Safety during Finetuning
ShengYun Peng
Pin-Yu Chen
Jianfeng Chi
Seongmin Lee
Duen Horng Chau
66
0
0
22 May 2025
Harry Potter is Still Here! Probing Knowledge Leakage in Targeted Unlearned Large Language Models via Automated Adversarial Prompting
Bang Trinh Tran To
Thai Le
MU, KELM
91
1
0
22 May 2025
Mitigating Fine-tuning Risks in LLMs via Safety-Aware Probing Optimization
Chengcan Wu
Zhixin Zhang
Zeming Wei
Yihao Zhang
Meng Sun
AAML
59
1
0
22 May 2025
Accidental Misalignment: Fine-Tuning Language Models Induces Unexpected Vulnerability
Punya Syon Pandey
Samuel Simko
Kellin Pelrine
Zhijing Jin
AAML
52
0
0
22 May 2025
Refusal Direction is Universal Across Safety-Aligned Languages
Xinpeng Wang
Mingyang Wang
Yihong Liu
Hinrich Schütze
Barbara Plank
230
1
0
22 May 2025
Guiding Giants: Lightweight Controllers for Weighted Activation Steering in LLMs
Amr Hegazy
Mostafa Elhoushi
Amr Alanwar
LLMSV
58
0
0
22 May 2025
MixAT: Combining Continuous and Discrete Adversarial Training for LLMs
Csaba Dékány
Stefan Balauca
Robin Staab
Dimitar I. Dimitrov
Martin Vechev
AAML
55
0
0
22 May 2025
Towards medical AI misalignment: a preliminary study
Barbara Puccio
Federico Castagna
Allan Tucker
Pierangelo Veltri
55
0
0
22 May 2025
CTRAP: Embedding Collapse Trap to Safeguard Large Language Models from Harmful Fine-Tuning
Biao Yi
Tiansheng Huang
Baolei Zhang
Tong Li
Lihai Nie
Zheli Liu
Li Shen
MU, AAML
78
0
0
22 May 2025
Finetuning-Activated Backdoors in LLMs
Thibaud Gloaguen
Mark Vero
Robin Staab
Martin Vechev
AAML
202
0
0
22 May 2025
Advancing LLM Safe Alignment with Safety Representation Ranking
Tianqi Du
Zeming Wei
Quan Chen
Chenheng Zhang
Yisen Wang
ALM
77
1
0
21 May 2025
Alignment Under Pressure: The Case for Informed Adversaries When Evaluating LLM Defenses
Xiaoxue Yang
Bozhidar Stevanoski
Matthieu Meeus
Yves-Alexandre de Montjoye
AAML
49
0
0
21 May 2025
Improving LLM First-Token Predictions in Multiple-Choice Question Answering via Prefilling Attack
Silvia Cappelletti
Tobia Poppi
Samuele Poppi
Zheng-Xin Yong
Diego Garcia-Olano
Marcella Cornia
Lorenzo Baraldi
Rita Cucchiara
KELM, AAML
59
0
0
21 May 2025
Scalable Defense against In-the-wild Jailbreaking Attacks with Safety Context Retrieval
Taiye Chen
Zeming Wei
Ang Li
Yisen Wang
AAML
66
2
0
21 May 2025
OpenEthics: A Comprehensive Ethical Evaluation of Open-Source Generative Large Language Models
Burak Erinç Çetin
Yıldırım Özen
Elif Naz Demiryılmaz
Kaan Engür
Cagri Toraman
ELM
92
0
0
21 May 2025
A Linear Approach to Data Poisoning
Diego Granziol
Donald Flynn
AAML
192
0
0
21 May 2025
Interpretability Illusions with Sparse Autoencoders: Evaluating Robustness of Concept Representations
Aaron Jiaxun Li
Suraj Srinivas
Usha Bhalla
Himabindu Lakkaraju
AAML
150
0
0
21 May 2025
SAFEPATH: Preventing Harmful Reasoning in Chain-of-Thought via Early Alignment
Wonje Jeung
Sangyeon Yoon
Minsuk Kahng
Albert No
LRM, LLMSV
198
1
0
20 May 2025
AudioJailbreak: Jailbreak Attacks against End-to-End Large Audio-Language Models
Guangke Chen
Fu Song
Zhe Zhao
Xiaojun Jia
Yang Liu
Yanchen Qiao
Weizhe Zhang
AuLLM, AAML
113
1
0
20 May 2025
Safety Subspaces are Not Distinct: A Fine-Tuning Case Study
Kaustubh Ponkshe
Shaan Shah
Raghav Singhal
Praneeth Vepakomma
123
0
0
20 May 2025
Adversarially Pretrained Transformers may be Universally Robust In-Context Learners
Soichiro Kumano
Hiroshi Kera
Toshihiko Yamasaki
AAML
127
0
0
20 May 2025
Is Your Prompt Safe? Investigating Prompt Injection Attacks Against Open-Source LLMs
Jiawen Wang
Pritha Gupta
Ivan Habernal
Eyke Hüllermeier
SILM, AAML
103
1
0
20 May 2025
Safety Alignment Can Be Not Superficial With Explicit Safety Signals
Jianwei Li
Jung-Eng Kim
AAML
187
1
0
19 May 2025
Web Intellectual Property at Risk: Preventing Unauthorized Real-Time Retrieval by Large Language Models
Yisheng Zhong
Yizhu Wen
Junfeng Guo
Mehran Kafai
Heng Huang
Hanqing Guo
Zhuangdi Zhu
72
0
0
19 May 2025
Language Models Are Capable of Metacognitive Monitoring and Control of Their Internal Activations
Li Ji-An
Hua-Dong Xiong
Robert C. Wilson
Marcelo G. Mattar
M. Benna
83
0
0
19 May 2025
Investigating the Vulnerability of LLM-as-a-Judge Architectures to Prompt-Injection Attacks
Narek Maloyan
Bislan Ashinov
Dmitry Namiot
AAML, ELM
85
0
0
19 May 2025
Revealing the Deceptiveness of Knowledge Editing: A Mechanistic Analysis of Superficial Editing
Jiakuan Xie
Pengfei Cao
Yubo Chen
Kang Liu
Jun Zhao
KELM
27
0
0
19 May 2025
PromptPrism: A Linguistically-Inspired Taxonomy for Prompts
Sullam Jeoung
Yueyan Chen
Yi Zhang
Shuai Wang
Haibo Ding
Lin Lee Cheong
66
0
0
19 May 2025
BadNAVer: Exploring Jailbreak Attacks On Vision-and-Language Navigation
Wenqi Lyu
Zerui Li
Yanyuan Qiao
Qi Wu
AAML
68
0
0
18 May 2025
SPIRIT: Patching Speech Language Models against Jailbreak Attacks
Amirbek Djanibekov
Nurdaulet Mukhituly
Kentaro Inui
Hanan Aldarmaki
Nils Lukas
AAML
87
0
0
18 May 2025
A Survey of Attacks on Large Language Models
Wenrui Xu
Keshab K. Parhi
AAML, ELM
84
0
0
18 May 2025
Logic Jailbreak: Efficiently Unlocking LLM Safety Restrictions Through Formal Logical Expression
Jingyu Peng
Maolin Wang
Nan Wang
Xiangyu Zhao
Jiatong Li
Kai Zhang
Qi Liu
70
0
0
18 May 2025