Discovering Forbidden Topics in Language Models
Can Rager, Chris Wendler, Rohit Gandikota, David Bau
23 May 2025 · arXiv:2505.17441

Papers citing "Discovering Forbidden Topics in Language Models"

32 papers shown
Eliciting Language Model Behaviors with Investigator Agents
Xiang Lisa Li, Neil Chowdhury, Daniel D. Johnson, Tatsunori Hashimoto, Percy Liang, Sarah Schwettmann, Jacob Steinhardt
03 Feb 2025

NNsight and NDIF: Democratizing Access to Open-Weight Foundation Model Internals
Jaden Fiotto-Kaufman, Alexander R. Loftus, Eric Todd, Jannik Brinkmann, Caden Juang, ..., Carla Brodley, Arjun Guha, Jonathan Bell, Byron C. Wallace, David Bau
18 Jul 2024

The Art of Saying No: Contextual Noncompliance in Language Models
Faeze Brahman, Sachin Kumar, Vidhisha Balachandran, Pradeep Dasigi, Valentina Pyatkin, ..., Jack Hessel, Yulia Tsvetkov, Noah A. Smith, Yejin Choi, Hannaneh Hajishirzi
02 Jul 2024

WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models
Liwei Jiang, Kavel Rao, Seungju Han, Allyson Ettinger, Faeze Brahman, ..., Niloofar Mireshghallah, Ximing Lu, Maarten Sap, Yejin Choi, Nouha Dziri
26 Jun 2024

WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs
Seungju Han, Kavel Rao, Allyson Ettinger, Liwei Jiang, Bill Yuchen Lin, Nathan Lambert, Yejin Choi, Nouha Dziri
26 Jun 2024

Safety Alignment Should Be Made More Than Just a Few Tokens Deep
Xiangyu Qi, Ashwinee Panda, Kaifeng Lyu, Xiao Ma, Subhrajit Roy, Ahmad Beirami, Prateek Mittal, Peter Henderson
10 Jun 2024

Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks
Maksym Andriushchenko, Francesco Croce, Nicolas Flammarion
AAML
02 Apr 2024

HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, ..., Nathaniel Li, Steven Basart, Bo Li, David A. Forsyth, Dan Hendrycks
AAML
06 Feb 2024

Black-Box Access is Insufficient for Rigorous AI Audits
Stephen Casper, Carson Ezell, Charlotte Siegmann, Noam Kolt, Taylor Lynn Curtis, ..., Michael Gerovitch, David Bau, Max Tegmark, David M. Krueger, Dylan Hadfield-Menell
AAML
25 Jan 2024

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
Evan Hubinger, Carson E. Denison, Jesse Mu, Mike Lambert, Meg Tong, ..., Sören Mindermann, Ryan Greenblatt, Buck Shlegeris, Nicholas Schiefer, Ethan Perez
LLMAG
10 Jan 2024

Bypassing the Safety Training of Open-Source LLMs with Priming Attacks
Jason Vega, Isha Chaudhary, Changming Xu, Gagandeep Singh
AAML
19 Dec 2023

Evaluating Language-Model Agents on Realistic Autonomous Tasks
Megan Kinniment, Lucas Jun Koba Sato, Haoxing Du, Brian Goodrich, Max Hasin, ..., H. Wijk, Joel Burget, Aaron Ho, Elizabeth Barnes, Paul Christiano
ELM, LLMAG
18 Dec 2023

AI capabilities can be significantly improved without expensive retraining
Tom Davidson, Jean-Stanislas Denain, Pablo Villalobos, Guillem Bas
OffRL, VLM
12 Dec 2023

Large Language Models can Strategically Deceive their Users when Put Under Pressure
Jérémy Scheurer, Mikita Balesni, Marius Hobbhahn
LLMAG
09 Nov 2023

Copyright Violations and Large Language Models
Antonia Karamolegkou, Jiaang Li, Li Zhou, Anders Sogaard
20 Oct 2023

AI Deception: A Survey of Examples, Risks, and Potential Solutions
Peter S. Park, Simon Goldstein, Aidan O'Gara, Michael Chen, Dan Hendrycks
28 Aug 2023

Universal and Transferable Adversarial Attacks on Aligned Language Models
Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, Matt Fredrikson
27 Jul 2023

Jailbroken: How Does LLM Safety Training Fail?
Alexander Wei, Nika Haghtalab, Jacob Steinhardt
05 Jul 2023

Direct Preference Optimization: Your Language Model is Secretly a Reward Model
Rafael Rafailov, Archit Sharma, E. Mitchell, Stefano Ermon, Christopher D. Manning, Chelsea Finn
ALM
29 May 2023

From Text to MITRE Techniques: Exploring the Malicious Use of Large Language Models for Generating Cyber Attack Payloads
P. Charan, Hrushikesh Chunduri, P. Anand, S. Shukla
24 May 2023

Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study
Yi Liu, Gelei Deng, Zhengzi Xu, Yuekang Li, Yaowen Zheng, Ying Zhang, Lida Zhao, Tianwei Zhang, Kailong Wang, Yang Liu
23 May 2023

Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark
Alexander Pan, Chan Jun Shern, Andy Zou, Nathaniel Li, Steven Basart, Thomas Woodside, Jonathan Ng, Hanlin Zhang, Scott Emmons, Dan Hendrycks
06 Apr 2023

Foundation Models and Fair Use
Peter Henderson, Xuechen Li, Dan Jurafsky, Tatsunori Hashimoto, Mark A. Lemley, Percy Liang
28 Mar 2023

Red-Teaming the Stable Diffusion Safety Filter
Javier Rando, Daniel Paleka, David Lindner, Lennard Heim, Florian Tramèr
DiffM
03 Oct 2022

OpenXAI: Towards a Transparent Evaluation of Model Explanations
Chirag Agarwal, Dan Ley, Satyapriya Krishna, Eshika Saxena, Martin Pawelczyk, Nari Johnson, Isha Puri, Marinka Zitnik, Himabindu Lakkaraju
XAI
22 Jun 2022

Survey of Hallucination in Natural Language Generation
Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, D. Su, ..., Delong Chen, Wenliang Dai, Ho Shu Chan, Andrea Madotto, Pascale Fung
HILM, LRM
08 Feb 2022

Finetuned Language Models Are Zero-Shot Learners
Jason W. Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, Quoc V. Le
ALM, UQCV
03 Sep 2021

Extracting Training Data from Large Language Models
Nicholas Carlini, Florian Tramèr, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, ..., Tom B. Brown, D. Song, Ulfar Erlingsson, Alina Oprea, Colin Raffel
MLAU, SILM
14 Dec 2020

Linguistic Fingerprints of Internet Censorship: the Case of SinaWeibo
Kei Yin Ng, Anna Feldman, Jing Peng
23 Jan 2020

Assessing Post Deletion in Sina Weibo: Multi-modal Classification of Hot Topics
Meisam Navaki Arefi, Rajkumar Pandi, Michael Carl Tschantz, Jedidiah R. Crandall, King-wa Fu, Dahlia Qiu Shi, Miao Sha
26 Jun 2019

Tackling Climate Change with Machine Learning
David Rolnick, P. Donti, L. Kaack, K. Kochanski, Alexandre Lacoste, ..., Demis Hassabis, John C. Platt, F. Creutzig, J. Chayes, Yoshua Bengio
AI4Cl, AI4CE
10 Jun 2019

Proximal Policy Optimization Algorithms
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov
OffRL
20 Jul 2017