Unsolved Problems in ML Safety

28 September 2021
Dan Hendrycks
Nicholas Carlini
John Schulman
Jacob Steinhardt

Papers citing "Unsolved Problems in ML Safety"

50 / 55 papers shown
What Is AI Safety? What Do We Want It to Be?
Jacqueline Harding
Cameron Domenico Kirk-Giannini
68
0
0
05 May 2025
Comparing Uncertainty Measurement and Mitigation Methods for Large Language Models: A Systematic Review
Toghrul Abbasli
Kentaroh Toyoda
Yuan Wang
Leon Witt
Muhammad Asif Ali
Yukai Miao
Dan Li
Qingsong Wei
UQCV
92
0
0
25 Apr 2025
Jailbreak Detection in Clinical Training LLMs Using Feature-Based Predictive Models
Tri Nguyen
Lohith Srikanth Pentapalli
Magnus Sieverding
Laurah Turner
Seth Overla
...
Michael Gharib
Matt Kelleher
Michael Shukis
Cameron Pawlik
Kelly Cohen
53
0
0
21 Apr 2025
Adversarial Training of Reward Models
Alexander Bukharin
Haifeng Qian
Shengyang Sun
Adithya Renduchintala
Soumye Singhal
Z. Wang
Oleksii Kuchaiev
Olivier Delalleau
T. Zhao
AAML
32
0
0
08 Apr 2025
Superintelligence Strategy: Expert Version
Dan Hendrycks
Eric Schmidt
Alexandr Wang
59
1
0
07 Mar 2025
The MASK Benchmark: Disentangling Honesty From Accuracy in AI Systems
Richard Ren
Arunim Agarwal
Mantas Mazeika
Cristina Menghini
Robert Vacareanu
...
Matias Geralnik
Adam Khoja
Dean Lee
Summer Yue
Dan Hendrycks
HILM
ALM
88
0
0
05 Mar 2025
A statistically consistent measure of Semantic Variability using Language Models
Yi Liu
71
0
0
01 Feb 2025
Evolution and The Knightian Blindspot of Machine Learning
Joel Lehman
Elliot Meyerson
Tarek El-Gaaly
Kenneth O. Stanley
Tarin Ziyaee
86
1
0
22 Jan 2025
Episodic memory in AI agents poses risks that should be studied and mitigated
Chad DeChant
57
2
0
20 Jan 2025
ForecastBench: A Dynamic Benchmark of AI Forecasting Capabilities
Ezra Karger
Houtan Bastani
Chen Yueh-Han
Zachary Jacobs
Danny Halawi
Fred Zhang
P. Tetlock
33
6
0
30 Sep 2024
Emergence in Multi-Agent Systems: A Safety Perspective
Philipp Altmann
Julian Schonberger
Steffen Illium
Maximilian Zorn
Fabian Ritz
Tom Haider
Simon Burton
Thomas Gabor
38
1
0
08 Aug 2024
Exposing Privacy Gaps: Membership Inference Attack on Preference Data for LLM Alignment
Qizhang Feng
Siva Rajesh Kasa
Santhosh Kumar Kasa
Hyokun Yun
C. Teo
S. Bodapati
84
6
0
08 Jul 2024
Adversaries With Incentives: A Strategic Alternative to Adversarial Robustness
Maayan Ehrenberg
Roy Ganz
Nir Rosenfeld
AAML
45
0
0
17 Jun 2024
CSS: Contrastive Semantic Similarity for Uncertainty Quantification of LLMs
Shuang Ao
Stefan Rueger
Advaith Siddharthan
31
1
0
05 Jun 2024
Preparing for Black Swans: The Antifragility Imperative for Machine Learning
Ming Jin
36
2
0
18 May 2024
Hybrid Convolutional Neural Networks with Reliability Guarantee
Hans Dermot Doran
Suzana Veljanovska
26
2
0
08 May 2024
Knowing What LLMs DO NOT Know: A Simple Yet Effective Self-Detection Method
Yukun Zhao
Lingyong Yan
Weiwei Sun
Guoliang Xing
Chong Meng
Shuaiqiang Wang
Zhicong Cheng
Zhaochun Ren
Dawei Yin
27
35
0
27 Oct 2023
Assessing Robustness via Score-Based Adversarial Image Generation
Marcel Kollovieh
Lukas Gosch
Yan Scholten
Marten Lienen
Leo Schwinn
Stephan Günnemann
DiffM
35
4
0
06 Oct 2023
Impact of architecture on robustness and interpretability of multispectral deep neural networks
Charles Godfrey
Elise Bishoff
Myles Mckay
E. Byler
27
0
0
21 Sep 2023
Uncovering the Hidden Cost of Model Compression
Diganta Misra
Muawiz Chaudhary
Agam Goyal
Bharat Runwal
Pin-Yu Chen
VLM
33
0
0
29 Aug 2023
Deception Abilities Emerged in Large Language Models
Thilo Hagendorff
LLMAG
35
75
0
31 Jul 2023
Towards ethical multimodal systems
Alexis Roger
Esma Aïmeur
Irina Rish
32
3
0
26 Apr 2023
N2G: A Scalable Approach for Quantifying Interpretable Neuron Representations in Large Language Models
Alex Foote
Neel Nanda
Esben Kran
Ioannis Konstas
Fazl Barez
MILM
20
2
0
22 Apr 2023
Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark
Alexander Pan
Chan Jun Shern
Andy Zou
Nathaniel Li
Steven Basart
Thomas Woodside
Jonathan Ng
Hanlin Zhang
Scott Emmons
Dan Hendrycks
26
126
0
06 Apr 2023
MLTEing Models: Negotiating, Evaluating, and Documenting Model and System Qualities
Katherine R. Maffey
Kyle Dotterrer
Jennifer Niemann
Iain J. Cruickshank
Grace A. Lewis
Christian Kastner
22
4
0
03 Mar 2023
Machine Love
Joel Lehman
20
5
0
18 Feb 2023
SplitOut: Out-of-the-Box Training-Hijacking Detection in Split Learning via Outlier Detection
Ege Erdogan
Unat Teksen
Mehmet Salih Celiktenyildiz
Alptekin Kupcu
A. E. Cicek
32
4
0
16 Feb 2023
Second Thoughts are Best: Learning to Re-Align With Human Values from Text Edits
Ruibo Liu
Chenyan Jia
Ge Zhang
Ziyu Zhuang
Tony X. Liu
Soroush Vosoughi
90
34
0
01 Jan 2023
Discovering Latent Knowledge in Language Models Without Supervision
Collin Burns
Haotian Ye
Dan Klein
Jacob Steinhardt
50
322
0
07 Dec 2022
Beyond Mahalanobis-Based Scores for Textual OOD Detection
Pierre Colombo
Eduardo Dadalto Camara Gomes
Guillaume Staerman
Nathan Noiry
Pablo Piantanida
OODD
41
5
0
24 Nov 2022
System Safety Engineering for Social and Ethical ML Risks: A Case Study
Edgar W. Jatho
L. Mailloux
Shalaleh Rismani
Eugene D. Williams
Joshua A. Kroll
23
2
0
08 Nov 2022
Law Informs Code: A Legal Informatics Approach to Aligning Artificial Intelligence with Humans
John J. Nay
ELM
AILaw
86
27
0
14 Sep 2022
The Alignment Problem from a Deep Learning Perspective
Richard Ngo
Lawrence Chan
Sören Mindermann
54
181
0
30 Aug 2022
Exploring the Design of Adaptation Protocols for Improved Generalization and Machine Learning Safety
Puja Trivedi
Danai Koutra
Jayaraman J. Thiagarajan
AAML
20
0
0
26 Jul 2022
Threat Model-Agnostic Adversarial Defense using Diffusion Models
Tsachi Blau
Roy Ganz
Bahjat Kawar
Alex M. Bronstein
Michael Elad
AAML
DiffM
25
26
0
17 Jul 2022
Forecasting Future World Events with Neural Networks
Andy Zou
Tristan Xiao
Ryan Jia
Joe Kwon
Mantas Mazeika
Richard Li
Dawn Song
Jacob Steinhardt
Owain Evans
Dan Hendrycks
22
22
0
30 Jun 2022
Emergent Abilities of Large Language Models
Jason W. Wei
Yi Tay
Rishi Bommasani
Colin Raffel
Barret Zoph
...
Tatsunori Hashimoto
Oriol Vinyals
Percy Liang
J. Dean
W. Fedus
ELM
ReLM
LRM
48
2,333
0
15 Jun 2022
X-Risk Analysis for AI Research
Dan Hendrycks
Mantas Mazeika
27
67
0
13 Jun 2022
Covariance Matrix Adaptation MAP-Annealing
Matthew C. Fontaine
S. Nikolaidis
40
25
0
22 May 2022
Adversarial Training for High-Stakes Reliability
Daniel M. Ziegler
Seraphina Nix
Lawrence Chan
Tim Bauman
Peter Schmidt-Nielsen
...
Noa Nabeshima
Benjamin Weinstein-Raun
D. Haas
Buck Shlegeris
Nate Thomas
AAML
30
59
0
03 May 2022
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Yuntao Bai
Andy Jones
Kamal Ndousse
Amanda Askell
Anna Chen
...
Jack Clark
Sam McCandlish
C. Olah
Benjamin Mann
Jared Kaplan
72
2,308
0
12 Apr 2022
Self-Supervised Losses for One-Class Textual Anomaly Detection
Kimberly T. Mai
Toby O. Davies
Lewis D. Griffin
8
7
0
12 Apr 2022
A General, Evolution-Inspired Reward Function for Social Robotics
Thomas Kingsford
16
0
0
01 Feb 2022
The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models
Alexander Pan
Kush S. Bhatia
Jacob Steinhardt
39
167
0
10 Jan 2022
PixMix: Dreamlike Pictures Comprehensively Improve Safety Measures
Dan Hendrycks
Andy Zou
Mantas Mazeika
Leonard Tang
Bo-wen Li
D. Song
Jacob Steinhardt
UQCV
23
136
0
09 Dec 2021
What Would Jiminy Cricket Do? Towards Agents That Behave Morally
Dan Hendrycks
Mantas Mazeika
Andy Zou
Sahil Patel
Christine Zhu
Jesus Navarro
D. Song
Bo-wen Li
Jacob Steinhardt
14
58
0
25 Oct 2021
Generalized Out-of-Distribution Detection: A Survey
Jingkang Yang
Kaiyang Zhou
Yixuan Li
Ziwei Liu
185
875
0
21 Oct 2021
Out-of-Distribution Dynamics Detection: RL-Relevant Benchmarks and Results
Mohamad H. Danesh
Alan Fern
94
14
0
11 Jul 2021
Boosting Transferability of Targeted Adversarial Examples via Hierarchical Generative Networks
Xiao Yang
Yinpeng Dong
Tianyu Pang
Hang Su
Jun Zhu
AAML
31
38
0
05 Jul 2021
Emerging Properties in Self-Supervised Vision Transformers
Mathilde Caron
Hugo Touvron
Ishan Misra
Hervé Jégou
Julien Mairal
Piotr Bojanowski
Armand Joulin
314
5,773
0
29 Apr 2021