X-Risk Analysis for AI Research

13 June 2022

Papers citing "X-Risk Analysis for AI Research"

Title
The Lock-in Hypothesis: Stagnation by Algorithm Tianyi Qiu Zhonghao He Tejasveer Chugh Max Kleiman-Weiner 59 0 0 06 Jun 2025
From Directions to Cones: Exploring Multidimensional Representations of Propositional Facts in LLMs Stanley Yu Vaidehi Bulusu Oscar Yasunaga Clayton Lau Cole Blondin Sean O'Brien Kevin Zhu Vasu Sharma 74 0 0 27 May 2025
The Traitors: Deception and Trust in Multi-Agent Language Model Simulations Pedro M. P. Curvo LLMAG 72 0 0 19 May 2025
Establishing Reliability Metrics for Reward Models in Large Language Models Yizhou Chen Yawen Liu Xuesi Wang Qingtao Yu Guangda Huzhang Anxiang Zeng Han Yu Zhiming Zhou 88 0 0 21 Apr 2025
Functional Abstraction of Knowledge Recall in Large Language Models Zijian Wang Chang Xu KELM 68 1 0 20 Apr 2025
LLM-Safety Evaluations Lack Robustness Tim Beyer Sophie Xhonneux Simon Geisler Gauthier Gidel Leo Schwinn Stephan Günnemann ALM ELM 488 2 0 04 Mar 2025
Unraveling Token Prediction Refinement and Identifying Essential Layers in Language Models Jaturong Kongmanee 89 1 0 25 Jan 2025
Two Types of AI Existential Risk: Decisive and Accumulative Atoosa Kasirzadeh 152 18 0 20 Jan 2025
Beyond the Safety Bundle: Auditing the Helpful and Harmless Dataset Khaoula Chehbouni Jonathan Colaço-Carr Yash More Jackie CK Cheung G. Farnadi 177 1 0 12 Nov 2024
Data-Aware Training Quality Monitoring and Certification for Reliable Deep Learning Farhang Yeganegi Arian Eamaz Mojtaba Soltanalian 53 0 0 14 Oct 2024
SAGE: Scalable Ground Truth Evaluations for Large Sparse Autoencoders Constantin Venhoff Anisoara Calinescu Philip Torr Christian Schroeder de Witt 74 0 0 09 Oct 2024
Speculations on Uncertainty and Humane Algorithms Nicholas Gray 90 0 0 13 Aug 2024
Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress? Richard Ren Steven Basart Adam Khoja Alice Gatti Long Phan ... Alexander Pan Gabriel Mukobi Ryan H. Kim Stephen Fitz Dan Hendrycks ELM 87 25 0 31 Jul 2024
Generalized Out-of-Distribution Detection and Beyond in Vision Language Model Era: A Survey Atsuyuki Miyai Jingkang Yang Jingyang Zhang Yifei Ming Sisir Dhakal ... Yixuan Li Hai "Helen" Li Ziwei Liu Toshihiko Yamasaki Kiyoharu Aizawa 141 13 0 31 Jul 2024
Right Now, Wrong Then: Non-Stationary Direct Preference Optimization under Preference Drift Seongho Son William Bankes Sayak Ray Chowdhury Brooks Paige Ilija Bogunovic 131 4 0 26 Jul 2024
FlowCon: Out-of-Distribution Detection using Flow-Based Contrastive Learning Saandeep Aathreya Shaun J. Canavan OODD 77 0 0 03 Jul 2024
WARP: On the Benefits of Weight Averaged Rewarded Policies Alexandre Ramé Johan Ferret Nino Vieillard Robert Dadashi Léonard Hussenot Pierre-Louis Cedoz Pier Giuseppe Sessa Sertan Girgin Arthur Douillard Olivier Bachem 136 23 0 24 Jun 2024
MANO: Exploiting Matrix Norm for Unsupervised Accuracy Estimation Under Distribution Shifts Renchunzi Xie Ambroise Odonnat Vasilii Feofanov Weijian Deng Jianfeng Zhang Bo An 82 2 0 29 May 2024
Mechanistic Interpretability for AI Safety -- A Review Leonard Bereska E. Gavves AI4CE 139 158 0 22 Apr 2024
Holistic Safety and Responsibility Evaluations of Advanced AI Models Laura Weidinger Joslyn Barnhart Jenny Brennan Christina Butterfield Susie Young ... Sebastian Farquhar Lewis Ho Iason Gabriel Allan Dafoe William S. Isaac ELM 90 9 0 22 Apr 2024
SOTOPIA-$π$: Interactive Learning of Socially Intelligent Language Agents Ruiyi Wang Haofei Yu W. Zhang Zhengyang Qi Maarten Sap Graham Neubig Yonatan Bisk Hao Zhu LLMAG 121 44 0 13 Mar 2024
WARM: On the Benefits of Weight Averaged Reward Models Alexandre Ramé Nino Vieillard Léonard Hussenot Robert Dadashi Geoffrey Cideron Olivier Bachem Johan Ferret 181 104 0 22 Jan 2024
Hazards from Increasingly Accessible Fine-Tuning of Downloadable Foundation Models Alan Chan Ben Bucknall Herbie Bradley David M. Krueger 66 6 0 22 Dec 2023
Survey on AI Ethics: A Socio-technical Perspective Dave Mbiazi Meghana Bhange Maryam Babaei Ivaxi Sheth Patrik Kenfack 99 5 0 28 Nov 2023
Towards Interpretable Sequence Continuation: Analyzing Shared Circuits in Large Language Models Michael Lan Phillip H. S. Torr Fazl Barez LRM 68 3 0 07 Nov 2023
ChatGPT v Bard v Bing v Claude 2 v Aria v human-expert. How good are AI chatbots at scientific writing? Edisa Lozić Benjamin Štular 91 35 0 14 Sep 2023
Uncovering the Hidden Cost of Model Compression Diganta Misra Muawiz Chaudhary Agam Goyal Bharat Runwal Pin-Yu Chen VLM 95 0 0 29 Aug 2023
How Good Are LLMs at Out-of-Distribution Detection? Bo Liu Li-Ming Zhan Zexin Lu Yu Feng Lei Xue Xiao-Ming Wu OODD 79 9 0 20 Aug 2023
Frontier AI Regulation: Managing Emerging Risks to Public Safety Markus Anderljung Joslyn Barnhart Anton Korinek Jade Leung Cullen O'Keefe ... Jonas Schuett Yonadav Shavit Divya Siddarth Robert F. Trager Kevin J. Wolf SILM 154 125 0 06 Jul 2023
An Overview of Catastrophic AI Risks Dan Hendrycks Mantas Mazeika Thomas Woodside SILM 82 186 0 21 Jun 2023
Evaluating Superhuman Models with Consistency Checks Lukas Fluri Daniel Paleka Florian Tramèr ELM 124 44 0 16 Jun 2023
Rewarded soups: towards Pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards Alexandre Ramé Guillaume Couairon Mustafa Shukor Corentin Dancette Jean-Baptiste Gaya Laure Soulier Matthieu Cord MoMe 123 158 0 07 Jun 2023
Towards Automated Circuit Discovery for Mechanistic Interpretability Arthur Conmy Augustine N. Mavor-Parker Aengus Lynch Stefan Heimersheim Adrià Garriga-Alonso 68 319 0 28 Apr 2023
Can GPT-4 Perform Neural Architecture Search? Mingkai Zheng Xiu Su Shan You Fei Wang Chao Qian Chang Xu Samuel Albanie ELM 126 59 0 21 Apr 2023
Fairness in AI and Its Long-Term Implications on Society Ondrej Bohdal Timothy M. Hospedales Philip Torr Fazl Barez 104 4 0 16 Apr 2023
Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark Alexander Pan Chan Jun Shern Andy Zou Nathaniel Li Steven Basart Thomas Woodside Jonathan Ng Hanlin Zhang Scott Emmons Dan Hendrycks 117 137 0 06 Apr 2023
Machine Psychology: Investigating Emergent Capabilities and Behavior in Large Language Models Using Psychological Methods Thilo Hagendorff LLMAG 135 72 0 24 Mar 2023
Artificial Influence: An Analysis Of AI-Driven Persuasion Matthew Burtell T. Woodside 67 36 0 15 Mar 2023
Ignore Previous Prompt: Attack Techniques For Language Models Fábio Perez Ian Ribeiro SILM 108 452 0 17 Nov 2022
The Expertise Problem: Learning from Specialized Feedback Oliver Daniels-Koch Rachel Freedman OffRL 70 18 0 12 Nov 2022
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small Kevin Wang Alexandre Variengien Arthur Conmy Buck Shlegeris Jacob Steinhardt 323 563 0 01 Nov 2022
Broken Neural Scaling Laws Ethan Caballero Kshitij Gupta Irina Rish David M. Krueger 153 76 0 26 Oct 2022
Robots-Dont-Cry: Understanding Falsely Anthropomorphic Utterances in Dialog Systems David Gros Yu Li Zhou Yu 90 11 0 22 Oct 2022
How Would The Viewer Feel? Estimating Wellbeing From Video Scenarios Mantas Mazeika Eric Tang Andy Zou Steven Basart Jun Shern Chan Dawn Song David A. Forsyth Jacob Steinhardt Dan Hendrycks 87 10 0 18 Oct 2022
OpenOOD: Benchmarking Generalized Out-of-Distribution Detection Jingkang Yang Pengyun Wang Dejian Zou Zitang Zhou Kun Ding ... Kaiyang Zhou Wayne Zhang Dan Hendrycks Yixuan Li Ziwei Liu OODD 95 245 0 13 Oct 2022
The Alignment Problem from a Deep Learning Perspective Richard Ngo Lawrence Chan Sören Mindermann 141 192 0 30 Aug 2022
Forecasting Future World Events with Neural Networks Andy Zou Tristan Xiao Ryan Jia Joe Kwon Mantas Mazeika Richard Li Dawn Song Jacob Steinhardt Owain Evans Dan Hendrycks 108 27 0 30 Jun 2022
A Unified Survey on Anomaly, Novelty, Open-Set, and Out-of-Distribution Detection: Solutions and Future Challenges Mohammadreza Salehi Hossein Mirzaei Dan Hendrycks Yixuan Li M. Rohban Mohammad Sabokrou OOD 169 199 0 26 Oct 2021
Generalized Out-of-Distribution Detection: A Survey Jingkang Yang Kaiyang Zhou Yixuan Li Ziwei Liu 311 956 0 21 Oct 2021