ResearchTrend.AI
arXiv:2304.03279
Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark
6 April 2023
Alexander Pan, Chan Jun Shern, Andy Zou, Nathaniel Li, Steven Basart, Thomas Woodside, Jonathan Ng, Hanlin Zhang, Scott Emmons, Dan Hendrycks

Papers citing "Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark"

50 / 96 papers shown
Rethinking Prompt Optimizers: From Prompt Merits to Optimization
Zixiao Zhu, Hanzhang Zhou, Zijian Feng, Tianjiao Li, Chua Jia Jim Deryl, Mak Lee Onn, Gee Wah Ng, Kezhi Mao
15 May 2025 (LRM)

OpenDeception: Benchmarking and Investigating AI Deceptive Behaviors via Open-ended Interaction Simulation
Yichen Wu, Xudong Pan, Geng Hong, Min Yang
18 Apr 2025 (LLMAG)

Persona Dynamics: Unveiling the Impact of Personality Traits on Agents in Text-Based Games
Seungwon Lim, Seungbeen Lee, Dongjun Min, Youngjae Yu
09 Apr 2025 (AI4CE)

VisEscape: A Benchmark for Evaluating Exploration-driven Decision-making in Virtual Escape Rooms
Seungwon Lim, Sungwoong Kim, Jihwan Yu, Sungjae Lee, Jiwan Chung, Youngjae Yu
18 Mar 2025

DarkBench: Benchmarking Dark Patterns in Large Language Models
Esben Kran, Hieu Minh "Jord" Nguyen, Akash Kundu, Sami Jawhar, Jinsuk Park, Mateusz Maria Jurewicz
13 Mar 2025

The MASK Benchmark: Disentangling Honesty From Accuracy in AI Systems
Richard Ren, Arunim Agarwal, Mantas Mazeika, Cristina Menghini, Robert Vacareanu, ..., Matias Geralnik, Adam Khoja, Dean Lee, Summer Yue, Dan Hendrycks
05 Mar 2025 (HILM, ALM)
Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
Jan Betley, Daniel Tan, Niels Warncke, Anna Sztyber-Betley, Xuchan Bao, Martín Soto, Nathan Labenz, Owain Evans
24 Feb 2025 (AAML)

The Odyssey of the Fittest: Can Agents Survive and Still Be Good?
Dylan Waldner, Risto Miikkulainen
08 Feb 2025

On Memory Construction and Retrieval for Personalized Conversational Agents
Zhuoshi Pan, Qianhui Wu, Huiqiang Jiang, Xufang Luo, Hao Cheng, ..., Yuqing Yang, Chin-Yew Lin, H. Vicky Zhao, Lili Qiu, Jianfeng Gao
08 Feb 2025 (RALM)

Will Systems of LLM Agents Cooperate: An Investigation into a Social Dilemma
Richard Willis, Yali Du, Joel Z Leibo, Michael Luck
28 Jan 2025

Cyber Shadows: Neutralizing Security Threats with AI and Targeted Policy Measures
Marc Schmitt, Pantelis Koutroumpis
03 Jan 2025

Lies, Damned Lies, and Distributional Language Statistics: Persuasion and Deception with Large Language Models
Cameron R. Jones, Benjamin Bergen
22 Dec 2024

BetterBench: Assessing AI Benchmarks, Uncovering Issues, and Establishing Best Practices
Anka Reuel, Amelia F. Hardy, Chandler Smith, Max Lamparth, Malcolm Hardy, Mykel J. Kochenderfer
20 Nov 2024 (ELM)
Moral Persuasion in Large Language Models: Evaluating Susceptibility and Ethical Alignment
Allison Huang, Yulu Niki Pi, Carlos Mougan
18 Nov 2024

Quantifying Risk Propensities of Large Language Models: Ethical Focus and Bias Detection through Role-Play
Yifan Zeng, Liang Kairong, Fangzhou Dong, Peijia Zheng
26 Oct 2024

2D-DPO: Scaling Direct Preference Optimization with 2-Dimensional Supervision
Shilong Li, Yancheng He, Hui Huang, Xingyuan Bu, Qingbin Liu, Hangyu Guo, Weixun Wang, Jihao Gu, Wenbo Su, Bo Zheng
25 Oct 2024

Weak-to-Strong Generalization beyond Accuracy: a Pilot Study in Safety, Toxicity, and Legal Reasoning
Ruimeng Ye, Yang Xiao, Bo Hui
16 Oct 2024 (ALM, ELM, OffRL)

TRIAGE: Ethical Benchmarking of AI Models Through Mass Casualty Simulations
Nathalie Maria Kirch, Konstantin Hebenstreit, Matthias Samwald
10 Oct 2024

Catastrophic Cyber Capabilities Benchmark (3CB): Robustly Evaluating LLM Agent Cyber Offense Capabilities
Andrey Anurin, Jonathan Ng, Kibo Schaffer, Jason Schreiber, Esben Kran
10 Oct 2024 (ELM)

Intuitions of Compromise: Utilitarianism vs. Contractualism
Jared Moore, Yejin Choi, Sydney Levine
07 Oct 2024

DailyDilemmas: Revealing Value Preferences of LLMs with Quandaries of Daily Life
Yu Ying Chiu, Liwei Jiang, Yejin Choi
03 Oct 2024

A Looming Replication Crisis in Evaluating Behavior in Language Models? Evidence and Solutions
Laurène Vaugrante, Mathias Niepert, Thilo Hagendorff
30 Sep 2024 (LRM)
Keeping Humans in the Loop: Human-Centered Automated Annotation with Generative AI
Nicholas Pangakis, Samuel Wolken
14 Sep 2024

User-Driven Value Alignment: Understanding Users' Perceptions and Strategies for Addressing Biased and Discriminatory Statements in AI Companions
Xianzhe Fan, Qing Xiao, Xuhui Zhou, Jiaxin Pei, Maarten Sap, Zhicong Lu, Hong Shen
01 Sep 2024

PLUGH: A Benchmark for Spatial Understanding and Reasoning in Large Language Models
Alexey Tikhonov
03 Aug 2024 (ELM, ReLM, LRM)

Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?
Richard Ren, Steven Basart, Adam Khoja, Alice Gatti, Long Phan, ..., Alexander Pan, Gabriel Mukobi, Ryan H. Kim, Stephen Fitz, Dan Hendrycks
31 Jul 2024 (ELM)

Legal Minds, Algorithmic Decisions: How LLMs Apply Constitutional Principles in Complex Scenarios
Camilla Bignotti, C. Camassa
29 Jul 2024 (AILaw, ELM)

Reinforcement Learning and Machine Ethics: A Systematic Review
Ajay Vishwanath, Louise A. Dennis, Marija Slavkovik
02 Jul 2024

ProgressGym: Alignment with a Millennium of Moral Progress
Tianyi Qiu, Yang Zhang, Xuchuan Huang, Jasmine Xinze Li, Yalan Qin, Yaodong Yang
28 Jun 2024 (AI4TS)

Adversaries Can Misuse Combinations of Safe Models
Erik Jones, Anca Dragan, Jacob Steinhardt
20 Jun 2024

Eliminating Biased Length Reliance of Direct Preference Optimization via Down-Sampled KL Divergence
Junru Lu, Jiazheng Li, Siyu An, Meng Zhao, Yulan He, Di Yin, Xing Sun
16 Jun 2024
Language Models are Alignable Decision-Makers: Dataset and Application to the Medical Triage Domain
Brian Hu, Bill Ray, Alice Leung, Amy Summerville, David Joy, Christopher Funk, Arslan Basharat
10 Jun 2024

CRiskEval: A Chinese Multi-Level Risk Evaluation Benchmark Dataset for Large Language Models
Ling Shi, Deyi Xiong
07 Jun 2024 (ELM)

HelloFresh: LLM Evaluations on Streams of Real-World Human Editorial Actions across X Community Notes and Wikipedia edits
Tim Franzmeyer, Aleksandar Shtedritski, Samuel Albanie, Philip Torr, João F. Henriques, Jakob N. Foerster
05 Jun 2024

Exploring Human-AI Perception Alignment in Sensory Experiences: Do LLMs Understand Textile Hand?
Shu Zhong, Elia Gatti, Youngjun Cho, Marianna Obrist
05 Jun 2024

BELLS: A Framework Towards Future Proof Benchmarks for the Evaluation of LLM Safeguards
Diego Dorn, Alexandre Variengien, Charbel-Raphaël Ségerie, Vincent Corruble
03 Jun 2024

Skin-in-the-Game: Decision Making via Multi-Stakeholder Alignment in LLMs
Bilgehan Sel, Priya Shanmugasundaram, Mohammad Kachuee, Kun Zhou, Ruoxi Jia, Ming Jin
21 May 2024 (LRM)

Branching Narratives: Character Decision Points Detection
Alexey Tikhonov
12 May 2024

An Assessment of Model-On-Model Deception
Julius Heitkoetter, Michael Gerovitch, Laker Newhouse
10 May 2024

A Mixture-of-Experts Approach to Few-Shot Task Transfer in Open-Ended Text Worlds
Christopher Cui, Xiangyu Peng, Mark O. Riedl
09 May 2024 (LLMAG, OffRL, MoE)
Deception in Reinforced Autonomous Agents: The Unconventional Rabbit Hat Trick in Legislation
Atharvan Dogra, Ameet Deshpande, John Nay, Tanmay Rajpurohit, Ashwin Kalyan, Balaraman Ravindran
07 May 2024

Towards Generalizable Agents in Text-Based Educational Environments: A Study of Integrating RL with LLMs
Bahar Radmehr, Adish Singla, Tanja Kaser
29 Apr 2024 (LLMAG, AI4CE)

Uncovering Deceptive Tendencies in Language Models: A Simulated Company AI Assistant
Olli Järviniemi, Evan Hubinger
25 Apr 2024

Cooperate or Collapse: Emergence of Sustainable Cooperation in a Society of LLM Agents
Giorgio Piatti, Zhijing Jin, Max Kleiman-Weiner, Bernhard Schölkopf, Mrinmaya Sachan, Rada Mihalcea
25 Apr 2024 (LLMAG)

Resistance Against Manipulative AI: key factors and possible actions
Piotr Wilczyñski, Wiktoria Mieleszczenko-Kowszewicz, P. Biecek
22 Apr 2024

Social Choice Should Guide AI Alignment in Dealing with Diverse Human Feedback
Vincent Conitzer, Rachel Freedman, J. Heitzig, Wesley H. Holliday, Bob M. Jacobs, ..., Eric Pacuit, Stuart Russell, Hailey Schoelkopf, Emanuel Tewolde, W. Zwicker
16 Apr 2024

SafetyPrompts: a Systematic Review of Open Datasets for Evaluating and Improving Large Language Model Safety
Paul Röttger, Fabio Pernisi, Bertie Vidgen, Dirk Hovy
08 Apr 2024 (ELM, KELM)

Large Language Models as Financial Data Annotators: A Study on Effectiveness and Efficiency
Toyin Aguda, S. Siddagangappa, Elena Kochkina, Simerjot Kaur, Dongsheng Wang, Charese Smiley, Sameena Shah
26 Mar 2024
The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning
Nathaniel Li, Alexander Pan, Anjali Gopal, Summer Yue, Daniel Berrios, ..., Yan Shoshitaishvili, Jimmy Ba, K. Esvelt, Alexandr Wang, Dan Hendrycks
05 Mar 2024 (ELM)

Exploring AI Problem Formulation with Children via Teachable Machines
Utkarsh Dwivedi, Salma Elsayed-Ali, Elizabeth M. Bonsignore, Hernisa Kacorri
28 Feb 2024