Definitions of intent suitable for algorithms

8 June 2021

Papers citing "Definitions of intent suitable for algorithms"

Evaluating Language Model Character Traits Francis Rhys Ward Zejia Yang Alex Jackson Randy Brown Chandler Smith Grace Colverd Louis Thomson Raymond Douglas Patrik Bartak Andrew Rowan 47 0 0 05 Oct 2024
AI Sandbagging: Language Models can Strategically Underperform on Evaluations Teun van der Weij Felix Hofstätter Ollie Jaffe Samuel F. Brown Francis Rhys Ward ELM 52 22 0 11 Jun 2024
The Reasons that Agents Act: Intention and Instrumental Goals Francis Rhys Ward Matt MacDermott Francesco Belardinelli Francesca Toni Tom Everitt AI4CE 29 12 0 11 Feb 2024
Honesty Is the Best Policy: Defining and Mitigating AI Deception Francis Rhys Ward Francesco Belardinelli Francesca Toni Tom Everitt 112 27 0 03 Dec 2023
SHAPE: A Framework for Evaluating the Ethicality of Influence Elfia Bezou-Vrakatseli Benedikt Brückner Luke Thorburn TDI 34 3 0 08 Sep 2023
Experiments with Detecting and Mitigating AI Deception Ismail Sahbane Francis Rhys Ward Henrik Åslund 23 1 0 26 Jun 2023
Human Control: Definitions and Algorithms Ryan Carey Tom Everitt 32 6 0 31 May 2023
What is Proxy Discrimination? Michael Carl Tschantz 6 18 0 11 May 2022
Extending counterfactual accounts of intent to include oblique intent Hal Ashton 13 3 0 07 Jun 2021