ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2311.08576
  4. Cited By
Towards Evaluating AI Systems for Moral Status Using Self-Reports

Towards Evaluating AI Systems for Moral Status Using Self-Reports

14 November 2023
Ethan Perez
Robert Long
    ELM
ArXiv (abs)PDFHTML

Papers citing "Towards Evaluating AI Systems for Moral Status Using Self-Reports"

30 / 30 papers shown
Title
Self-Interpretability: LLMs Can Describe Complex Internal Processes that Drive Their Decisions, and Improve with Training
Self-Interpretability: LLMs Can Describe Complex Internal Processes that Drive Their Decisions, and Improve with Training
Dillon Plunkett
Adam Morris
Keerthi Reddy
Jorge Morales
MILM
53
0
0
21 May 2025
Consciousness in Artificial Intelligence: Insights from the Science of
  Consciousness
Consciousness in Artificial Intelligence: Insights from the Science of Consciousness
Patrick Butlin
R. Long
Eric Elmoznino
Yoshua Bengio
Jonathan C. P. Birch
...
L. Mudrik
Megan A. K. Peters
Eric Schwitzgebel
Jonathan Simon
Rufin VanRullen
LLMAG
67
105
0
17 Aug 2023
Do Models Explain Themselves? Counterfactual Simulatability of Natural
  Language Explanations
Do Models Explain Themselves? Counterfactual Simulatability of Natural Language Explanations
Yanda Chen
Ruiqi Zhong
Narutatsu Ri
Chen Zhao
He He
Jacob Steinhardt
Zhou Yu
Kathleen McKeown
LRM
87
55
0
17 Jul 2023
Measuring Faithfulness in Chain-of-Thought Reasoning
Measuring Faithfulness in Chain-of-Thought Reasoning
Tamera Lanham
Anna Chen
Ansh Radhakrishnan
Benoit Steiner
Carson E. Denison
...
Zac Hatfield-Dodds
Jared Kaplan
J. Brauner
Sam Bowman
Ethan Perez
ReLMLRM
72
193
0
17 Jul 2023
Simfluence: Modeling the Influence of Individual Training Examples by
  Simulating Training Runs
Simfluence: Modeling the Influence of Individual Training Examples by Simulating Training Runs
Kelvin Guu
Albert Webson
Ellie Pavlick
Lucas Dixon
Ian Tenney
Tolga Bolukbasi
TDI
110
36
0
14 Mar 2023
Discovering Latent Knowledge in Language Models Without Supervision
Discovering Latent Knowledge in Language Models Without Supervision
Collin Burns
Haotian Ye
Dan Klein
Jacob Steinhardt
153
386
0
07 Dec 2022
Measuring Progress on Scalable Oversight for Large Language Models
Measuring Progress on Scalable Oversight for Large Language Models
Sam Bowman
Jeeyoon Hyun
Ethan Perez
Edwin Chen
Craig Pettit
...
Tristan Hume
Yuntao Bai
Zac Hatfield-Dodds
Benjamin Mann
Jared Kaplan
ALMELM
81
132
0
04 Nov 2022
Interpretability in the Wild: a Circuit for Indirect Object
  Identification in GPT-2 small
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
Kevin Wang
Alexandre Variengien
Arthur Conmy
Buck Shlegeris
Jacob Steinhardt
316
563
0
01 Nov 2022
Toy Models of Superposition
Toy Models of Superposition
Nelson Elhage
Tristan Hume
Catherine Olsson
Nicholas Schiefer
T. Henighan
...
Sam McCandlish
Jared Kaplan
Dario Amodei
Martin Wattenberg
C. Olah
AAMLMILM
198
380
0
21 Sep 2022
Language Models (Mostly) Know What They Know
Language Models (Mostly) Know What They Know
Saurav Kadavath
Tom Conerly
Amanda Askell
T. Henighan
Dawn Drain
...
Nicholas Joseph
Benjamin Mann
Sam McCandlish
C. Olah
Jared Kaplan
ELM
133
833
0
11 Jul 2022
Emergent Abilities of Large Language Models
Emergent Abilities of Large Language Models
Jason W. Wei
Yi Tay
Rishi Bommasani
Colin Raffel
Barret Zoph
...
Tatsunori Hashimoto
Oriol Vinyals
Percy Liang
J. Dean
W. Fedus
ELMReLMLRM
292
2,518
0
15 Jun 2022
Self-critiquing models for assisting human evaluators
Self-critiquing models for assisting human evaluators
William Saunders
Catherine Yeh
Jeff Wu
Steven Bills
Ouyang Long
Jonathan Ward
Jan Leike
ALMELM
114
306
0
12 Jun 2022
Teaching Models to Express Their Uncertainty in Words
Teaching Models to Express Their Uncertainty in Words
Stephanie C. Lin
Jacob Hilton
Owain Evans
OOD
96
425
0
28 May 2022
Maieutic Prompting: Logically Consistent Reasoning with Recursive
  Explanations
Maieutic Prompting: Logically Consistent Reasoning with Recursive Explanations
Jaehun Jung
Lianhui Qin
Sean Welleck
Faeze Brahman
Chandra Bhagavatula
Ronan Le Bras
Yejin Choi
ReLMLRM
273
196
0
24 May 2022
Adversarial Training for High-Stakes Reliability
Adversarial Training for High-Stakes Reliability
Daniel M. Ziegler
Seraphina Nix
Lawrence Chan
Tim Bauman
Peter Schmidt-Nielsen
...
Noa Nabeshima
Benjamin Weinstein-Raun
D. Haas
Buck Shlegeris
Nate Thomas
AAML
126
61
0
03 May 2022
Training Compute-Optimal Large Language Models
Training Compute-Optimal Large Language Models
Jordan Hoffmann
Sebastian Borgeaud
A. Mensch
Elena Buchatskaya
Trevor Cai
...
Karen Simonyan
Erich Elsen
Jack W. Rae
Oriol Vinyals
Laurent Sifre
AI4TS
208
1,987
0
29 Mar 2022
Locating and Editing Factual Associations in GPT
Locating and Editing Factual Associations in GPT
Kevin Meng
David Bau
A. Andonian
Yonatan Belinkov
KELM
251
1,389
0
10 Feb 2022
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Jason W. Wei
Xuezhi Wang
Dale Schuurmans
Maarten Bosma
Brian Ichter
F. Xia
Ed H. Chi
Quoc Le
Denny Zhou
LM&RoLRMAI4CEReLM
853
9,714
0
28 Jan 2022
Show Your Work: Scratchpads for Intermediate Computation with Language
  Models
Show Your Work: Scratchpads for Intermediate Computation with Language Models
Maxwell Nye
Anders Andreassen
Guy Gur-Ari
Henryk Michalewski
Jacob Austin
...
Aitor Lewkowycz
Maarten Bosma
D. Luan
Charles Sutton
Augustus Odena
ReLMLRM
183
756
0
30 Nov 2021
Truthful AI: Developing and governing AI that does not lie
Truthful AI: Developing and governing AI that does not lie
Owain Evans
Owen Cotton-Barratt
Lukas Finnveden
Adam Bales
Avital Balwit
Peter Wills
Luca Righetti
William Saunders
HILM
297
117
0
13 Oct 2021
Information-Theoretic Probing with Minimum Description Length
Information-Theoretic Probing with Minimum Description Length
Elena Voita
Ivan Titov
90
276
0
27 Mar 2020
Estimating Training Data Influence by Tracing Gradient Descent
Estimating Training Data Influence by Tracing Gradient Descent
G. Pruthi
Frederick Liu
Mukund Sundararajan
Satyen Kale
TDI
108
419
0
19 Feb 2020
Scaling Laws for Neural Language Models
Scaling Laws for Neural Language Models
Jared Kaplan
Sam McCandlish
T. Henighan
Tom B. Brown
B. Chess
R. Child
Scott Gray
Alec Radford
Jeff Wu
Dario Amodei
651
4,921
0
23 Jan 2020
Fine-Tuning Language Models from Human Preferences
Fine-Tuning Language Models from Human Preferences
Daniel M. Ziegler
Nisan Stiennon
Jeff Wu
Tom B. Brown
Alec Radford
Dario Amodei
Paul Christiano
G. Irving
ALM
491
1,768
0
18 Sep 2019
CTRL: A Conditional Transformer Language Model for Controllable
  Generation
CTRL: A Conditional Transformer Language Model for Controllable Generation
N. Keskar
Bryan McCann
Lav Varshney
Caiming Xiong
R. Socher
AI4CE
132
1,239
0
11 Sep 2019
Supervising strong learners by amplifying weak experts
Supervising strong learners by amplifying weak experts
Paul Christiano
Buck Shlegeris
Dario Amodei
66
124
0
19 Oct 2018
Deep reinforcement learning from human preferences
Deep reinforcement learning from human preferences
Paul Christiano
Jan Leike
Tom B. Brown
Miljan Martic
Shane Legg
Dario Amodei
218
3,377
0
12 Jun 2017
Understanding Black-box Predictions via Influence Functions
Understanding Black-box Predictions via Influence Functions
Pang Wei Koh
Percy Liang
TDI
219
2,910
0
14 Mar 2017
Overcoming catastrophic forgetting in neural networks
Overcoming catastrophic forgetting in neural networks
J. Kirkpatrick
Razvan Pascanu
Neil C. Rabinowitz
J. Veness
Guillaume Desjardins
...
A. Grabska-Barwinska
Demis Hassabis
Claudia Clopath
D. Kumaran
R. Hadsell
CLL
374
7,587
0
02 Dec 2016
The Mythos of Model Interpretability
The Mythos of Model Interpretability
Zachary Chase Lipton
FaML
183
3,716
0
10 Jun 2016
1