2402.18496
Cited By
Language Models Represent Beliefs of Self and Others
28 February 2024
Wentao Zhu
Zhining Zhang
Yizhou Wang
MILM
LRM
Papers citing
"Language Models Represent Beliefs of Self and Others"
14 / 14 papers shown
Title
The Energy Loss Phenomenon in RLHF: A New Perspective on Mitigating Reward Hacking
Yuchun Miao
Sen Zhang
Liang Ding
Yuqi Zhang
L. Zhang
Dacheng Tao
81
3
0
31 Jan 2025
Towards Safe and Honest AI Agents with Neural Self-Other Overlap
Marc Carauleanu
Michael Vaiana
Judd Rosenblatt
Cameron Berg
Diogo Schwerz de Lucena
68
0
0
20 Dec 2024
Learning Human-Aware Robot Policies for Adaptive Assistance
Jason Qin
Shikun Ban
Wentao Zhu
Yizhou Wang
Dimitris Samaras
81
0
0
16 Dec 2024
A Survey on Large Language Model-Based Social Agents in Game-Theoretic Scenarios
Xiachong Feng
Longxu Dou
Ella Li
Qinghao Wang
Haoran Wang
Yu Guo
Chang Ma
Lingpeng Kong
LM&Ro
LM&MA
ELM
LLMAG
AI4CE
70
4
0
05 Dec 2024
FairMindSim: Alignment of Behavior, Emotion, and Belief in Humans and LLM Agents Amid Ethical Dilemmas
Yu Lei
Hao Liu
Chengxing Xie
Songjia Liu
Zhiyu Yin
Canyu Chen
Bernard Ghanem
Philip H. S. Torr
Zhen Wu
33
2
0
14 Oct 2024
DynFrs: An Efficient Framework for Machine Unlearning in Random Forest
Shurong Wang
Zhuoyang Shen
Xinbao Qiao
Tongning Zhang
Meng Zhang
MU
21
0
0
02 Oct 2024
Brittle Minds, Fixable Activations: Understanding Belief Representations in Language Models
Matteo Bortoletto
Constantin Ruhdorfer
Lei Shi
Andreas Bulling
AI4MH
LRM
46
5
0
25 Jun 2024
Truth-value judgment in language models: belief directions are context sensitive
Stefan F. Schouten
Peter Bloem
Ilia Markov
Piek Vossen
KELM
71
1
0
29 Apr 2024
DeepSeek LLM: Scaling Open-Source Language Models with Longtermism
DeepSeek-AI
Xiao Bi
Deli Chen
Guanting Chen
...
Yao Zhao
Shangyan Zhou
Shunfeng Zhou
Qihao Zhu
Yuheng Zou
LRM
ALM
139
306
0
05 Jan 2024
Sparks of Artificial General Intelligence: Early experiments with GPT-4
Sébastien Bubeck
Varun Chandrasekaran
Ronen Eldan
J. Gehrke
Eric Horvitz
...
Scott M. Lundberg
Harsha Nori
Hamid Palangi
Marco Tulio Ribeiro
Yi Zhang
ELM
AI4MH
AI4CE
ALM
298
3,007
0
22 Mar 2023
Toy Models of Superposition
Nelson Elhage
Tristan Hume
Catherine Olsson
Nicholas Schiefer
T. Henighan
...
Sam McCandlish
Jared Kaplan
Dario Amodei
Martin Wattenberg
C. Olah
AAML
MILM
125
318
0
21 Sep 2022
Probing Classifiers: Promises, Shortcomings, and Advances
Yonatan Belinkov
226
405
0
24 Feb 2021
Baby Intuitions Benchmark (BIB): Discerning the goals, preferences, and actions of others
Kanishk Gandhi
Gala Stojnic
Brenden Lake
M. Dillon
48
46
0
23 Feb 2021
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
Alex Wang
Amanpreet Singh
Julian Michael
Felix Hill
Omer Levy
Samuel R. Bowman
ELM
297
6,959
0
20 Apr 2018