Improving alignment of dialogue agents via targeted human judgements

28 September 2022
Amelia Glaese
Nat McAleese
Maja Trębacz
John Aslanides
Vlad Firoiu
Timo Ewalds
Maribeth Rauh
Laura Weidinger
Martin Chadwick
Phoebe Thacker
Lucy Campbell-Gillingham
Jonathan Uesato
Po-Sen Huang
Ramona Comanescu
Fan Yang
Abigail See
Sumanth Dathathri
Rory Greig
Charlie Chen
Doug Fritz
Jaume Sanchez Elias
Richard Green
Soňa Mokrá
Nicholas Fernando
Boxi Wu
Rachel Foley
Susannah Young
Iason Gabriel
William S. Isaac
John F. J. Mellor
Demis Hassabis
Koray Kavukcuoglu
Lisa Anne Hendricks
Geoffrey Irving
    ALM
    AAML
Abstract

We present Sparrow, an information-seeking dialogue agent trained to be more helpful, correct, and harmless compared to prompted language model baselines. We use reinforcement learning from human feedback to train our models with two new additions to help human raters judge agent behaviour. First, to make our agent more helpful and harmless, we break down the requirements for good dialogue into natural language rules the agent should follow, and ask raters about each rule separately. We demonstrate that this breakdown enables us to collect more targeted human judgements of agent behaviour and allows for more efficient rule-conditional reward models. Second, our agent provides evidence from sources supporting factual claims when collecting preference judgements over model statements. For factual questions, evidence provided by Sparrow supports the sampled response 78% of the time. Sparrow is preferred more often than baselines while being more resilient to adversarial probing by humans, violating our rules only 8% of the time when probed. Finally, we conduct extensive analyses showing that though our model learns to follow our rules it can exhibit distributional biases.
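The first contribution described above, rule-conditional reward models, has a natural minimal shape: one shared model takes both the dialogue and the text of a single rule and predicts a violation probability, which the RL loop then subtracts from the preference reward. The sketch below is a hypothetical illustration of that reading only; the toy encoder, class names, example rules, and the reward-mixing function are all assumptions for the sake of a runnable example, not the authors' implementation (the paper's reward models are built on large language models).

```python
# Hypothetical sketch of rule-conditional reward modelling in the spirit
# of Sparrow. Everything here (toy encoder, class names, reward mix) is
# illustrative, not the paper's code.
import torch
import torch.nn as nn

class ToyTextEncoder(nn.Module):
    """Hash-bucket bag-of-words pooling; a stand-in for an LM encoder."""
    def __init__(self, dim: int = 128, buckets: int = 10_000):
        super().__init__()
        self.buckets = buckets
        self.emb = nn.EmbeddingBag(buckets, dim)

    def forward(self, text: str) -> torch.Tensor:
        ids = torch.tensor([[hash(w) % self.buckets for w in text.split()]])
        return self.emb(ids)  # (1, dim) pooled representation

class RuleConditionalRewardModel(nn.Module):
    """Scores P(rule violated | dialogue, rule) with one set of weights.

    Because the rule text is part of the input, the targeted per-rule
    human judgements all train the same shared model, which is what
    makes the rule breakdown data-efficient.
    """
    def __init__(self, dim: int = 128):
        super().__init__()
        self.encoder = ToyTextEncoder(dim)
        self.head = nn.Linear(dim, 1)

    def forward(self, dialogue: str, rule: str) -> torch.Tensor:
        prompt = f"{dialogue}\n\nRule: {rule}\nDoes the reply break the rule?"
        return torch.sigmoid(self.head(self.encoder(prompt)))

def rl_reward(preference_score: float, violation_probs: list[float],
              penalty: float = 1.0) -> float:
    """Mix the preference reward with rule penalties (one plausible
    scheme; the paper's exact weighting may differ)."""
    return preference_score - penalty * max(violation_probs)

# Usage: score a candidate reply against each rule, then combine.
rules = [
    "Do not pretend to have a human body or physical experiences.",
    "Only make statements that could plausibly be true.",
]
rm = RuleConditionalRewardModel()
dialogue = "User: What's the capital of France?\nSparrow: Paris."
probs = [rm(dialogue, r).item() for r in rules]
print(rl_reward(preference_score=0.7, violation_probs=probs))
```

Conditioning on the rule text rather than training one classifier per rule means a new rule only requires new judgements, not a new model, which is one way to read the abstract's efficiency claim.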
