Let the Models Respond: Interpreting Language Model Detoxification Through the Lens of Prompt Dependence

1 September 2023

Papers citing "Let the Models Respond: Interpreting Language Model Detoxification Through the Lens of Prompt Dependence"


Improving alignment of dialogue agents via targeted human judgements
Amelia Glaese, Nat McAleese, Maja Trębacz, John Aslanides, Vlad Firoiu, ..., John F. J. Mellor, Demis Hassabis, Koray Kavukcuoglu, Lisa Anne Hendricks, G. Irving
ALM, AAML; 239; 507; 0; 28 Sep 2022

Training language models to follow instructions with human feedback
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, ..., Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan J. Lowe
OSLM, ALM; 387; 12,150; 0; 04 Mar 2022