Down the Toxicity Rabbit Hole: A Novel Framework to Bias Audit Large Language Models

8 September 2023

Papers citing "Down the Toxicity Rabbit Hole: A Novel Framework to Bias Audit Large Language Models"

Improving alignment of dialogue agents via targeted human judgements
Amelia Glaese, Nat McAleese, Maja Trębacz, John Aslanides, Vlad Firoiu, ..., John F. J. Mellor, Demis Hassabis, Koray Kavukcuoglu, Lisa Anne Hendricks, G. Irving
ALM AAML · 227 · 502 · 0 · 28 Sep 2022

Training language models to follow instructions with human feedback
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, ..., Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan J. Lowe
OSLM ALM · 319 · 11,953 · 0 · 04 Mar 2022

A Framework for the Computational Linguistic Analysis of Dehumanization
Julia Mendelsohn, Yulia Tsvetkov, Dan Jurafsky
84 · 89 · 0 · 06 Mar 2020