Interpreting Language Reward Models via Contrastive Explanations. International Conference on Learning Representations (ICLR), 2024
The Language Interpretability Tool: Extensible, Interactive Visualizations and Analysis for NLP Models. Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020