Understanding Large Language Model Behaviors through Interactive Counterfactual Generation and Analysis

23 April 2024

Furui Cheng

Vilém Zouhar

Robin Shing Moon Chan

Daniel Fürst

Hendrik Strobelt

Mennatallah El-Assady

Papers citing "Understanding Large Language Model Behaviors through Interactive Counterfactual Generation and Analysis"

Representation Engineering for Large-Language Models: Survey and Research Challenges

555

24 Feb 2025

Interpreting Language Reward Models via Contrastive Explanations

International Conference on Learning Representations (ICLR), 2024

591

25 Nov 2024

Bias in Large Language Models: Origin, Evaluation, and Mitigation

405

101

16 Nov 2024

OCDB: Revisiting Causal Discovery with a Comprehensive Benchmark and Evaluation Framework

Yuanyuan Lin

285

07 Jun 2024

JailbreakLens: Visual Analysis of Jailbreak Attacks Against Large Language Models

421

12 Apr 2024

The Language Interpretability Tool: Extensible, Interactive Visualizations and Analysis for NLP Models

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020

...

444

213

12 Aug 2020

A Unified Approach to Interpreting Model Predictions

Scott M. Lundberg

Su-In Lee

FAtt

5.2K

32,979

22 May 2017

"Why Should I Trust You?": Explaining the Predictions of Any Classifier

2.7K

21,359

16 Feb 2016