Bián: A Bilingual Benchmark and Model for Hallucination Detection in Retrieval-Augmented Generation

Abstract

Retrieval-Augmented Generation (RAG) effectively reduces hallucinations in Large Language Models (LLMs) but can still produce inconsistent or unsupported content. Although LLM-as-a-Judge is widely used for RAG hallucination detection because it is simple to implement, it faces two main challenges: the absence of comprehensive evaluation benchmarks and the lack of domain-optimized judge models. To bridge these gaps, we introduce Bián, a novel framework featuring a bilingual benchmark dataset and lightweight judge models. The dataset supports rigorous evaluation across multiple RAG scenarios, while the judge models are fine-tuned from compact open-source LLMs. Extensive experimental evaluations on BiánBench show that our 14B model outperforms baseline models with more than five times as many parameters and rivals state-of-the-art closed-source LLMs. We will release our data and models soon at this https URL.

@article{jiang2025_2502.19209,
  title={Bián: A Bilingual Benchmark and Model for Hallucination Detection in Retrieval-Augmented Generation},
  author={Zhouyu Jiang and Mengshu Sun and Zhiqiang Zhang and Lei Liang},
  journal={arXiv preprint arXiv:2502.19209},
  year={2025}
}