Assessing the Performance Gap Between Lexical and Semantic Models for Information Retrieval With Formulaic Legal Language

Main: 13 pages · Bibliography: 2 pages · Appendix: 1 page · 7 figures · 16 tables
Abstract

Legal passage retrieval assists legal practitioners in the time-intensive process of finding relevant precedents to support legal arguments. This study investigates the retrieval of passages (paragraphs) from decisions of the Court of Justice of the European Union (CJEU), whose language is highly structured and formulaic, leading to repetitive patterns. Understanding when lexical or semantic models handle this repetitiveness more effectively is key to developing retrieval systems that are more accurate, efficient, and transparent for specific legal domains. To this end, we explore when this routinized legal language is better suited to retrieval with methods that rely on lexical and statistical features, such as BM25, or with dense retrieval models trained to capture semantic and contextual information. A qualitative and quantitative analysis with three complementary metrics shows that both lexical and dense models perform well in scenarios with more repetitive language use, whereas BM25 outperforms the dense models in more nuanced scenarios where repetition and verbatim quotes are less prevalent, and on longer queries. Our experiments also show that BM25 is a strong baseline, surpassing off-the-shelf dense models in 4 out of 7 performance metrics. However, fine-tuning a dense model on domain-specific data improves performance, surpassing BM25 in most metrics, and we analyze how the amount of fine-tuning data affects the model's performance and temporal robustness. The code, dataset, and appendix related to this work are available at this https URL.
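As a rough illustration of what "lexical and statistical features" means here, the sketch below scores passages against a query with a common Okapi BM25 variant. It is not the authors' implementation: the whitespace tokenizer, the parameter defaults `k1=1.5` and `b=0.75`, the function names, and the toy passages are all assumptions made for this example.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: naive lowercase whitespace tokenization, for illustration only.
    return text.lower().split()


def bm25_scores(query, passages, k1=1.5, b=0.75):
    """Return one BM25 score per passage (higher means a closer lexical match)."""
    docs = [tokenize(p) for p in passages]
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # Document frequency: number of passages containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # Smoothed IDF that stays positive for every term.
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with passage-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical toy passages, not taken from the CJEU dataset.
passages = [
    "the principle of free movement of goods applies",
    "the court dismissed the appeal on costs",
    "national measures restricting free movement of workers",
]
scores = bm25_scores("free movement of goods", passages)
```

Because the score depends only on overlapping terms, a passage that repeats the query's wording ranks highest, while a passage sharing no terms scores zero. This is one plausible reason lexical matching suits formulaic, quotation-heavy text.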

@article{mori2025_2506.12895,
  title={Assessing the Performance Gap Between Lexical and Semantic Models for Information Retrieval With Formulaic Legal Language},
  author={Larissa Mori and Carlos Sousa de Oliveira and Yuehwern Yih and Mario Ventresca},
  journal={arXiv preprint arXiv:2506.12895},
  year={2025}
}