Challenging the Abilities of Large Language Models in Italian: a Community Initiative

Malvina Nissim
Danilo Croce
Viviana Patti
Pierpaolo Basile
Giuseppe Attanasio
Elio Musacchio
Matteo Rinaldi
Federico Borazio
Maria Francis
Jacopo Gili
Daniel Scalena
Begoña Altuna
Ekhi Azurmendi
Valerio Basile
Luisa Bentivogli
Arianna Bisazza
Marianna Bolognesi
Dominique Brunato
Tommaso Caselli
Silvia Casola
Maria Cassese
Mauro Cettolo
Claudia Collacciani
Leonardo De Cosmo
Maria Pia Di Buono
Andrea Esuli
Julen Etxaniz
Chiara Ferrando
Alessia Fidelangeli
Simona Frenda
Achille Fusco
Marco Gaido
Andrea Galassi
Federico Galli
Luca Giordano
Mattia Goffetti
Itziar Gonzalez-Dios
Lorenzo Gregori
Giulia Grundler
Sandro Iannaccone
Chunyang Jiang
Moreno La Quatra
Francesca Lagioia
Soda Marem Lo
Marco Madeddu
Bernardo Magnini
Raffaele Manna
Fabio Mercorio
Paola Merlo
Arianna Muti
Vivi Nastase
Matteo Negri
Dario Onorati
Elena Palmieri
Sara Papi
Lucia Passaro
Giulia Pensa
Andrea Piergentili
Daniele Potertì
Giovanni Puccetti
Federico Ranaldi
Leonardo Ranaldi
Andrea Amelio Ravelli
Martina Rosola
Elena Sofia Ruzzetti
Giuseppe Samo
Andrea Santilli
Piera Santin
Gabriele Sarti
Giovanni Sartor
Beatrice Savoldi
Antonio Serino
Andrea Seveso
Lucia Siciliani
Paolo Torroni
Rossella Varvara
Andrea Zaninello
Asya Zanollo
Fabio Massimo Zanzotto
Kamyar Zeinalipour
Andrea Zugarini
Main: 54 pages · Bibliography: 14 pages · 5 tables · Appendix: 4 pages
Abstract

The rapid progress of Large Language Models (LLMs) has transformed natural language processing and broadened its impact across research and society. Yet, systematic evaluation of these models, especially for languages beyond English, remains limited. "Challenging the Abilities of LAnguage Models in ITAlian" (CALAMITA) is a large-scale collaborative benchmarking initiative for Italian, coordinated under the Italian Association for Computational Linguistics. Unlike existing efforts that focus on leaderboards, CALAMITA foregrounds methodology: it federates more than 80 contributors from academia, industry, and the public sector to design, document, and evaluate a diverse collection of tasks, covering linguistic competence, commonsense reasoning, factual consistency, fairness, summarization, translation, and code generation. Through this process, we not only assembled a benchmark of over 20 tasks and almost 100 subtasks, but also established a centralized evaluation pipeline that supports heterogeneous datasets and metrics. We report results for four open-weight LLMs, highlighting systematic strengths and weaknesses across abilities, as well as challenges in task-specific evaluation. Beyond quantitative results, CALAMITA exposes methodological lessons: the necessity of fine-grained, task-representative metrics, the importance of harmonized pipelines, and the benefits and limitations of broad community engagement. CALAMITA is conceived as a rolling benchmark, enabling continuous integration of new tasks and models. This makes it both a resource -- the most comprehensive and diverse benchmark for Italian to date -- and a framework for sustainable, community-driven evaluation. We argue that this combination offers a blueprint for other languages and communities seeking inclusive and rigorous LLM evaluation practices.
