ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2505.13404
12
0

Granary: Speech Recognition and Translation Dataset in 25 European Languages

19 May 2025
Nithin Rao Koluguri
Monica Sekoyan
George Zelenfroynd
Sasha Meister
Shuoyang Ding
Sofia Kostandian
He Huang
Nikolay Karpov
Jagadeesh Balam
Vitaly Lavrukhin
Yifan Peng
Sara Papi
Marco Gaido
Alessio Brutti
Boris Ginsburg
ArXivPDFHTML
Abstract

Multi-task and multilingual approaches benefit large models, yet speech processing for low-resource languages remains underexplored due to data scarcity. To address this, we present Granary, a large-scale collection of speech datasets for recognition and translation across 25 European languages. This is the first open-source effort at this scale for both transcription and translation. We enhance data quality using a pseudo-labeling pipeline with segmentation, two-pass inference, hallucination filtering, and punctuation restoration. We further generate translation pairs from pseudo-labeled transcriptions using EuroLLM, followed by a data filtration pipeline. Designed for efficiency, our pipeline processes vast amount of data within hours. We assess models trained on processed data by comparing their performance on previously curated datasets for both high- and low-resource languages. Our findings show that these models achieve similar performance using approx. 50% less data. Dataset will be made available atthis https URL

View on arXiv
@article{koluguri2025_2505.13404,
  title={ Granary: Speech Recognition and Translation Dataset in 25 European Languages },
  author={ Nithin Rao Koluguri and Monica Sekoyan and George Zelenfroynd and Sasha Meister and Shuoyang Ding and Sofia Kostandian and He Huang and Nikolay Karpov and Jagadeesh Balam and Vitaly Lavrukhin and Yifan Peng and Sara Papi and Marco Gaido and Alessio Brutti and Boris Ginsburg },
  journal={arXiv preprint arXiv:2505.13404},
  year={ 2025 }
}
Comments on this paper