Naamapadam: A Large-Scale Named Entity Annotated Data for Indic Languages

20 December 2022
A. Mhaske
Harsh Kedia
Sumanth Doddapaneni
Mitesh M. Khapra
Pratyush Kumar
V. Rudramurthy
Anoop Kunchukuttan
Abstract

We present Naamapadam, the largest publicly available Named Entity Recognition (NER) dataset for the 11 major Indian languages from two language families. The dataset contains more than 400k sentences annotated with a total of at least 100k entities from three standard entity categories (Person, Location, and Organization) for 9 out of the 11 languages. The training dataset has been automatically created from the Samanantar parallel corpus by projecting automatically tagged entities from an English sentence to the corresponding Indian language translation. We also create manually annotated test sets for 9 languages. We demonstrate the utility of the obtained dataset on the Naamapadam-test dataset. We also release IndicNER, a multilingual IndicBERT model fine-tuned on the Naamapadam training set. IndicNER achieves an F1 score of more than 80 for 7 out of 9 test languages. The dataset and models are available under open-source licences at https://ai4bharat.iitm.ac.in/naamapadam.
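
The projection step described in the abstract can be illustrated with a minimal sketch. It assumes that BIO tags for the English sentence and a word alignment between the English and target-language tokens are already available; the abstract does not say how the authors obtain alignments or resolve conflicts, so the function name `project_entities`, the contiguous-span heuristic, and the example sentence are all hypothetical and are not the paper's actual method.

```python
from typing import List, Tuple


def project_entities(
    src_tags: List[str],
    alignment: List[Tuple[int, int]],
    tgt_len: int,
) -> List[str]:
    """Project BIO tags from source tokens onto target tokens.

    alignment holds (source_index, target_index) pairs. This is an
    illustrative heuristic, not the procedure used to build Naamapadam.
    """
    # Step 1: group source tokens into entity spans (type, token indices).
    spans = []
    current = None
    for i, tag in enumerate(src_tags):
        if tag.startswith("B-"):
            current = (tag[2:], [i])
            spans.append(current)
        elif tag.startswith("I-") and current is not None and current[0] == tag[2:]:
            current[1].append(i)
        else:
            current = None

    # Step 2: label the target range covered by each span's aligned tokens.
    tgt_tags = ["O"] * tgt_len
    for ent_type, src_idx in spans:
        tgt_idx = sorted({t for s, t in alignment if s in src_idx})
        if not tgt_idx:
            continue  # entity has no aligned target tokens, so it is dropped
        start, end = tgt_idx[0], tgt_idx[-1]
        if any(tag != "O" for tag in tgt_tags[start:end + 1]):
            continue  # assumption: skip projections that overlap an earlier one
        tgt_tags[start] = "B-" + ent_type
        for j in range(start + 1, end + 1):
            tgt_tags[j] = "I-" + ent_type
    return tgt_tags


# Hypothetical example: "Sachin Tendulkar lives in Mumbai" aligned to a
# six-token Hindi translation in which "Mumbai" is the third token.
src_tags = ["B-PER", "I-PER", "O", "O", "B-LOC"]
alignment = [(0, 0), (1, 1), (2, 4), (3, 3), (4, 2)]
projected = project_entities(src_tags, alignment, tgt_len=6)
print(projected)
```

Labelling the full range between the first and last aligned target token keeps each projected entity contiguous, at the cost of occasionally absorbing unaligned words that fall inside the range.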
