Multilingual Attribute Extraction from News Web Pages

4 February 2025
Pavel Bedrin
Maksim Varlamov
Alexander Yatskov
Abstract

This paper addresses the challenge of automatically extracting attributes from news article web pages across multiple languages. Recent neural network models have shown high efficacy in extracting information from semi-structured web pages. However, these models are predominantly applied to domains like e-commerce and are pre-trained using English data, complicating their application to web pages in other languages. We prepared a multilingual dataset comprising 3,172 marked-up news web pages across six languages (English, German, Russian, Chinese, Korean, and Arabic) from 161 websites. The dataset is publicly available on GitHub. We fine-tuned the pre-trained state-of-the-art model, MarkupLM, to extract news attributes from these pages and evaluated the impact of translating pages into English on extraction quality. Additionally, we pre-trained another state-of-the-art model, DOM-LM, on multilingual data and fine-tuned it on our dataset. We compared both fine-tuned models to existing open-source news data extraction tools, achieving superior extraction metrics.
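The abstract compares models by "extraction metrics" without naming them. As an illustration only, the sketch below assumes exact-match precision, recall, and F1 computed per attribute, a common choice for attribute extraction. The function name `attribute_prf` and the attribute names `title` and `author` are hypothetical and are not taken from the paper.

```python
from collections import Counter


def attribute_prf(gold, pred):
    """Per-attribute exact-match precision, recall and F1.

    gold, pred: lists of dicts (one dict per page) mapping a
    hypothetical attribute name to its extracted string value.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        for attr in set(g) | set(p):
            gv, pv = g.get(attr), p.get(attr)
            if gv is not None and pv is not None and gv.strip() == pv.strip():
                tp[attr] += 1  # predicted value matches the annotation
            else:
                if pv is not None:
                    fp[attr] += 1  # wrong or spurious prediction
                if gv is not None:
                    fn[attr] += 1  # annotated value was missed
    scores = {}
    for attr in set(tp) | set(fp) | set(fn):
        n_pred = tp[attr] + fp[attr]
        n_gold = tp[attr] + fn[attr]
        prec = tp[attr] / n_pred if n_pred else 0.0
        rec = tp[attr] / n_gold if n_gold else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        scores[attr] = (prec, rec, f1)
    return scores


# Toy data, invented for illustration (not from the dataset).
gold = [{"title": "A", "author": "X"}, {"title": "B"}]
pred = [{"title": "A", "author": "Y"}, {"title": "B", "author": "Z"}]
scores = attribute_prf(gold, pred)
```

In this toy case both titles match exactly, while both predicted authors count as errors: one contradicts the annotation and one is predicted where no author is annotated.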

@article{bedrin2025_2502.02167,
  title={Multilingual Attribute Extraction from News Web Pages},
  author={Pavel Bedrin and Maksim Varlamov and Alexander Yatskov},
  journal={arXiv preprint arXiv:2502.02167},
  year={2025}
}