Low-Resource Language Processing: An OCR-Driven Summarization and Translation Pipeline

16 May 2025

Abstract

This paper presents an end-to-end suite for multilingual information extraction and processing from image-based documents. The system uses Optical Character Recognition (Tesseract) to extract text in languages such as English, Hindi, and Tamil, then passes it through a pipeline built on large language model APIs (Gemini) for cross-lingual translation, abstractive summarization, and re-translation into a target language. Additional modules provide sentiment analysis (TensorFlow), topic classification (Transformers), and date extraction (Regex) to improve document comprehension. Delivered through an accessible Gradio interface, the system demonstrates a practical application of existing libraries, models, and APIs to bridge the language gap and improve access to information in image-based media across different linguistic environments.
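The stage ordering described in the abstract (OCR, translation, summarization, re-translation, plus regex date extraction) can be illustrated with a minimal sketch. This is not the authors' implementation. The names `run_pipeline`, `extract_dates`, `PipelineResult`, and `DATE_PATTERN` are hypothetical, the date patterns are illustrative guesses, and the use of English as an intermediate (pivot) language is an assumption that the abstract does not confirm. The OCR and LLM stages are passed in as plain callables so that, for example, a Tesseract wrapper and a Gemini client could be plugged in without changing the control flow.

```python
import re
from dataclasses import dataclass
from typing import Callable, List

# Illustrative date patterns only; the paper does not list the ones it uses.
DATE_PATTERN = re.compile(
    r"\b(?:"
    r"\d{1,2}[/-]\d{1,2}[/-]\d{2,4}"      # 01/07/2025 or 1-7-25
    r"|\d{4}-\d{2}-\d{2}"                 # 2025-06-01
    r"|\d{1,2}\s+(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)"
    r"[a-z]*\.?\s+\d{4}"                  # 16 May 2025
    r")\b",
    re.IGNORECASE,
)


def extract_dates(text: str) -> List[str]:
    """Return every substring that matches one of the date patterns."""
    return DATE_PATTERN.findall(text)


@dataclass
class PipelineResult:
    extracted_text: str   # raw OCR output in the source language
    summary: str          # summary re-translated into the target language
    dates: List[str]      # dates found in the pivot-language text


def run_pipeline(
    image_path: str,
    ocr: Callable[[str, str], str],             # (image_path, lang) -> text
    translate: Callable[[str, str, str], str],  # (text, src, dst) -> text
    summarize: Callable[[str], str],            # text -> summary
    source_lang: str,
    target_lang: str,
    pivot_lang: str = "en",                     # assumption: English pivot
) -> PipelineResult:
    """Chain the stages: OCR, translate, summarize, re-translate."""
    text = ocr(image_path, source_lang)
    pivot_text = translate(text, source_lang, pivot_lang)
    pivot_summary = summarize(pivot_text)
    final_summary = translate(pivot_summary, pivot_lang, target_lang)
    return PipelineResult(
        extracted_text=text,
        summary=final_summary,
        dates=extract_dates(pivot_text),
    )
```

Because each stage is an injected callable, the control flow can be exercised with simple stand-in functions before any OCR engine or API key is available.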


@article{madhavi2025_2505.11177,
  title={Low-Resource Language Processing: An OCR-Driven Summarization and Translation Pipeline},
  author={Hrishit Madhavi and Jacob Cherian and Yuvraj Khamkar and Dhananjay Bhagat},
  journal={arXiv preprint arXiv:2505.11177},
  year={2025}
}
