This paper introduces PreP-OCR, a two-stage pipeline that combines document image restoration with semantic-aware post-OCR correction to improve text extraction from degraded historical documents. Our key innovation lies in jointly optimizing image clarity and linguistic consistency. First, we generate synthetic image pairs with randomized text fonts, layouts, and degradations. An image restoration model is trained on this synthetic data, using multi-directional patch extraction and fusion to process large images. Second, a ByT5 post-corrector, fine-tuned on synthetic historical text training pairs, addresses any remaining OCR errors. Detailed experiments on 13,831 pages of real historical documents in English, French, and Spanish show that the PreP-OCR pipeline reduces character error rates by 63.9-70.3% compared to OCR on raw images. Our pipeline demonstrates the potential of integrating image restoration with linguistic error correction for digitizing historical archives.
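
For readers unfamiliar with the metric, the following is a minimal sketch of how a character error rate (CER) and a relative CER reduction are commonly computed. It assumes the conventional definition (Levenshtein edit distance divided by reference length); the abstract does not state the exact normalization used in the paper, and the function names `levenshtein`, `cer`, and `relative_reduction` are hypothetical helpers introduced here for illustration only.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion, and substitution."""
    # Keep the shorter string as the inner loop to limit memory use.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion from a
                    current[j - 1] + 1,      # insertion into a
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not reference:
        raise ValueError("reference text must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


def relative_reduction(baseline_cer: float, improved_cer: float) -> float:
    """Fraction by which the improved CER is lower than the baseline CER."""
    if baseline_cer <= 0:
        raise ValueError("baseline CER must be positive")
    return (baseline_cer - improved_cer) / baseline_cer
```

Under this definition, a reported reduction in the 63.9-70.3% range would correspond to `relative_reduction(raw_cer, pipeline_cer)` falling between roughly 0.639 and 0.703, where `raw_cer` is measured on OCR output from unrestored images and `pipeline_cer` on the output of the full pipeline.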


@article{guan2025_2505.20429,
  title={PreP-OCR: A Complete Pipeline for Document Image Restoration and Enhanced OCR Accuracy},
  author={Shuhao Guan and Moule Lin and Cheng Xu and Xinyi Liu and Jinman Zhao and Jiexin Fan and Qi Xu and Derek Greene},
  journal={arXiv preprint arXiv:2505.20429},
  year={2025}
}
