
PreP-OCR: A Complete Pipeline for Document Image Restoration and Enhanced OCR Accuracy

Main: 8 pages, Bibliography: 4 pages, Appendix: 1 page, 10 figures, 7 tables
Abstract

This paper introduces PreP-OCR, a two-stage pipeline that combines document image restoration with semantic-aware post-OCR correction to enhance both visual clarity and textual consistency, thereby improving text extraction from degraded historical documents. First, we synthesize document-image pairs from plaintext, rendering them with diverse fonts and layouts and then applying a randomly ordered set of degradation operations. An image restoration model is trained on this synthetic data, using multi-directional patch extraction and fusion to process large images. Second, a ByT5 post-OCR model, fine-tuned on synthetic historical text pairs, addresses remaining OCR errors. Detailed experiments on 13,831 pages of real historical documents in English, French, and Spanish show that the PreP-OCR pipeline reduces character error rates by 63.9-70.3% compared to OCR on raw images. Our pipeline demonstrates the potential of integrating image restoration with linguistic error correction for digitizing historical archives.
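The reported improvement is stated as a reduction in character error rate (CER). As a minimal sketch of how such a figure can be computed, the snippet below assumes the common definition of CER as Levenshtein edit distance divided by reference length, and assumes the reduction is relative to the raw-image baseline; the abstract does not spell out either definition. The function names `edit_distance`, `cer`, and `relative_reduction` are illustrative, not taken from the paper.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(hyp) + 1))  # distances from the empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning ref[:i] into an empty hypothesis costs i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r
                            curr[j - 1] + 1,      # insert h
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate (assumed definition: edits / reference length)."""
    if not ref:
        raise ValueError("reference text must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline_cer: float, pipeline_cer: float) -> float:
    """Fraction of the baseline CER removed (assumed: relative, not absolute)."""
    if baseline_cer <= 0:
        raise ValueError("baseline CER must be positive")
    return (baseline_cer - pipeline_cer) / baseline_cer


# Toy strings for illustration only, not data from the paper.
reference = "historical archives"
ocr_raw = "hist0rical arch1ves"        # two substitution errors
ocr_pipeline = "historical archiwes"   # one substitution error

baseline = cer(reference, ocr_raw)
improved = cer(reference, ocr_pipeline)
print(f"relative CER reduction: {relative_reduction(baseline, improved):.1%}")
```

Under these assumptions, a 63.9% reduction would mean the pipeline's CER is roughly 36% of the CER obtained by running OCR on the raw images.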

@article{guan2025_2505.20429,
  title={PreP-OCR: A Complete Pipeline for Document Image Restoration and Enhanced OCR Accuracy},
  author={Shuhao Guan and Moule Lin and Cheng Xu and Xinyi Liu and Jinman Zhao and Jiexin Fan and Qi Xu and Derek Greene},
  journal={arXiv preprint arXiv:2505.20429},
  year={2025}
}