Advances and Limitations in Open Source Arabic-Script OCR: A Case Study

8 February 2024

Abstract

This work presents an accuracy study of the open-source OCR engine Kraken on the leading Arabic scholarly journal, al-Abhath. In contrast with commercially available OCR engines, Kraken is shown to be capable of producing highly accurate Arabic-script OCR. The study also assesses the relative accuracy of typeface-specific and generalized models on the al-Abhath data and provides a microanalysis of the "error instances" and the contextual features that may have contributed to OCR misrecognition. Building on this analysis, the paper argues that Arabic-script OCR can be significantly improved through (1) a more systematic approach to training data production, and (2) the development of key technological components, especially multi-language models and improved line segmentation and layout analysis.
