Heterogeneous document processing: case study of mathematical texts

Burtseva Lyudmila; Cojocaru Svetlana; Malahov Ludmila; Colesnicov Alexandru


SM ISO690:2012

BURTSEVA, Lyudmila, COJOCARU, Svetlana, MALAHOV, Ludmila, COLESNICOV, Alexandru. Heterogeneous document processing: case study of mathematical texts. In: Mathematics and Information Technologies: Research and Education, Ed. 2021, 1-3 July 2021, Chișinău. Chișinău, Republic of Moldova: 2021, pp. 96-97.


Mathematics and Information Technologies: Research and Education 2021

Conference "Mathematics and Information Technologies: Research and Education"
2021, Chișinău, Moldova, 1-3 July 2021


Pages 96-97

Burtseva Lyudmila, Cojocaru Svetlana, Malahov Ludmila, Colesnicov Alexandru

Vladimir Andrunachievici Institute of Mathematics and Computer Science

Available in IBN: 1 July 2021


Abstract

Most historical documents are heterogeneous: along with text, they contain elements of a different nature. The aim of our research is to create a web-based platform that processes such documents and produces representations of the nontextual elements in scripting languages. The general problem is to recognize the document layout and then to apply a recognition method suited to each type of content [1]. One of the subproblems is mathematical formula recognition. Text OCR systems cannot adequately recognize formulas. The republishing of a monograph [2] showed that mathematical formulas make up half of the book. The formulas in the reissued version were entered manually, which took much more time and effort than processing the text. Since the introduction of deep learning techniques, significant progress has been achieved in formula recognition. Modern systems recognize rather complex formulas, both printed and handwritten. Open-source systems that solve this problem include, for example, im2latex, image2latex, and LaTeX-OCR. We tested them and selected LaTeX-OCR. LaTeX-OCR is written in Python and comes with clear installation and usage instructions. It is supplied with a dataset that contains about 200,000 items and covers all basic LaTeX macros. This dataset can be supplemented with sample images from the documents that the user intends to recognize. We also tested several commercial systems: SESHAT, INFTY, Mathpix, MT-Recognition. The best results were demonstrated by Mathpix. It recognizes complex formulas and texts in many languages. When tested on a page from [2], Mathpix made only two errors, both due to image quality. Commercial systems offer only limited free access. For example, Mathpix performs free recognition of no more than 50 images per month.
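
The two-stage scheme described in the abstract (layout recognition first, then a recognizer chosen by content type) might be sketched as follows. This is only an illustration under stated assumptions: the `Region` structure, the recognizer functions, and the `RECOGNIZERS` table are hypothetical names introduced here, not part of the authors' platform, and the recognizers are placeholders that stand in for a text OCR engine and a formula recognizer such as LaTeX-OCR.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Region:
    """One layout block found on a page (hypothetical structure)."""
    kind: str      # content type assigned by layout analysis, e.g. "text"
    content: str   # stand-in for the cropped image of the block


def recognize_text(region: Region) -> str:
    # Placeholder for a text OCR engine: returns the content unchanged.
    return region.content


def recognize_formula(region: Region) -> str:
    # Placeholder for a formula recognizer such as LaTeX-OCR;
    # the result is wrapped as inline LaTeX math.
    return f"${region.content}$"


# Dispatch table: one recognizer per content type (assumed types).
RECOGNIZERS: Dict[str, Callable[[Region], str]] = {
    "text": recognize_text,
    "formula": recognize_formula,
}


def process_page(regions: List[Region]) -> List[str]:
    """Apply the recognizer that matches each region's type."""
    results = []
    for region in regions:
        recognizer = RECOGNIZERS.get(region.kind)
        if recognizer is None:
            # Content types without a recognizer are flagged for manual work.
            results.append(f"[unrecognized {region.kind} block]")
        else:
            results.append(recognizer(region))
    return results


page = [
    Region("text", "The area of a circle is"),
    Region("formula", r"A = \pi r^2"),
    Region("music", "staff 1"),
]
output = process_page(page)
```

A dispatch table keeps the layout stage independent of the recognizers, so support for a further content type would amount to registering one more function under its type name.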