Heterogeneous document processing: case study of mathematical texts
SM ISO690:2012
BURTSEVA, Lyudmila, COJOCARU, Svetlana, MALAHOV, Ludmila, COLESNICOV, Alexandru. Heterogeneous document processing: case study of mathematical texts. In: Mathematics and Information Technologies: Research and Education, Ed. 2021, 1-3 July 2021, Chişinău. Chișinău, Republic of Moldova: 2021, pp. 96-97.
Mathematics and Information Technologies: Research and Education 2021
Conference "Mathematics and Information Technologies: Research and Education"
2021, Chişinău, Moldova, 1-3 July 2021



Pages 96-97

Burtseva Lyudmila, Cojocaru Svetlana, Malahov Ludmila, Colesnicov Alexandru
 
Vladimir Andrunachievici Institute of Mathematics and Computer Science
 
 
Available in IBN: 1 July 2021


Abstract

Most historical documents are heterogeneous in character: along with text, they contain elements of another nature. The aim of our research is to create a web-based platform that processes such documents and produces representations of the non-textual elements in scripting languages. The general problem is to recognize the document layout and then to apply a recognizer suited to each type of content [1]. One of the subproblems is mathematical formula recognition. Text OCR systems cannot recognize formulas adequately. The republishing of a monograph [2] showed that mathematical formulas make up half of the book. Formulas were included in the reissued version manually, which consumed much more time and effort than the text processing. Since the introduction of deep learning techniques, significant progress in formula recognition has been achieved. Modern systems recognize rather complex formulas, both printed and handwritten. Open source systems that solve this problem include, for example, im2latex, image2latex, and LaTeX-OCR. We tested them and selected LaTeX-OCR. LaTeX-OCR is written in Python and comes with clear installation and usage instructions. It is supplied with a dataset that contains about 200,000 items and covers all base LaTeX macros. This dataset can be supplemented with sample images from the documents that the user intends to recognize. We also tested several commercial systems: SESHAT, INFTY, Mathpix, and MT-Recognition. The best results were demonstrated by Mathpix. It recognizes complex formulas and texts in many languages. When tested on a page from [2], Mathpix made only two errors, both because of image quality. Commercial systems offer only limited free access. For example, Mathpix performs free recognition of no more than 50 images per month.
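
The two-stage approach described in the abstract (layout recognition first, then a recognizer chosen per content type) can be illustrated with a minimal Python sketch. This is not the authors' platform: the `Region` type, the recognizer functions, and the `RECOGNIZERS` table are hypothetical names introduced here for illustration, and the recognizers are stubs standing in for a real text OCR engine and a formula recognizer such as LaTeX-OCR.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Region:
    """A page region produced by layout analysis (hypothetical type)."""
    kind: str      # content type label, e.g. "text" or "formula"
    content: str   # stand-in for the cropped image data of the region


def recognize_text(region: Region) -> str:
    # Stub: a real implementation would call a text OCR engine here.
    return region.content


def recognize_formula(region: Region) -> str:
    # Stub: a real implementation would call a formula recognizer
    # (for example LaTeX-OCR) and return its LaTeX output.
    return f"${region.content}$"


# Assumed dispatch table: one recognizer per content type.
RECOGNIZERS: Dict[str, Callable[[Region], str]] = {
    "text": recognize_text,
    "formula": recognize_formula,
}


def process_page(regions: List[Region]) -> List[str]:
    """Apply the recognizer registered for each region's content type."""
    results = []
    for region in regions:
        recognizer = RECOGNIZERS.get(region.kind)
        if recognizer is None:
            # Unknown content types are flagged rather than guessed at.
            results.append(f"[unrecognized {region.kind} region]")
        else:
            results.append(recognizer(region))
    return results
```

Keeping the recognizers in a table means that support for another kind of non-textual element could be added by registering one more function, without changing `process_page`.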