SM ISO690:2012: BURTSEVA, Lyudmila, COJOCARU, Svetlana, MALAHOV, Ludmila, COLESNICOV, Alexandru. Heterogeneous document processing: case study of mathematical texts. In: Mathematics and Information Technologies: Research and Education, Ed. 2021, 1-3 July 2021, Chișinău. Chișinău, Republic of Moldova: 2021, pp. 96-97.
Mathematics and Information Technologies: Research and Education 2021
Conference "Mathematics and Information Technologies: Research and Education" 2021, Chișinău, Moldova, 1-3 July 2021
Pp. 96-97
Abstract
Most historical documents are heterogeneous: along with text, they contain elements of another nature. The aim of our research is to create a web-based platform that processes such documents and produces a representation of the nontextual elements in scripting languages. The general problem is to recognize the document layout and then to apply recognition to each type of content [1]. One of the subproblems is mathematical formula recognition. Text OCR systems cannot adequately recognize formulas. The republishing of a monograph [2] showed that mathematical formulas make up half of the book. The formulas were included in the reissued version manually, which consumed much more time and effort than the text processing. Since the introduction of deep learning techniques, significant progress has been achieved in formula recognition. Modern systems recognize rather complex formulas, both printed and handwritten. Open-source systems that solve this problem include, for example, im2latex, image2latex, and LaTeX-OCR. We tested them and selected LaTeX-OCR. It is written in Python and comes with clear installation and usage instructions. It is supplied with a dataset that contains about 200,000 items and covers all base LaTeX macros. This dataset can be supplemented with sample images from the documents that the user intends to recognize. We also tested several commercial systems: SESHAT, INFTY, Mathpix, and MT-Recognition. The best results were demonstrated by Mathpix, which recognizes complex formulas and texts in many languages. When tested on a page from [2], Mathpix made only two errors, both due to image quality. Commercial systems offer only limited free access. For example, Mathpix performs free recognition of no more than 50 images per month.
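The two-stage approach described in the abstract (layout recognition first, then a recognizer per content type) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Region` structure, the `RECOGNIZERS` registry, and the stub recognizer functions are hypothetical and are not taken from the authors' platform. In a real system the stubs would call a text OCR engine and a formula recognizer such as LaTeX-OCR.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Region:
    """One block found by layout analysis (hypothetical structure)."""
    kind: str      # content type, e.g. "text" or "formula"
    content: str   # stand-in for the cropped image of the block


# Stub recognizers (assumptions): a real system would run a text OCR
# engine and a formula recognizer on the region's image instead.
def recognize_text(region: Region) -> str:
    return region.content


def recognize_formula(region: Region) -> str:
    # Wrap the recognized formula in LaTeX inline-math delimiters.
    return f"${region.content}$"


# Registry mapping each content type to its recognizer.
RECOGNIZERS: Dict[str, Callable[[Region], str]] = {
    "text": recognize_text,
    "formula": recognize_formula,
}


def process_page(regions: List[Region]) -> List[str]:
    """Apply the recognizer registered for each region's content type."""
    results = []
    for region in regions:
        recognizer = RECOGNIZERS.get(region.kind)
        if recognizer is None:
            # Content types without a recognizer are flagged for manual work.
            results.append(f"[unrecognized {region.kind}]")
        else:
            results.append(recognizer(region))
    return results


page = [
    Region("text", "Consider the equation"),
    Region("formula", r"x^2 + y^2 = 1"),
    Region("other", "block-3"),
]
print(process_page(page))
```

Keeping the recognizers in a registry means that support for a further nontextual content type can be added by registering one more function, without changing the dispatch loop.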