Platform for Digitization of Heterogeneous Documents

Bumbu Tudor; Burtseva Lyudmila; Cojocaru Svetlana; Colesnicov Alexandru; Malahov Ludmila


SM ISO690:2012

BUMBU, Tudor, BURTSEVA, Lyudmila, COJOCARU, Svetlana, COLESNICOV, Alexandru, MALAHOV, Ludmila. Platform for Digitization of Heterogeneous Documents. In: Conference on Applied and Industrial Mathematics: CAIM 2022, Ed. 29, 25-27 august 2022, Chişinău. Chișinău, Republica Moldova: Casa Editorial-Poligrafică „Bons Offices”, 2022, Ediţia a 29, pp. 170-171. ISBN 978-9975-81-074-6.


Conference on Applied and Industrial Mathematics (CAIM 2022)
29th edition, Chişinău, Moldova, 25-27 August 2022

Pages 170-171

Bumbu Tudor, Burtseva Lyudmila, Cojocaru Svetlana, Colesnicov Alexandru, Malahov Ludmila

Vladimir Andrunachievici Institute of Mathematics and Computer Science

Available in IBN: 21 December 2022


Abstract

The digitization platform is a web/desktop application written in Python and JavaScript. It integrates the processing stages of heterogeneous documents into a single digitization cycle with the following main steps: uploading images and/or PDF files; image preprocessing; optical character recognition (OCR); checking and editing the recognized text; transliterating the checked text; checking and editing the transliterated text; and, finally, saving the results to the database and/or downloading them. These steps, in turn, branch into sub-steps, which we detail below.

Two technical peculiarities of the implementation are worth mentioning:
- some data operations, e.g. some image processing methods, are available as JavaScript libraries and are therefore executed in the front-end;
- some services, mainly heterogeneous content recognition, are called over the network via their own APIs, which essentially means that there are multiple back-ends.

Most steps have submenus through which the user can choose one of the proposed options (tools). For example, the initial image can be preprocessed using FineReader, OpenCV, ScanTailor or GIMP. One of the most important stages is OCR. It is applied to the preprocessed document and starts by classifying the heterogeneous content and splitting the document into homogeneous components. For the time being, the following component types are supported: image, text, musical notes, mathematical formulas, chemical formulas and structures, and chess diagrams.

A user-friendly, dialog-style interface has been developed, which works via an API on the user side (front-end) and on the server side (back-end). It lets the user upload the image file, identify the heterogeneous content, split it into fragments, analyse and recognise it, and view the resulting file, with the option of saving it.
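The dispatch of homogeneous fragments to type-specific recognizers can be illustrated with a minimal Python sketch. It is not the platform's code: the names `FragmentType`, `Fragment`, `RECOGNIZERS`, `make_stub` and `recognize` are hypothetical, and the recognizers are local stubs standing in for the separate network services mentioned in the abstract. Only the six fragment types are taken from the text.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, List


class FragmentType(Enum):
    """Component types listed in the abstract."""
    IMAGE = "image"
    TEXT = "text"
    MUSIC = "musical notes"
    MATH = "mathematical formulas"
    CHEMISTRY = "chemical formulas and structures"
    CHESS = "chess diagrams"


@dataclass
class Fragment:
    kind: FragmentType
    payload: bytes        # cropped image data of one homogeneous region
    recognized: str = ""  # filled in by the matching recognizer


Recognizer = Callable[[bytes], str]


def make_stub(label: str) -> Recognizer:
    """Build a placeholder recognizer (assumption: a real one would call a
    remote back-end through its own API)."""
    return lambda payload: f"<{label}: {len(payload)} bytes>"


# Hypothetical registry: one recognizer per fragment type.
RECOGNIZERS: Dict[FragmentType, Recognizer] = {
    kind: make_stub(kind.value) for kind in FragmentType
}


def recognize(fragments: List[Fragment]) -> List[Fragment]:
    """Send each fragment to the recognizer registered for its type."""
    for fragment in fragments:
        fragment.recognized = RECOGNIZERS[fragment.kind](fragment.payload)
    return fragments


if __name__ == "__main__":
    page = [
        Fragment(FragmentType.TEXT, b"abc"),
        Fragment(FragmentType.CHESS, b"12345"),
    ]
    for fragment in recognize(page):
        print(fragment.kind.name, fragment.recognized)
```

Keeping the recognizers in a registry keyed by fragment type means that, under these assumptions, supporting a new content type only requires adding one enum member and one registry entry.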