The Method for Reducing the Term Vector Size for Category Classification of Text Document

Голуб Татьяна; Тягунова Мария

SM ISO690:2012

ГОЛУБ, Татьяна, ТЯГУНОВА, Мария. Метод уменьшения размера вектора термов для классификации текстовых документов по категориям. In: Problemele Energeticii Regionale, 2019, nr. 1-2(41 S), pp. 84-94. ISSN 1857-0070. DOI: https://doi.org/10.5281/zenodo.3240216

Problemele Energeticii Regionale

Issue 1-2(41 S) / 2019 / ISSN 1857-0070

The Method for Reducing the Term Vector Size for Category Classification of Text Document

Metoda de reducere a dimensiunii vectorului termenilor de clasificare a documentelor text pe categorii

Метод уменьшения размера вектора термов для классификации текстовых документов по категориям

DOI:https://doi.org/10.5281/zenodo.3240216

UDC: 004.912

Pages 84-94

Голуб Татьяна, Тягунова Мария

Запорожский национальный технический университет

Available in IBN: 25 November 2019

Abstract

The aim of the study was to develop a method that reduces the time needed to assess whether a document belongs to a given category when classifying text documents. This is achieved by reducing the size of the term vector of each category. During the training stage, term weights are calculated separately for each classification category. The resulting weights are then analyzed, and every term whose weight in a category does not exceed an experimentally determined threshold is excluded from that category's term vector by setting its weight to zero. Such terms take no part in assessing category membership at the testing stage. The experiments used the TF-SLF reference method for term weighting, described in the literature, and CTFSLF, the per-category modification of it proposed by the authors as described above. Applying the proposed method reduced the size of the vector of characteristic terms for each category. Although compiling the per-category term vectors takes longer, this is done only once, and the time needed to determine whether a document belongs to a specific category decreased without loss of classification quality. In addition, because the proposed method excludes words that occur frequently in texts, the stop-word removal stage can be dropped from document preprocessing. For the same reason, the problem of misprints and of words "stuck together" in the initial training sample is resolved. The goal set at the start of the study can therefore be considered achieved.
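
The thresholding step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the TF-SLF and CTFSLF weighting formulas are not given here, so the per-category weights below are made-up numbers, and the function names (`prune_category_vectors`, `score`, `classify`), the threshold of 0.2, and the scoring rule (sum of weight times term count) are all assumptions chosen for illustration.

```python
from collections import Counter


def prune_category_vectors(weights, threshold):
    """Keep only terms whose weight exceeds the threshold.

    weights: {category: {term: weight}}, computed once at training time.
    Leaving a term out of the dict has the same effect as setting its
    weight to zero: it contributes nothing to the score.
    """
    return {
        category: {t: w for t, w in terms.items() if w > threshold}
        for category, terms in weights.items()
    }


def score(document_tokens, category_vector):
    """Assumed scoring rule: sum of weight * count over surviving terms."""
    counts = Counter(document_tokens)
    return sum(w * counts[t] for t, w in category_vector.items() if t in counts)


def classify(document_tokens, pruned):
    """Return the category with the highest score for the document."""
    return max(pruned, key=lambda c: score(document_tokens, pruned[c]))


# Hypothetical weights: "the" stands in for a frequent word and
# "turbinegrid" for two words stuck together; both are assumed to
# receive low weights and therefore fall below the threshold.
weights = {
    "energy": {"turbine": 0.9, "grid": 0.7, "the": 0.05, "turbinegrid": 0.01},
    "software": {"compiler": 0.8, "thread": 0.6, "the": 0.04, "grid": 0.1},
}

pruned = prune_category_vectors(weights, threshold=0.2)
sizes = {category: len(terms) for category, terms in pruned.items()}
label = classify("the turbine feeds the grid".split(), pruned)
```

In this toy data each category vector shrinks from four terms to two, so scoring a document touches fewer entries per category, which is the source of the time saving the abstract reports. Pruning runs once at training time, while scoring runs for every document.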

Scopul studiului prezentat în articol a fost de a dezvolta o metodă de reducere a timpului necesar evaluării proprietăților unui document pentru anumite categorii, pentru a clasifica documentele text. Acest obiectiv este realizat prin reducerea dimensiunii vectorului termenilor anumitor categorii. Pentru a implementa procesul de determinare a proprietății unui document dintr-o anumită categorie în stadiul de pregătire a sistemului, conform metodei propuse, ponderile termenilor se calculează separat pentru fiecare categorie de clasificare. Ca rezultat al analizei datelor obținute, termenii categoriilor individuale, ale căror valori ale greutății nu depășesc valoarea pragului determinat experimental, sunt excluse din vectorul termenilor unei anumite categorii prin egalarea lor cu zero. Acești termeni nu sunt implicați în procesul ulterior de evaluare a proprietății unei anumite categorii de documente în etapa de testare. Metoda de referință pentru determinarea ponderii termenilor TF-SLF, descrisă în literatură, și modernizarea acesteia pe categorii în conformitate cu descrierea de mai sus a CTFSLF, a fost utilizată ca date inițiale pentru partea experimentală. Ca rezultat al aplicării metodei propuse, mărimea vectorului termenilor caracteristici pentru fiecare categorie s-a redus, iar în consecință, în ciuda creșterii timpului de compilare a vectorului termenilor în categorii, care este efectuată o dată, timpul de efectuare a calculelor pentru a determina dacă un document aparține unei categorii specifice fără a pierde clasificarea calității de asemenea s-a redus. De asemenea, datorită faptului că metoda propusă exclude termenii frecvent utilizați în texte, devine posibilă excluderea etapei de eliminare a cuvintelor stop din textul analizat din procesul de preprocesare a documentelor

Целью исследования, представленного в статье, была разработка метода для уменьшения времени, затрачиваемого на процесс оценки принадлежности документа отдельным категориям, с целью классификации текстовых документов. Данная цель достигается путем уменьшения размера вектора термов отдельных категорий. Для реализации процесса определения принадлежности документа отдельной категории на этапе обучения системы, согласно предложенному методу, выполняется расчет весовых коэффициентов термов для каждой категории классификации в отдельности. В результате анализа полученных данных термы отдельных категорий, весовые значения которых не превышают экспериментально определенное пороговое значение, исключаются из вектора термов отдельной категории путем приравнивания их к нулю. Данные термы не участвуют в дальнейшем процессе оценки принадлежности документа отдельной категории на этапе тестирования. В качестве исходных данных для проведения экспериментальной части были использованы опорный метод определения весовых значений термов TF-SLF, описанный в литературе, и предложенная авторами его модернизация по категориям согласно приведенному выше описанию CTFSLF. В результате применения предложенного метода уменьшился размер вектора характерных термов для каждой категории, вследствие чего, не смотря на увеличение времени на составление вектора термов по категориям, которое выполняется один раз, уменьшилось время на выполнение расчетов для определения принадлежности документа конкретной категории без потери качества классификации. Также, в связи с тем, что предложенный метод исключает часто используемые в текстах слова, из процесса предварительной обработки документа становится возможным исключить этап удаления стоп-слов из анализируемого текста. По этой же причине решается проблема опечаток и «слипшихся» слов в исходной, обучающей выборке. Таким образом, поставленную в начале цель исследования можно считать достигнутой.

Keywords
text classification, stemming, term vector, term weight, TF-SLF

clasificare text, derivare, vector termen, pondere termen, TF-SLF

классификация текстов, стемминг, вектор термов, вес терма, TF-SLF