Wiki-Translator: Multilingual Experiments for In-Domain Translations
SM ISO690:2012
TUFIŞ, Dan, ION, Radu, DUMITRESCU, Stefan Daniel. Wiki-Translator: Multilingual Experiments for In-Domain Translations. In: Computer Science Journal of Moldova, 2013, nr. 3(63), pp. 332-359. ISSN 1561-4042.
Computer Science Journal of Moldova
Issue 3(63) / 2013 / ISSN 1561-4042 / ISSNe 2587-4330

CZU: 004.9:81'246.3

Pp. 332-359

Tufiş Dan, Ion Radu, Dumitrescu Stefan Daniel

Institute for Artificial Intelligence, Romanian Academy

Available in IBN: 10 December 2013


Abstract

The benefits of using comparable corpora to improve translation quality in statistical machine translation (SMT) have already been demonstrated by various researchers. The usual approach starts with a baseline system trained on out-of-domain parallel corpora, followed by its adaptation to the domain in which new translations are needed. The adaptation to a new domain, especially a narrow one, is based on data extracted from comparable corpora in the new domain or in one as close to it as possible. This article reports on a slightly different approach: building an SMT system entirely from comparable data for the domain of interest. Certainly, the approach is feasible only if the comparable corpora are large enough for SMT-useful data to be extracted in sufficient quantities for reliable training; the larger the comparable corpora, the better the results. Wikipedia is definitely a very good candidate for such an experiment. We report on large-scale experiments showing significant improvements over a baseline system built from highly similar (almost parallel) text fragments extracted from Wikipedia. The improvements, which are statistically significant, are related to what we call the level of translational similarity between extracted pairs of sentences. The experiments were performed for three language pairs: Spanish-English, German-English and Romanian-English, based on sentence pairs extracted from the entire dumps of Wikipedia as of December 2012. Our experiments and comparison with similar work show that indiscriminately adding more data to a training corpus is not necessarily a good thing in SMT.
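The selection step described above, keeping only sentence pairs above a translational-similarity level, can be sketched roughly as follows. This is a hypothetical illustration, not the authors' actual extraction pipeline: the `filter_by_similarity` function, the toy candidate pairs, and the threshold value are all assumptions, and the paper's own similarity measure is not reproduced here.

```python
def filter_by_similarity(sentence_pairs, threshold):
    """Keep only sentence pairs whose similarity score meets the threshold.

    sentence_pairs: list of (source, target, score) tuples, where score is
    some translational-similarity estimate in [0, 1] (hypothetical here).
    Returns (source, target) pairs suitable for SMT training data.
    """
    return [(src, tgt) for src, tgt, score in sentence_pairs
            if score >= threshold]


# Toy candidates mined from comparable (German-English) article pairs;
# the scores are invented for illustration.
candidates = [
    ("Das ist ein Haus.", "This is a house.", 0.95),
    ("Er ging gestern.", "The weather is nice.", 0.20),
    ("Sie liest ein Buch.", "She is reading a book.", 0.88),
]

training_pairs = filter_by_similarity(candidates, threshold=0.7)
# Only the two near-parallel pairs survive the filter.
```

Raising the threshold trades training-corpus size for pair quality, which mirrors the paper's finding that indiscriminately adding more data does not necessarily help.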

Keywords
comparable corpora, extraction of parallel sentences, language model, statistical machine translation, translation models