Wiki-Translator: Multilingual Experiments for In-Domain Translations
SM ISO690:2012
TUFIŞ, Dan, ION, Radu, DUMITRESCU, Stefan Daniel. Wiki-Translator: Multilingual Experiments for In-Domain Translations. In: Computer Science Journal of Moldova, 2013, nr. 3(63), pp. 332-359. ISSN 1561-4042.
Computer Science Journal of Moldova
Issue 3(63) / 2013 / ISSN 1561-4042 / ISSNe 2587-4330

CZU: 004.9:81'246.3

Pp. 332-359

Tufiş Dan, Ion Radu, Dumitrescu Stefan Daniel

Institute for Artificial Intelligence, Romanian Academy

Available in IBN: 10 December 2013


Abstract

The benefits of using comparable corpora to improve translation quality in statistical machine translation (SMT) have already been demonstrated by various researchers. The usual approach starts with a baseline system trained on out-of-domain parallel corpora, followed by its adaptation to the domain in which new translations are needed. The adaptation to a new domain, especially a narrow one, is based on data extracted from comparable corpora in the new domain or in one as close to it as possible. This article reports on a slightly different approach: building an SMT system entirely from comparable data for the domain of interest. Certainly, the approach is feasible only if the comparable corpora are large enough for SMT-useful data to be extracted in sufficient quantities for reliable training; the larger the comparable corpora, the better the results. Wikipedia is definitely a very good candidate for such an experiment. We report on large-scale experiments showing significant improvements over a baseline system built from highly similar (almost parallel) text fragments extracted from Wikipedia. The improvements, which are statistically significant, are related to what we call the level of translational similarity between extracted pairs of sentences. The experiments were performed for three language pairs: Spanish-English, German-English and Romanian-English, based on sentence pairs extracted from the entire dumps of Wikipedia as of December 2012. Our experiments and comparison with similar work show that indiscriminately adding more data to a training corpus is not necessarily a good thing in SMT.
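The selection step described above, keeping only sentence pairs above a translational-similarity level, can be sketched roughly as follows. This is a hypothetical illustration, not the authors' actual extraction pipeline: the `filter_by_similarity` function, the toy candidate pairs, and the threshold value are all assumptions, and the paper's own similarity measure is not reproduced here.

```python
def filter_by_similarity(sentence_pairs, threshold):
    """Keep only sentence pairs whose similarity score meets the threshold.

    sentence_pairs: list of (source, target, score) tuples, where score is
    some translational-similarity estimate in [0, 1] (hypothetical here).
    Returns (source, target) pairs suitable for SMT training data.
    """
    return [(src, tgt) for src, tgt, score in sentence_pairs
            if score >= threshold]


# Toy candidates mined from comparable (German-English) article pairs;
# the scores are invented for illustration.
candidates = [
    ("Das ist ein Haus.", "This is a house.", 0.95),
    ("Er ging gestern.", "The weather is nice.", 0.20),
    ("Sie liest ein Buch.", "She is reading a book.", 0.88),
]

training_pairs = filter_by_similarity(candidates, threshold=0.7)
# Only the two near-parallel pairs survive the filter.
```

Raising the threshold trades training-corpus size for pair quality, which mirrors the paper's finding that indiscriminately adding more data does not necessarily help.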

Keywords
comparable corpora, extraction of parallel sentences, language model, statistical machine translation, translation models