Abstract

We explore efficient domain adaptation for statistical machine translation (SMT) based on extracting the sentences from a large general-domain parallel corpus that are most relevant to the target domain. These sentences can be selected with simple cross-entropy based methods, of which we present three. As the selected sentences are not themselves identical to the in-domain data, we call them pseudo in-domain subcorpora. These subcorpora, only 1% the size of the original corpus, can then be used to train small domain-adapted SMT systems that outperform systems trained on the entire corpus. Performance improves further when these domain-adapted models are used in combination with a true in-domain model. The results show that more training data is not always better, and that the best results are attained via proper domain-relevant data selection, as well as by combining in- and general-domain systems during decoding.
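As an illustration of the kind of cross-entropy based selection the abstract describes, here is a minimal Python sketch in the style of Moore-Lewis cross-entropy difference: each general-domain sentence is scored by H_in(s) - H_gen(s), the difference in per-word cross-entropy under an in-domain and a general-domain language model, and the lowest-scoring (most in-domain-like) sentences are kept. The unigram model with add-one smoothing and all function names here are illustrative simplifications, not the paper's implementation, which uses full n-gram language models.

```python
import math
from collections import Counter

def train_unigram_lm(sentences):
    """Build a toy unigram LM with add-one smoothing (a stand-in
    for the n-gram LMs used in practice)."""
    counts = Counter(tok for s in sentences for tok in s.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 reserves mass for unseen tokens
    return counts, total, vocab

def cross_entropy(sentence, lm):
    """Per-word cross-entropy of a sentence under the LM, in bits."""
    counts, total, vocab = lm
    tokens = sentence.split()
    logp = sum(math.log2((counts[t] + 1) / (total + vocab)) for t in tokens)
    return -logp / max(len(tokens), 1)

def cross_entropy_difference_select(general_corpus, in_domain_corpus,
                                    keep_fraction=0.01):
    """Rank general-domain sentences by H_in(s) - H_gen(s); lower
    scores are more in-domain-like. Keep the top keep_fraction."""
    lm_in = train_unigram_lm(in_domain_corpus)
    lm_gen = train_unigram_lm(general_corpus)
    scored = sorted(
        general_corpus,
        key=lambda s: cross_entropy(s, lm_in) - cross_entropy(s, lm_gen),
    )
    k = max(1, int(len(scored) * keep_fraction))
    return scored[:k]
```

Subtracting the general-domain cross-entropy, rather than ranking by in-domain cross-entropy alone, penalizes sentences that are merely common everywhere and favors those that are distinctively in-domain, yielding the pseudo in-domain subcorpus used for training.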

Keywords

Computer science, Machine translation, Domain adaptation, Artificial intelligence, Natural language processing, Decoding methods, Machine learning, Pattern recognition, Algorithm, Mathematics

Publication Info

Year: 2011
Type: Article
Pages: 355-362
Citations: 492 (OpenAlex)
Access: Closed

Cite This

Amittai Axelrod, Xiaodong He, Jianfeng Gao (2011). Domain Adaptation via Pseudo In-Domain Data Selection. Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing (EMNLP), 355-362.