We present an extensive experimental study of Phrase-based Statistical Machine Translation, from the point of view of its learning capabilities. Very accurate learning curves are obtained, using high-performance computing, and extrapolations of the projected performance of the system under different conditions are provided. Our experiments confirm existing and mostly unpublished beliefs about the learning capabilities of statistical machine translation systems. We also provide insight into the way statistical machine translation learns from data, including the respective influence of translation and language models, the impact of phrase length on performance, and various unlearning and perturbation analyses. Our results support and illustrate the fact that performance improves by a constant amount for each doubling of the training data, across different language pairs and different systems. This fundamental limitation seems to be a direct consequence of Zipf's law governing textual data. Although the rate of improvement may depend on both the data and the estimation method, it is unlikely that the general shape of the learning curve will change without major changes in the modeling and inference phases. Possible research directions that address this issue include the integration of linguistic rules and the development of active learning procedures.
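A gain that is constant per doubling of the data corresponds to a learning curve that is linear in the logarithm of the corpus size. As a minimal sketch of how such a curve can be fitted and extrapolated (the corpus sizes and BLEU scores below are hypothetical placeholders, not measurements from our experiments):

# Sketch: fit a log-linear learning curve BLEU(n) = a + b * log2(n)
# and extrapolate it to larger corpora. All (n, BLEU) pairs are
# hypothetical placeholders, not results reported in this paper.
import numpy as np

sizes = np.array([10_000, 20_000, 40_000, 80_000, 160_000])  # parallel sentence pairs
bleu  = np.array([18.1, 20.0, 21.8, 23.9, 25.7])             # illustrative scores

b, a = np.polyfit(np.log2(sizes), bleu, 1)   # slope b = BLEU gained per doubling
print(f"gain per doubling of the data: {b:.2f} BLEU")
for n in (320_000, 640_000, 1_280_000):
    print(f"projected BLEU at {n:>9,} pairs: {a + b * np.log2(n):.1f}")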
Traditional approaches to machine translation (MT) relied to a large extent on linguistic analysis. The (relatively) recent development of statistical approaches, and especially of phrase-based machine translation (PBMT), has put the focus on the intensive use of large parallel corpora. In that statistical framework, translation is essentially viewed as the process of associating an input, the source sentence, with an output, the target sentence. Estimating a machine translation system is therefore similar to learning the mapping between the source/input and the target/output, a problem which has been extensively studied in statistics and in machine learning.
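In the classical source-channel formulation of this mapping (a standard textbook decomposition, stated here only for orientation, not as the specific model studied below), the best translation ê of a source sentence f is

    ê = argmax_e Pr(e | f) = argmax_e Pr(f | e) · Pr(e),

where Pr(f | e) is a translation model estimated from the parallel corpus and Pr(e) is a language model estimated from monolingual data; phrase-based systems implement the translation model through tables of phrase pairs.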
A learning system typically considers a class of models, or hypotheses, and tries to find the one element in that class that provides the best prediction of the output on future, unseen input examples. This justifies our view of a typical phrase-based machine translation model as a learning system and motivates our analysis of the performance of that system. The performance of every learning system is the result of (at least) two combined effects: the representation power of the hypothesis class, determining how well the system can approximate the target behaviour; and statistical effects, determining how well the system can approximate the best element of the hypothesis class, based on finite and noisy training information. The two effects interact, with richer classes being better approximators of the target behaviour but requiring more training data to reliably identify the best hypothesis. The resulting trade-off, equally well known in statistics and in machine learning, can be expressed in terms of bias versus variance, capacity control, or model selection. Various theories of learning curves have been proposed to deal with it, where a learning curve is a plot describing the performance of a learning system as a function of some parameter, typically the training set size. In practice this trade-off is easily observed, by noticing how the training error can be driven to zero by using a rich hypothesis class, which typically results in overfitting and an increased test error.
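The following toy regression (an illustrative sketch, unrelated to the machine translation experiments reported here) makes the trade-off concrete: as the degree of a polynomial hypothesis class grows, the training error shrinks towards zero while the test error eventually rises.

# Toy bias-variance illustration: polynomial regression of increasing
# degree on noisy samples of a fixed target function. Rich classes fit
# the training data almost perfectly but generalise worse (overfitting).
import numpy as np

rng = np.random.default_rng(0)

def target(x):                         # the target behaviour to approximate
    return np.sin(3 * x)

x_train = np.sort(rng.uniform(-1, 1, 20))
y_train = target(x_train) + rng.normal(0.0, 0.2, x_train.size)  # noisy labels
x_test  = np.linspace(-1, 1, 200)
y_test  = target(x_test)

for degree in (1, 3, 5, 9, 15):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse  = np.mean((np.polyval(coeffs, x_test)  - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")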