eTranslation (https://language-tools.ec.europa.eu) is the European Commission’s machine translation (MT) service, a flagship AI project by DG Translation in partnership with DG CONNECT under the Digital Europe programme.
eTranslation helps European and national public administrations as well as SMEs and NGOs exchange information across language barriers in the EU. It provides secure access to neural machine translation between all 26 official languages of the EU and the EEA. The service leverages the European Institutions’ high-quality internal translation data (the Euramis translation memories comprising over 1 billion sentences in the 24 official EU languages) to provide specialised models for EU formal language. The service also combines that data with additional large volumes of translation data from external sources such as the European Language Resource Coordination, Paracrawl and Opus to offer general domain MT models in 32 languages. With over 200 million pages translated yearly, it is one of the most used IT services of the Commission, both internally and by outside users.
A machine translation service of this scale requires substantial computational power and a continuous search for the right balance between resource use and model performance. Recent developments in neural models for natural language processing show a clear trend towards larger models, since more complex architectures generally perform better. Machine translation models are no exception: deeper architectures make much better use of the information in the training data. However, the computational infrastructure needed to run such large models and model ensembles is more than most MT service providers can normally afford.
The project addresses this resource-performance dilemma through knowledge distillation (KD), a method that targets the model capacity problem. KD aims to transfer the knowledge of a high-capacity but computationally expensive teacher model to a smaller, more efficient student model with the least possible loss of performance. In machine translation, KD uses the output of a complex teacher model (or model ensemble) to train a cost-effective, fast, production-ready student model.
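As an illustration of the general idea only, the following minimal sketch shows one common form of knowledge distillation, in which the student is scored against the averaged "soft" output distribution of a teacher ensemble. It is not a description of the project's actual training setup: the toy three-word vocabulary, the logit values, the temperature of 2.0 and all function names are assumptions made for this example.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; a higher temperature flattens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def ensemble_average(distributions):
    """Average the per-word distributions of several teacher models."""
    n = len(distributions)
    return [sum(column) / n for column in zip(*distributions)]

def distillation_loss(teacher_probs, student_probs):
    """Cross-entropy of the student against the teacher's soft targets."""
    return -sum(t * math.log(s)
                for t, s in zip(teacher_probs, student_probs) if t > 0)

# Hypothetical logits over a toy 3-word vocabulary for one target position.
teacher_a = softmax([4.0, 1.0, 0.5], temperature=2.0)
teacher_b = softmax([3.5, 1.5, 0.2], temperature=2.0)
teacher = ensemble_average([teacher_a, teacher_b])

# Two hypothetical students: one close to the teachers, one far from them.
good_student = softmax([3.8, 1.2, 0.4], temperature=2.0)
poor_student = softmax([0.5, 1.0, 4.0], temperature=2.0)

loss_good = distillation_loss(teacher, good_student)
loss_poor = distillation_loss(teacher, poor_student)
print(f"close student: {loss_good:.3f}, distant student: {loss_poor:.3f}")
```

The student whose predictions resemble those of the teacher ensemble receives the lower loss, which is the signal a training procedure would minimize. A real system would compute this over full sentences and large vocabularies, or would instead train the student directly on translations generated by the teacher.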
KD experiments in machine translation have so far been conducted with moderate amounts of training data. With tens of millions of high-quality segments from the European Institutions’ Euramis translation memories, we are training very strong teacher models and teacher ensembles. We aim to investigate how the approach scales up to produce competitive student models in both bilingual and multilingual scenarios. These student models can later be deployed to maximize speed and efficiency while minimizing the supporting infrastructure.
Models for six language pairs are being built. They provide the basis for a potential subsequent scale-up to cover all official EU and EEA languages, or all languages covered by eTranslation.
Substantial benefits are expected from the project, both for research and for the general user community. The experiments will offer valuable insights into how knowledge distillation methods in machine translation scale up with high-quality, high-resource data sets in both multilingual and bilingual settings. A number of teacher models will be released to give the machine translation community a set of strong models on which to build high-quality MT services in the EU formal language domain. This will also be a contribution to the new European Language Data Space under the Digital Europe programme.
In addition, the resulting student models are expected to significantly improve the eTranslation service, allowing it to perform on par with or, in selected domains, better than the best commercial machine translation systems. This will be of direct and immediate benefit to all EU institutions, bodies and agencies in their everyday work, and also to the public administrations, SMEs, NGOs and universities that use eTranslation throughout Europe.
Multilingualism is one of the essential principles of the European Union, and the development and optimization of the eTranslation models and service through the work described will be a direct contribution to supporting and improving multilingualism and democratic processes in Europe.
Jörgen Gren, Director for Resources at DG Translation, in charge of machine translation and other activities in the area of artificial intelligence