Open-source Neural Machine Translation (NMT) Tools and Frameworks

Date: 17 December 2024

Here is a list of 10 open-source NMT tools and frameworks that can be used for self-study, teaching, and general translation fun.

1. OpenNMT

OpenNMT is a powerful open-source neural machine translation framework. Originally developed by the Harvard NLP group together with SYSTRAN, OpenNMT has grown into a robust ecosystem for machine translation and other natural language processing tasks. It is one of the most widely used frameworks, and other tools have been built on top of it (see Argos Translate).

Some of the highlights:

Pre-trained models: OpenNMT provides ready-to-use models for a number of language pairs, which can be fine-tuned for domain-specific tasks such as legal, medical, or technical translation.
Flexibility: It supports both recurrent (RNN) and Transformer sequence-to-sequence architectures.
Extensive toolkit: It includes tools for data preprocessing, tokenization, and evaluation.
Multi-platform support: It is available in both PyTorch (OpenNMT-py) and TensorFlow (OpenNMT-tf) versions.

Its ability to handle custom datasets and integrate domain-specific knowledge makes it very interesting for us teachers and researchers (and for developers too).

2. Marian NMT

Marian NMT is a fast and scalable NMT framework developed mainly by the Microsoft Translator team, with contributions from academic partners such as the University of Edinburgh. It is suitable for both research and production-grade translation systems. Marian NMT is implemented in C++, which allows for optimised performance and faster processing compared to many Python-based frameworks. It is also the basis of other tools such as Opus-MT and the Bergamot Project.

Key features include:

Highly efficient training and decoding: It employs techniques such as quantisation and low-precision arithmetic, which reduce memory usage and speed up computations.
Customisability: It allows users to build and fine-tune models for specific domains or languages using custom datasets.
Integration capabilities: Its modular design makes it easy to integrate with external tools and workflows.

For more information, visit: https://marian-nmt.github.io/features/
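
To give a feel for what quantisation means in practice, here is a minimal sketch in plain Python. It is not Marian's implementation (Marian is written in C++ and its internals are not shown here); it only illustrates the general idea of storing weights as 8-bit integers plus one shared scale factor. The function names and sample weights are made up for this example.

```python
# Illustrative only: symmetric 8-bit quantisation of a list of weights.
# This is NOT Marian's code, just the general idea behind the technique.

def quantise_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantised = [round(w / scale) for w in weights]
    return quantised, scale


def dequantise(quantised, scale):
    """Recover approximate float values from the 8-bit integers."""
    return [q * scale for q in quantised]


# Hypothetical weights, chosen only for the demonstration.
weights = [0.5, -1.27, 0.03, 1.0]
q, scale = quantise_int8(weights)
restored = dequantise(q, scale)

# The rounding error per weight is at most half a quantisation step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Each weight now fits in one byte instead of four, at the cost of a small rounding error, which is the trade-off behind the memory and speed gains mentioned above.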

3. Opus-MT

Opus-MT is a collection of neural machine translation models and tools built on top of Marian NMT and maintained by the language technology group at the University of Helsinki. Its models are trained on OPUS, an extensive collection of publicly available parallel corpora.

Key features include:

Pre-trained models: It provides a library of pre-trained models for a wide range of languages, including low-resource languages that are often overlooked by commercial translation systems.
Dataset integration: It supports the seamless integration of parallel corpora from the OPUS collection, which consists of millions of bilingual sentence pairs derived from sources like the European Parliament, OpenSubtitles, and more.
Custom model training: Developers can train their own models using custom datasets.
Focus on multilingualism: The framework is designed to handle diverse language pairs, including many less common combinations.
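
One common way to try the pre-trained models is through the Hugging Face transformers library, which is not mentioned above and is an assumption of this sketch. The models are published there under names following the pattern Helsinki-NLP/opus-mt-SRC-TGT; not every language pair exists, so check the model hub before relying on a given name. The function names are my own.

```python
# Sketch: translating with an OPUS-MT model through Hugging Face transformers.
# Assumes `pip install transformers sentencepiece torch` and network access
# the first time a model is downloaded.

def opus_mt_model_name(src: str, tgt: str) -> str:
    """Build the hub identifier following the Helsinki-NLP naming pattern."""
    return f"Helsinki-NLP/opus-mt-{src}-{tgt}"


def opus_translate(sentences, src="en", tgt="de"):
    """Translate a list of sentences (the model is reloaded on every call)."""
    # Imported here so the helper above works without transformers installed.
    from transformers import MarianMTModel, MarianTokenizer

    name = opus_mt_model_name(src, tgt)
    tokenizer = MarianTokenizer.from_pretrained(name)
    model = MarianMTModel.from_pretrained(name)

    batch = tokenizer(sentences, return_tensors="pt", padding=True)
    generated = model.generate(**batch)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)
```

Calling opus_translate(["Machine translation is fun."]) should return a list with one German sentence. For real use you would load the model once and reuse it rather than reloading it on each call.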

4. Fairseq (Meta)

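Fairseq is an open-source sequence modelling toolkit developed by Meta AI (formerly Facebook AI Research) and built on PyTorch. It is a framework rather than a single model: it is used to train models for translation, summarisation, language modelling, and other text generation tasks. Several large multilingual translation models covering a wide range of languages, including many low-resource ones, have been released with it.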

Technical highlights:

Multilingual capabilities: Multilingual models released with Fairseq, such as M2M-100, were trained on massive datasets covering numerous language pairs, enabling translation across diverse language combinations.
Scalability: The framework is designed to scale to large datasets and high-performance computing environments, making it suitable for production-grade applications.

5. Tensor2Tensor

Tensor2Tensor (T2T), developed by Google Brain, is a versatile deep learning library for sequence-to-sequence tasks such as machine translation, text summarisation, and language modelling. It is built on TensorFlow. Note that Google now describes T2T as deprecated and recommends its successor library, Trax, for new projects, although the code remains available and is still useful for study.

Key aspects include:

Wide range of models: Tensor2Tensor includes implementations of well-known models such as the Transformer.
Modular design: The library is highly modular, allowing users to mix and match models, datasets, and training configurations for maximum flexibility.
Pre-trained models: Tensor2Tensor offers pre-trained models on large-scale datasets, which can be fine-tuned for specific tasks.
TensorFlow integration: Since T2T is built on TensorFlow, it integrates nicely with other TensorFlow-based tools and frameworks.

6. Joey NMT

Joey NMT is a lightweight neural machine translation framework built on PyTorch. It is designed with simplicity and ease of use in mind, so you can expect fewer features than in the frameworks mentioned above.

Key features:

Pre-trained models: It includes pre-trained models for several language pairs, providing a good starting point for experimentation.
Customisability: The framework is highly customisable, allowing users to modify training configurations, architectures, and hyperparameters.
Educational focus: Its well-documented codebase makes it an excellent learning resource for students and researchers who want to better understand the inner workings of neural machine translation.
PyTorch integration: As it is built entirely in PyTorch, it benefits from PyTorch’s flexibility and ease of debugging.

7. LibreTranslate

LibreTranslate is an open-source machine translation service that prioritises user privacy and data security. It offers a free and transparent alternative to commercial translation services like Google Translate or DeepL.

Key technical aspects:

Privacy-first approach: Because you can host it yourself, your text never has to leave your own server, unlike with many commercial services.
Open-source transparency: The codebase is fully open-source, allowing users to inspect, modify, and deploy it on their own.
REST API support: It provides a REST API for developers to integrate translation into their applications.
Lightweight and efficient: It is easy to deploy on local servers or cloud environments.
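
As a rough idea of what using the REST API looks like, here is a small sketch using only the Python standard library. It assumes a self-hosted LibreTranslate instance at http://localhost:5000 (an address I picked for the example) and posts JSON to the /translate endpoint. The function names are my own, and an API key is only needed if your instance requires one.

```python
import json
import urllib.request

# Assumption: a self-hosted LibreTranslate instance listening on this address.
BASE_URL = "http://localhost:5000"


def build_translate_request(text, source, target, base_url=BASE_URL, api_key=None):
    """Prepare (but do not send) a POST request for the /translate endpoint."""
    payload = {"q": text, "source": source, "target": target, "format": "text"}
    if api_key:  # only needed on instances that require a key
        payload["api_key"] = api_key
    return urllib.request.Request(
        url=f"{base_url}/translate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def libre_translate(text, source="en", target="es"):
    """Send the request and return the translated string."""
    request = build_translate_request(text, source, target)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))["translatedText"]
```

With a server running, libre_translate("Hello") returns the translation as a plain string. Building the request separately from sending it makes it easy to inspect what is posted before any text leaves your machine.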

8. Bergamot Project

The Bergamot Project is an open-source initiative aimed at enabling on-device machine translation directly in web browsers (for example, as an extension that you can add to Mozilla Firefox).

Key features:

On-device, private translation: No data is sent to external servers, ensuring complete user privacy!
Marian NMT integration: The project builds on Marian NMT and optimises it for lightweight browser-based translation.
Real-time translation: It is designed for translating web content seamlessly without relying on external translation services.

It is particularly useful for browser-based applications where privacy and offline capabilities are critical.

9. Argos Translate

Argos Translate is an open-source machine translation tool built on top of OpenNMT that focuses on simplicity and offline usage.

Key features:

Offline translation: It works entirely offline once the language packages are installed, which is ideal for privacy-conscious users.
Pre-trained models: It includes pre-trained models for several language pairs and allows users to add custom models.
Cross-platform support: It is available on Windows, macOS, and Linux, and comes with an easy-to-use GUI.
Python API: Developers can integrate Argos Translate into their own applications using its Python API.
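
The Python API can be sketched roughly as follows. This assumes the argostranslate package is installed (pip install argostranslate) and that the machine has network access once, to download the language package; after that, translation runs offline. The two function names are my own; the argostranslate calls follow the project's documented usage.

```python
def find_package(packages, from_code, to_code):
    """Return the first package matching the language pair, or None."""
    for package in packages:
        if package.from_code == from_code and package.to_code == to_code:
            return package
    return None


def install_and_translate(text, from_code="en", to_code="es"):
    """Download and install the language package, then translate locally."""
    # Imported here so find_package can be used without argostranslate installed.
    import argostranslate.package
    import argostranslate.translate

    argostranslate.package.update_package_index()  # needs network access
    available = argostranslate.package.get_available_packages()
    package = find_package(available, from_code, to_code)
    if package is None:
        raise ValueError(f"No package for {from_code} -> {to_code}")
    argostranslate.package.install_from_path(package.download())

    # From here on, no network connection is required.
    return argostranslate.translate.translate(text, from_code, to_code)
```

In a real application you would install the package once and then call argostranslate.translate.translate directly, instead of repeating the download step on every call.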

10. NLLB (No Language Left Behind)

NLLB is an open-source project by Meta AI aimed at building high-quality translation models for low-resource languages. Its code was released on top of the Fairseq framework (item 4 above).

Key features:

Focus on low-resource languages: NLLB emphasises increasing translation quality for languages with limited datasets, which is great!
Multilingual training: Supports over 200 languages…
Open access: Models and code are freely available!
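
Since the models are freely available, here is a hedged sketch of how one might try them. It assumes the Hugging Face transformers library and the checkpoint facebook/nllb-200-distilled-600M, neither of which is mentioned above. The point it illustrates is that NLLB identifies each language by a language code plus a script code (for example eng_Latn), and that the target language is chosen by forcing the first generated token. The function name and the small code table are my own.

```python
# Assumption: the distilled 600M checkpoint published on the Hugging Face Hub.
MODEL_NAME = "facebook/nllb-200-distilled-600M"

# A few NLLB language identifiers: ISO 639-3 code + underscore + script code.
LANG_CODES = {
    "English": "eng_Latn",
    "Spanish": "spa_Latn",
    "French": "fra_Latn",
    "Swahili": "swh_Latn",
    "Chinese (Simplified)": "zho_Hans",
}


def nllb_translate(text, source="English", target="Swahili"):
    """Translate one sentence; requires transformers, torch and a model download."""
    # Imported here so the table above can be used without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, src_lang=LANG_CODES[source])
    model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

    inputs = tokenizer(text, return_tensors="pt")
    # Force the decoder to start with the target-language token.
    target_id = tokenizer.convert_tokens_to_ids(LANG_CODES[target])
    output = model.generate(**inputs, forced_bos_token_id=target_id, max_new_tokens=100)
    return tokenizer.batch_decode(output, skip_special_tokens=True)[0]
```

The first call downloads a model of a few gigabytes, so it is worth trying on a machine with a good connection and enough memory.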

Luis Damián Moreno García (PhD, FHEA)