UNdreaMT: Unsupervised Neural Machine Translation

This is an open source implementation of our unsupervised neural machine translation system, described in the following paper:

Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho. 2018. Unsupervised Neural Machine Translation. In Proceedings of the Sixth International Conference on Learning Representations (ICLR 2018).

If you use this software for academic research, please cite the paper in question:

  author    = {Artetxe, Mikel  and  Labaka, Gorka  and  Agirre, Eneko  and  Cho, Kyunghyun},
  title     = {Unsupervised neural machine translation},
  booktitle = {Proceedings of the Sixth International Conference on Learning Representations},
  month     = {April},
  year      = {2018}

NOTE: This software has been superseded by Monoses, our unsupervised statistical machine translation system. Monoses obtains substantially better results (e.g. 26.2 vs 15.1 BLEU in English-French WMT14), so we strongly recommend that you switch to it.


  • Python 3
  • PyTorch (tested with v0.3)


The following command trains an unsupervised NMT system from monolingual corpora using the exact same settings described in the paper:

python3 --src SRC.MONO.TXT --trg TRG.MONO.TXT --src_embeddings SRC.EMB.TXT --trg_embeddings TRG.EMB.TXT --save MODEL_PREFIX --cuda

The data in the above command should be provided as follows:

  • SRC.MONO.TXT and TRG.MONO.TXT are the source and target language monolingual corpora. Both should be pre-processed so that atomic symbols (either tokens or BPE units) are separated by whitespace. For that purpose, we recommend using Moses to tokenize and truecase the corpora and, optionally, Subword-NMT if you want to use BPE.
  • SRC.EMB.TXT and TRG.EMB.TXT are the source and target language cross-lingual embeddings. To obtain them, we recommend training monolingual embeddings on the corpora above using either word2vec or fasttext, and then mapping them to a shared space using VecMap. Please make sure to cut off the vocabulary as desired before mapping the embeddings.
  • MODEL_PREFIX is the prefix of the output model.
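
The vocabulary cutoff mentioned above can be done with a few lines of Python. The following is only a minimal sketch, not part of UNdreaMT: it assumes the embeddings are in the word2vec-style text format (a "count dimension" header line, then one word and its vector per line, most frequent words first). The function name cutoff_embeddings and its parameters are hypothetical.

```python
import io


def cutoff_embeddings(src, dst, max_vocab):
    """Copy at most `max_vocab` entries from `src` to `dst`.

    Assumption: `src` is a text stream in word2vec-style format, i.e. a
    "<count> <dim>" header followed by one "<word> <v1> ... <vdim>" line
    per word, ordered from most to least frequent.
    Returns the number of entries written.
    """
    _count, dim = (int(x) for x in src.readline().split())
    rows = []
    for line in src:
        if len(rows) >= max_vocab:
            break
        if line.strip():  # skip blank lines
            rows.append(line if line.endswith("\n") else line + "\n")
    # Write the header last so it reflects the number of rows actually kept.
    dst.write(f"{len(rows)} {dim}\n")
    dst.writelines(rows)
    return len(rows)


if __name__ == "__main__":
    # Tiny in-memory demo; with real files, pass open(...) handles instead.
    demo = io.StringIO("3 2\nthe 0.1 0.2\nof 0.3 0.4\ncat 0.5 0.6\n")
    out = io.StringIO()
    cutoff_embeddings(demo, out, 2)
    print(out.getvalue(), end="")
```

With real data, the two streams would be file handles opened on the monolingual embedding file and on the truncated copy that is then passed to the mapping step.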

Using the above settings, training takes about 3 days on a single Titan Xp. Once training is done, you can use the resulting model for translation as follows:


For more details and additional options, run the above scripts with the



I have seen that you have a separate unsupervised SMT system called Monoses. Which one should I use?

You should definitely use Monoses. It is newer and obtains substantially better results (e.g. 26.2 vs 15.1 BLEU in English-French WMT14), so we strongly recommend that you switch to it.

You claim that your unsupervised NMT system is trained on monolingual corpora alone, but it also requires bilingual embeddings... Isn't that cheating?

Not really, because we also learn the bilingual embeddings from monolingual corpora alone. We use our companion tool VecMap for that.

Can I use this software to train a regular NMT system on parallel corpora?

Yes! You can use the following arguments to make UNdreaMT behave like a regular NMT system:

python3 --src2trg SRC.PARALLEL.TXT TRG.PARALLEL.TXT --src_vocabulary SRC.VOCAB.TXT --trg_vocabulary TRG.VOCAB.TXT --embedding_size 300 --learn_encoder_embeddings --disable_denoising --save MODEL_PREFIX --cuda
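
The vocabulary files in the command above have to be prepared beforehand. The sketch below shows one possible way to derive them from a tokenized corpus. It rests on an assumption that the text above does not confirm: that a vocabulary file is plain text with one token per line. The helper build_vocabulary and its parameters are hypothetical and not part of UNdreaMT.

```python
from collections import Counter


def build_vocabulary(lines, max_size=None, min_count=1):
    """Return tokens sorted by descending frequency.

    `lines` is an iterable of whitespace-tokenized sentences.
    Ties are broken alphabetically so the output is deterministic.
    """
    counts = Counter(token for line in lines for token in line.split())
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    vocab = [token for token, count in ranked if count >= min_count]
    return vocab if max_size is None else vocab[:max_size]


if __name__ == "__main__":
    corpus = ["the cat sat", "the dog sat", "the end"]
    # ASSUMPTION: one token per line is the expected vocabulary file layout.
    for token in build_vocabulary(corpus, max_size=3):
        print(token)
```

Running the same helper once on the source side and once on the target side of the parallel corpus would give the two lists to write out as the source and target vocabulary files.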


Copyright (C) 2018, Mikel Artetxe

Licensed under the terms of the GNU General Public License, either version 3 or (at your option) any later version. A full copy of the license can be found in LICENSE.txt.
