Unsupervised NMT

Training NMT models without a single parallel sentence sounds like a daunting challenge at first: how is it even possible? There were earlier efforts to train statistical MT systems without bitext, notably the paper titled Deciphering foreign language (Ravi & Knight, 2011). In NMT, the task progressed step by step: first, learn word translations without any parallel data; next, use those aligned embeddings for sentence translation.

Unsupervised word alignment, done in embedding space, is generally a two-step process:

  1. Learn monolingual embeddings separately for each language.
  2. Learn the transformation function that maps one space to the other.

There are plenty of ways to learn word embeddings: Word2vec, fastText, GloVe, etc. Each has its own advantages (and some drawbacks). This step is a straightforward application of the distributional hypothesis to monolingual data. The second step, learning the transformation function that maps one space to the other, is the more interesting one for the translation community.

Progression of learning word embedding alignments:

  1. Use a dictionary of word translations to learn the transformation matrix. See Exploiting Similarities among Languages for Machine Translation (Mikolov, Le, & Sutskever, 2013)
  2. Use much smaller dictionaries, as small as 25 pairs. In many cases those pairs can be obtained automatically, e.g. numbers, names, etc. See Learning bilingual word embeddings with (almost) no bilingual data (Artetxe, Labaka, & Agirre, 2017)
  3. Go fully unsupervised, using adversarial training to align the words. See Word translation without parallel data (Conneau, Lample, Ranzato, Denoyer, & Jégou, 2017)

The implementation of the (Conneau, Lample, Ranzato, Denoyer, & Jégou, 2017) approach is available on GitHub as facebookresearch/MUSE, and it is the more popular one. These unsupervised approaches can sometimes perform better than the supervised approach. Their GitHub repo includes a visual interpretation of embedding alignment.

In summary, unsupervised word alignment exploits these two phenomena:

  1. Words having similar meaning appear in similar context across languages
  2. There is a linear mapping from one embedding vector space to another which can be easily learned
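
The second point can be made concrete with a toy sketch in the spirit of the dictionary-based approach. Assuming a small seed dictionary, it fits a matrix W by plain gradient descent on the squared error ||Wx - z||^2 and then translates a held-out word by nearest neighbour in the target space. The 2-d vectors, the word pairs, and all function names are made up for illustration; real systems work with 300-d embeddings and far larger vocabularies.

```python
import math

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def fit_linear_map(src_vecs, tgt_vecs, lr=0.1, steps=500):
    """Fit W minimising the mean of ||W x - z||^2 by gradient descent."""
    dim_out, dim_in = len(tgt_vecs[0]), len(src_vecs[0])
    n = len(src_vecs)
    W = [[0.0] * dim_in for _ in range(dim_out)]
    for _ in range(steps):
        grad = [[0.0] * dim_in for _ in range(dim_out)]
        for x, z in zip(src_vecs, tgt_vecs):
            err = [p - t for p, t in zip(matvec(W, x), z)]
            for r in range(dim_out):
                for c in range(dim_in):
                    grad[r][c] += 2.0 * err[r] * x[c] / n
        for r in range(dim_out):
            for c in range(dim_in):
                W[r][c] -= lr * grad[r][c]
    return W

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def translate(word, W, src_emb, tgt_emb):
    """Map the source vector with W, then pick the closest target word."""
    mapped = matvec(W, src_emb[word])
    return max(tgt_emb, key=lambda w: cosine(mapped, tgt_emb[w]))

# Made-up toy embeddings: the target space is the source space rotated by 90 degrees.
src_emb = {"dog": [1.0, 0.0], "cat": [0.0, 1.0], "bird": [1.0, 1.0], "fish": [2.0, -1.0]}
tgt_emb = {"chien": [0.0, 1.0], "chat": [-1.0, 0.0], "oiseau": [-1.0, 1.0], "poisson": [1.0, 2.0]}

# Seed dictionary; "fish" is held out on purpose.
seed = [("dog", "chien"), ("cat", "chat"), ("bird", "oiseau")]
W = fit_linear_map([src_emb[s] for s, _ in seed], [tgt_emb[t] for _, t in seed])

print(translate("fish", W, src_emb, tgt_emb))
```

The held-out word is translated correctly only because the toy target space really is a linear image of the source space, which is exactly the assumption the methods above rely on.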

Once automatically aligned word embeddings of good enough quality were available, the next problem to tackle was: “how to do sentence translation without parallel data?”

1. Unsupervised neural machine translation (Artetxe, Labaka, Agirre, & Cho, 2017)

  • Crosslingual embeddings: skip-gram word2vec, 10 negative samples, context window of 10, 300 dims. Then aligned using (Artetxe, Labaka, & Agirre, 2017)
  • NMT architecture: 2-layer bidirectional GRU encoder, 2-layer GRU decoder, 600 hidden dims, 300-dim embeddings, general attention
    • Both directions: Source -> Target and Target -> Source
    • The encoder is shared by source and target; the decoders are separate. Encoder embeddings are fixed (pretrained, aligned). Decoders are left free to evolve.
    • Vocabularies are separate for the two languages, so the embedding matrices are separate (but aligned)
  • Denoising: for a sequence of $N$ tokens, $N/2$ swaps are performed as an imitation of noise. Without noise the copy task is too trivial, and the autoencoders do not learn any useful representation.
  • On-the-fly back translation: they use the model to generate a translation (in inference mode with greedy decoding), then reconstruct the original from that translation. Back translation is a much noisier form of corruption than random word swaps.
  • Alternate the training steps between: Denoise L1 to L1; Denoise L2 to L2; Cycle via back translation: L1 -> L2 -> L1; Cycle via back translation: L2 -> L1 -> L2.
  • Trained on short sequences: 50 or fewer tokens (after BPE)
  • Took 4-5 days to train. Results on WMT: Fr-en 15.6; De-en 10.2
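
The denoising noise and the alternation of training steps described above can be sketched as follows. This is a minimal illustration, not the authors' code: swapping adjacent tokens is an assumption of the sketch (the notes above only give the number of swaps), and the function and step names are hypothetical.

```python
import itertools
import random

def swap_noise(tokens, rng=random):
    """Return a noisy copy of tokens after len(tokens) // 2 random swaps.

    Assumption: each swap exchanges two adjacent tokens.
    """
    noisy = list(tokens)
    if len(noisy) < 2:
        return noisy
    for _ in range(len(noisy) // 2):
        i = rng.randrange(len(noisy) - 1)
        noisy[i], noisy[i + 1] = noisy[i + 1], noisy[i]
    return noisy

# The four kinds of training step, visited in turn.
STEPS = [
    "denoise L1->L1",
    "denoise L2->L2",
    "backtranslate L1->L2->L1",
    "backtranslate L2->L1->L2",
]

def training_schedule():
    """Endless iterator that cycles through the four step types."""
    return itertools.cycle(STEPS)

sentence = "the cat sat on the mat".split()
print(swap_noise(sentence))
print(list(itertools.islice(training_schedule(), 6)))
```

The noisy sentence is the autoencoder input and the clean sentence is its target, so the model cannot solve the task by copying.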

2. Unsupervised machine translation using monolingual corpora only (Lample, Conneau, Denoyer, & Ranzato, 2017)

  • Cross lingual embeddings from (Conneau, Lample, Ranzato, Denoyer, & Jégou, 2017) i.e. MUSE.
  • Learn to reconstruct in both langs from the shared latent space
  • Denoising auto encoder, reconstruct from a noisy input
  • (This paper has good citations to relevant work; semi supervised, autoencoder etc)

  • LSTM with 300 dims, 3 layers. Two models, src-to-tgt and tgt-to-src: all encoder layers are shared, and all decoder layers are shared. I assume the embeddings are frozen (pretrained, aligned), so only the generator matrix differs between the two languages (it is not tied to the input embeddings, unlike the recent trend). They actually have two models, but the LSTM layers are shared.
  • Denoising autoencoders: drop words, and shuffle the order of tokens (with an upper bound k on how many positions a token can move). They found a 10% word drop rate and k=3 to be good parameters
  • Cross-domain loss (aka cycle loss): x -> C(x) -> y -> C(y) -> x’, where C is the corruption operator. Similarly, y -> C(y) -> x -> C(x) -> y’.
  • Adversarial training: Encoding of x and y sentences should be indistinguishable
  • The final objective is a linear combination of the following:
    • AE src->src
    • AE tgt->tgt
    • Cross Domain aka CycleLoss src->tgt->src
    • Cross Domain aka CycleLoss tgt->src->tgt
    • Adversarial Loss
  • Note: for the first epoch, the cross-domain step uses word-by-word translation from MUSE; later they switch to the model itself for back translation
  • Results on WMT: fr-en 14.3 while supervised is 26.1; de-en 13.3 while supervised is 25.6
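
The corruption operator C (word drop plus a shuffle in which no token moves more than k positions) can be sketched like this. Adding a random offset in [0, k+1) to each index and sorting is one way to satisfy the k-bound, and I believe it matches what the paper describes, but treat the details and the function names as assumptions of this sketch.

```python
import random

def word_drop(tokens, p_drop=0.1, rng=random):
    """Drop each token independently with probability p_drop."""
    kept = [t for t in tokens if rng.random() >= p_drop]
    # Keep at least one token so the input is never empty.
    return kept or tokens[:1]

def bounded_shuffle(tokens, k=3, rng=random):
    """Shuffle tokens so that no token moves more than k positions.

    Each index i gets the sort key i + U[0, k+1); sorting by that key
    can only displace a token by at most k places.
    """
    keys = [i + rng.random() * (k + 1) for i in range(len(tokens))]
    order = sorted(range(len(tokens)), key=keys.__getitem__)
    return [tokens[i] for i in order]

def corrupt(tokens, p_drop=0.1, k=3, rng=random):
    """C(x): drop words first, then apply the bounded shuffle."""
    return bounded_shuffle(word_drop(tokens, p_drop, rng), k, rng)

print(corrupt("we train the model to undo this noise".split()))
```

With k=0 the shuffle is the identity, and with p_drop=0 nothing is dropped, so the two kinds of noise can be ablated independently.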

3. Unsupervised neural machine translation with weight sharing (Yang, Chen, Wang, & Xu, 2018)

  • (Artetxe, Labaka, Agirre, & Cho, 2017) use a shared encoder but separate decoders. (Lample, Conneau, Denoyer, & Ranzato, 2017) share a single encoder and a single decoder (fully shared).
  • Conjecture: full sharing is not good, since languages have different syntax; fully separate models are also not good, since forcing a common latent space is hard. We want something in the middle: partial sharing.
  • Encoder: share the higher layers; allow the lower layers (close to embeddings) to be different
  • Decoder: share the lower layers; allow the higher layers (close to generator module) to be different
    • Question: the decoder has input embeddings too, just like the encoder, so why is it okay to share its lower layers? Maybe because the word embeddings are in a common latent space
  • Two kinds of adversarial discriminators were built in:
    • Prior work (Yang et al 2017, same first author): “Improving Neural Machine Translation with Conditional Sequence Generative Adversarial Nets”
    • Loss is modified to include regular loss + GAN reward
    • Two Discriminators: Local (encoded repr of source & target are indistinguishable), Global (model generated sentences are indistinguishable from human generated sentences)
  • Slight improvement in BLEU (+1.0) compared to Lample et al 2017
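
The partial sharing pattern can be illustrated without any deep learning framework. In this sketch a layer is just a stand-in object, and "sharing" means that both languages hold a reference to the same object. The layer counts and all names are illustrative assumptions, not the paper's configuration.

```python
class Layer:
    """Stand-in for a trainable layer; only object identity matters here."""
    def __init__(self, name):
        self.name = name

    def __repr__(self):
        return self.name

def build_encoders(n_layers=4, n_shared=3):
    """Encoders: private lower layers, shared higher layers."""
    shared_top = [Layer(f"enc.shared.{i}") for i in range(n_shared)]
    encoders = {}
    for lang in ("src", "tgt"):
        private_bottom = [Layer(f"enc.{lang}.{i}") for i in range(n_layers - n_shared)]
        encoders[lang] = private_bottom + shared_top  # listed from input to output
    return encoders

def build_decoders(n_layers=4, n_shared=3):
    """Decoders: shared lower layers, private higher layers."""
    shared_bottom = [Layer(f"dec.shared.{i}") for i in range(n_shared)]
    decoders = {}
    for lang in ("src", "tgt"):
        private_top = [Layer(f"dec.{lang}.{i}") for i in range(n_layers - n_shared)]
        decoders[lang] = shared_bottom + private_top  # listed from input to output
    return decoders

encoders = build_encoders()
decoders = build_decoders()
print(encoders["src"], encoders["tgt"])
print(decoders["src"], decoders["tgt"])
```

In a real framework the same effect comes from registering one module instance in both models, so that gradients from both languages update the shared parameters.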

4. Phrase-Based & Neural Unsupervised Machine Translation (Lample, Ott, Conneau, Denoyer, & Ranzato, 2018)

  • Simplification of Lample et al 2017 and Artetxe et al 2017. In essence, these three ingredients make unsupervised MT work (an ablation study confirmed it):
    • Initialization of embeddings via crosslingual embeddings
    • strong language models via denoising autoencoder
    • automatic backtranslation
  • They train NMT and PBSMT (Moses). Source code at facebookresearch/UnsupervisedMT
  • NMT training
    • Two varieties: LSTM (the paper says the setup is the same as Artetxe et al 2017, but those authors used GRUs!) and Transformer (4 layers); 512 dims.
    • Shared Encoder, forced interlingua using adversarial loss. Shared Decoder too for regularization
    • BOS (first token) of the shared decoder is the language id token. Encoder’s first token is not modified unlike Johnson et al 2016.
    • Loss is denoising reconstruction loss + back-translation loss. Note: no adversarial loss? Maybe because the encoder is shared, there is no need to explicitly force it to learn an interlingua
  • NOTE: the weights of 3 out of 4 layers were shared as suggested by Yang et al 2018 (see above) (as seen in code, not clear from the paper).
  • Validation and model selection via round-trip BLEU (x -> y -> x; y -> x -> y).
  • Note: round-trip BLEU correlates well with translation quality for the Transformer but not for LSTMs, and the reason is unclear. They used a small validation set of 100 parallel sentences for LSTM validation
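
Round-trip model selection needs only monolingual text: translate x to y, translate back to x', and score x' against x. The sketch below shows the x -> y -> x direction with toy word-by-word "models"; the other direction is symmetric. To stay short it uses clipped unigram precision as a stand-in for BLEU, which is a simplification of this sketch, and the checkpoints and dictionaries are made up.

```python
from collections import Counter

def unigram_precision(hyp, ref):
    """Clipped unigram precision; a crude stand-in for BLEU."""
    hyp_counts, ref_counts = Counter(hyp), Counter(ref)
    overlap = sum(min(c, ref_counts[t]) for t, c in hyp_counts.items())
    return overlap / max(len(hyp), 1)

def word_by_word(table):
    """Build a toy translator that looks each token up in a dictionary."""
    return lambda tokens: [table.get(t, t) for t in tokens]

def round_trip_score(sentences, forward, backward):
    """Average score of backward(forward(x)) against the original x."""
    scores = [unigram_precision(backward(forward(s)), s) for s in sentences]
    return sum(scores) / len(scores)

en_fr = {"the": "le", "cat": "chat", "sleeps": "dort"}
fr_en_good = {v: k for k, v in en_fr.items()}
fr_en_bad = {"le": "the", "chat": "dog", "dort": "eats"}

# Two hypothetical checkpoints, each a (forward, backward) pair of translators.
checkpoints = {
    "ckpt_a": (word_by_word(en_fr), word_by_word(fr_en_bad)),
    "ckpt_b": (word_by_word(en_fr), word_by_word(fr_en_good)),
}

mono_en = [["the", "cat", "sleeps"], ["the", "cat"]]
scores = {name: round_trip_score(mono_en, f, b) for name, (f, b) in checkpoints.items()}
best = max(scores, key=scores.get)
print(scores, best)
```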

5. Cross-lingual Language Model Pretraining (Lample & Conneau, 2019)

  • Code is available on GitHub at facebookresearch/XLM. (Very well written!)
  • Improvements over Lample et al 2018; better initialization
  • Instead of initializing (just) the embeddings, they train language model encoders with different objectives: causal LM (predict the next token given the left context) and masked LM (predict some masked tokens in a sequence given both left and right context)
  • If parallel data is available, it can be used for translation LM: concatenate the src + tgt sequences and mask out some tokens on both the source and target sides. Nice!!
  • When training Language Models: Mix all the languages; adjust sampling to balance low and high resources
  • Unlike earlier works, which used only pretrained embeddings for the autoencoders, this model initializes the encoder and decoder layers from a cross-lingual language model (XLM).
  • XLM is a BERT-like Transformer model trained using masked LM. Input embeddings are the sum of word, positional, and language embeddings
  • Embeddings, encoder layers, and generation matrix weights are all compatible from XLM to NMT’s encoder
  • The decoder gets most of its weights from XLM; it misses the source-attention weights, but that's okay!
  • Denoising AE and online backtranslation are still the key components
  • High BLEU scores comparable with supervised MT (going to try this)
  • Computationally expensive: language model pretraining is very expensive (they needed 64 top-of-the-line GPUs); MT fine-tuning is cheaper but still expensive (they needed 8 GPUs)
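
The sampling adjustment mentioned above (balancing low- and high-resource languages) is commonly done by raising each language's share of the data to a power alpha below 1 and renormalising. I believe XLM uses a scheme of this form, but the exponent value, the corpus sizes, and the function name here are assumptions of the sketch.

```python
def rebalanced_probs(sentence_counts, alpha=0.5):
    """Sampling probabilities q_i proportional to p_i ** alpha.

    p_i is the language's share of all sentences. alpha = 1 keeps the raw
    proportions; smaller alpha boosts low-resource languages.
    """
    total = sum(sentence_counts.values())
    weights = {lang: (n / total) ** alpha for lang, n in sentence_counts.items()}
    norm = sum(weights.values())
    return {lang: w / norm for lang, w in weights.items()}

# Made-up corpus sizes: English has nine times as much data as Swahili.
counts = {"en": 900_000, "sw": 100_000}
probs = rebalanced_probs(counts)
print(probs)
```

With these numbers the low-resource language goes from a 10% share of the data to roughly a quarter of the sampled sentences, while the high-resource language is still seen more often.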


  1. Ravi, S., & Knight, K. (2011). Deciphering foreign language. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (pp. 12–21).
  2. Mikolov, T., Le, Q. V., & Sutskever, I. (2013). Exploiting similarities among languages for machine translation. ArXiv Preprint ArXiv:1309.4168.
  3. Artetxe, M., Labaka, G., & Agirre, E. (2017). Learning bilingual word embeddings with (almost) no bilingual data. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 451–462).
  4. Conneau, A., Lample, G., Ranzato, M. A., Denoyer, L., & Jégou, H. (2017). Word translation without parallel data. ArXiv Preprint ArXiv:1710.04087.
  5. Artetxe, M., Labaka, G., Agirre, E., & Cho, K. (2017). Unsupervised neural machine translation. ArXiv Preprint ArXiv:1710.11041.
  6. Lample, G., Conneau, A., Denoyer, L., & Ranzato, M. A. (2017). Unsupervised machine translation using monolingual corpora only. ArXiv Preprint ArXiv:1711.00043.
  7. Yang, Z., Chen, W., Wang, F., & Xu, B. (2018). Unsupervised neural machine translation with weight sharing. ArXiv Preprint ArXiv:1804.09057.
  8. Lample, G., Ott, M., Conneau, A., Denoyer, L., & Ranzato, M. A. (2018). Phrase-based & neural unsupervised machine translation. ArXiv Preprint ArXiv:1804.07755.
  9. Lample, G., & Conneau, A. (2019). Cross-lingual Language Model Pretraining. CoRR, abs/1901.07291. Retrieved from http://arxiv.org/abs/1901.07291