MASS

MASS is a novel pre-training method for sequence-to-sequence language generation tasks. It randomly masks a sentence fragment in the encoder input and then predicts that fragment in the decoder.


MASS can be applied to cross-lingual tasks such as neural machine translation (NMT) and to monolingual tasks such as text summarization. The current codebase supports unsupervised NMT (implemented on top of XLM).

Credits to the original developers/researchers:

```
facebookresearch/XLM
   |---microsoft/MASS
        |---<this>
```

1. Unsupervised MASS (UnMASS)

Unsupervised neural machine translation trains models using only monolingual data. During MASS pre-training, the source and target languages are trained in a single model, with language embeddings used to distinguish them. During MASS fine-tuning, back-translation is used to train the unsupervised translation models. We provide pre-trained and fine-tuned models:

Table 1: Pre-trained and fine-tuned models

| Languages | Pre-trained Model | Fine-tuned Model | BPE codes | Vocabulary |
|:---------:|:-----------------:|:----------------:|:---------:|:----------:|
| EN - FR   | MODEL             | MODEL            | BPE_codes | Vocabulary |
| EN - DE   | MODEL             | MODEL            | BPE_codes | Vocabulary |
| EN - RO   | MODEL             | MODEL            | BPE_codes | Vocabulary |

2. Setup

```
# create a conda env
conda create -n unmass python=3.7 && conda activate unmass

# for development, install from the git repo
git clone git@github.com:thammegowda/unmass.git  && cd unmass
pip install --editable .

# or, install the released version from pypi: https://pypi.org/project/unmass/
pip install unmass
```

Most dependencies are installed automatically by pip. The exception is apex, whose installation is a bit more involved and has to be done manually. Before installing apex, make sure that:

  1. The environment variable CUDA_HOME is set and $CUDA_HOME/bin/nvcc is a valid path.

  2. The CUDA toolkit versions are consistent: e.g. if nvcc --version reports 10.1, then python -c 'import torch; print(torch.version.cuda)' should report the same version.

  3. You have a suitable version of gcc; see gcc --version. (In my trial and error, gcc >= 4.9 and gcc <= 8.x worked.) The snippet below bundles these checks.
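
A minimal shell sketch that runs these checks in one go (it only prints the relevant versions; judging whether they match is still up to you):

```
# apex prerequisites: print everything needed to judge the checklist above
echo "CUDA_HOME=$CUDA_HOME"
ls "$CUDA_HOME/bin/nvcc"            # should be a valid path

nvcc --version | grep -i release    # CUDA toolkit version seen by nvcc
python -c 'import torch; print("torch built with CUDA", torch.version.cuda)'

gcc --version | head -n 1           # gcc >= 4.9 and <= 8.x worked for me
```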

Once you have met the above requirements, do the following:

```
$ git clone https://github.com/NVIDIA/apex
$ cd apex
$ pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
```

You should see a message like Successfully installed apex-0.1 if the installation succeeds. Otherwise, you are on your own to fix the installation (and please update this documentation).
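
To double-check the environment, the commands below should all succeed. This assumes the unmass-prep console script and the unmass.train module used later in this README are available in your environment, and that both expose standard argparse-style --help output:

```
# apex should import cleanly after the build above
python -c 'import apex; from apex import amp; print("apex OK")'

# the unmass entry points used later in this README should print their usage
unmass-prep --help
python -m unmass.train --help
```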

3. Data Preparation

We use the same BPE codes and vocabulary as XLM. Here we take English-French as an example.

Using XLM tools and prepared vocabs:

```
cd MASS

wget https://dl.fbaipublicfiles.com/XLM/codes_enfr
wget https://dl.fbaipublicfiles.com/XLM/vocab_enfr

./get-data-nmt.sh --src en --tgt fr --reload_codes codes_enfr --reload_vocab vocab_enfr
```
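
If the script follows the XLM defaults it is based on, the processed, binarized data ends up under ./data/processed/en-fr/, which is the --data_path used in the training commands below. The output location is an assumption; check the script's own output:

```
# expected output location (assumption based on XLM's get-data-nmt.sh defaults)
ls ./data/processed/en-fr/
```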

Preparing from scratch:

```
unmass-prep --src de --tgt en --data runs/001-ende \
  --mono /Users/tg/work/me/mtdata/data/de-en/train-parts/news_commentary_v14 \
  --para_val /Users/tg/work/me/mtdata/data/de-en/tests/newstest2018_deen
```

4. Pre-training

```
python -m unmass.train --exp_name unmass-enfr  \
--data_path ./data/processed/en-fr/                  \
--lgs 'en-fr'                                        \
--mass_steps 'en,fr'                                 \
--encoder_only false                                 \
--emb_dim 1024                                       \
--n_layers 6                                         \
--n_heads 8                                          \
--dropout 0.1                                        \
--attention_dropout 0.1                              \
--gelu_activation true                               \
--tokens_per_batch 3000                              \
--optimizer adam_inverse_sqrt,beta1=0.9,beta2=0.98,lr=0.0001 \
--epoch_size 200000                                  \
--max_epoch 100                                      \
--eval_bleu true                                     \
--word_mass 0.5                                      \
--min_len 5                                          \
```


During the pre-training process, even without any back-translation, you can observe that the model already reaches some initial BLEU scores:
```
epoch -> 4
valid_fr-en_mt_bleu -> 10.55
valid_en-fr_mt_bleu ->  7.81
test_fr-en_mt_bleu  -> 11.72
test_en-fr_mt_bleu  ->  8.80
```

4.1. Fine-tuning

After pre-training, we use back-translation to fine-tune the pre-trained model on unsupervised machine translation:

```
MODEL=mass_enfr_1024.pth

python -m unmass.train --exp_name unmass-enfr-unmt      \
  --data_path ./data/processed/en-fr/                  \
  --lgs 'en-fr'                                        \
  --bt_steps 'en-fr-en,fr-en-fr'                       \
  --encoder_only false                                 \
  --emb_dim 1024                                       \
  --n_layers 6                                         \
  --n_heads 8                                          \
  --dropout 0.1                                        \
  --attention_dropout 0.1                              \
  --gelu_activation true                               \
  --tokens_per_batch 2000                              \
  --batch_size 32                                             \
  --bptt 256                                           \
  --optimizer adam_inverse_sqrt,beta1=0.9,beta2=0.98,lr=0.0001 \
  --epoch_size 200000                                  \
  --max_epoch 30                                       \
  --eval_bleu true                                     \
  --reload_model "$MODEL,$MODEL"
```

We also provide a demo of using the MASS pre-trained model on the WMT16 En-Ro bilingual dataset, along with pre-trained and fine-tuned models:

| Model | Ro-En BLEU (with BT) |
|:---------:|:----:|
| Baseline | 34.0 |
| XLM | 38.5 |
| [MASS](https://modelrelease.blob.core.windows.net/mass/mass_mt_enro_1024.pth) | 39.1 |

Download the dataset with the commands below:

```
wget https://dl.fbaipublicfiles.com/XLM/codes_enro
wget https://dl.fbaipublicfiles.com/XLM/vocab_enro

./get-data-bilingual-enro-nmt.sh --src en --tgt ro --reload_codes codes_enro --reload_vocab vocab_enro
```

After downloading the MASS pre-trained model from the link above, use the following command to fine-tune:

```
MODEL=mass_enro_1024.pth

python -m unmass.train  \
        --exp_name unsupMT_enro                              \
        --data_path ./data/processed/en-ro                   \
        --lgs 'en-ro'                                        \
        --bt_steps 'en-ro-en,ro-en-ro'                       \
        --encoder_only false                                 \
        --mt_steps 'en-ro,ro-en'                             \
        --emb_dim 1024                                       \
        --n_layers 6                                         \
        --n_heads 8                                          \
        --dropout 0.1                                        \
        --attention_dropout 0.1                              \
        --gelu_activation true                               \
        --tokens_per_batch 2000                              \
        --batch_size 32                                      \
        --bptt 256                                           \
        --optimizer adam_inverse_sqrt,beta1=0.9,beta2=0.98,lr=0.0001 \
        --epoch_size 200000                                  \
        --max_epoch 50                                       \
        --eval_bleu true                                     \
        --reload_model "$MODEL,$MODEL"
```
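
Once fine-tuning has finished, you will want to decode with the trained checkpoint. The sketch below is a hypothetical decoding call modeled on XLM's translate.py, which this codebase derives from; the unmass.translate module name, the flags, and the checkpoint path are all assumptions, so check your installation (e.g. python -m unmass.translate --help) and adjust accordingly:

```
# hypothetical decoding call, modeled on XLM's translate.py; the module name,
# flags and checkpoint path are assumptions -- adjust to your setup
MODEL=path/to/fine-tuned-checkpoint.pth

cat input.ro | python -m unmass.translate \
  --exp_name translate_roen \
  --src_lang ro --tgt_lang en \
  --model_path "$MODEL" \
  --output_path output.en
```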

4.2. Training Details

MASS-base-uncased was trained on 32 NVIDIA 32GB V100 GPUs over Wikipedia + BookCorpus (16GB) for 20 epochs in float32, with a simulated batch size of 4096.

5. Other questions

Q1: When I run this program on multiple GPUs or multiple nodes, it reports errors like ModuleNotFoundError: No module named 'mass'.

A1: This seems to be a bug in Python's multiprocessing/spawn.py; a direct workaround is to move these files into the corresponding folders under fairseq. Do not forget to modify the import paths in the code.

6. Reference

If you find MASS useful in your work, you can cite the paper as below:

```
@inproceedings{song2019mass,
    title={MASS: Masked Sequence to Sequence Pre-training for Language Generation},
    author={Song, Kaitao and Tan, Xu and Qin, Tao and Lu, Jianfeng and Liu, Tie-Yan},
    booktitle={International Conference on Machine Learning},
    pages={5926--5936},
    year={2019}
}
```