MASS
MASS is a novel pre-training method for sequence-to-sequence language generation tasks. It randomly masks a contiguous fragment of the input sentence on the encoder side and trains the decoder to predict that fragment. For example, given the input `x1 x2 x3 x4 x5 x6 x7 x8`, the fragment `x3 x4 x5 x6` is replaced by mask tokens in the encoder input, and the decoder is trained to generate exactly those tokens.
MASS can be applied to cross-lingual tasks such as neural machine translation (NMT) and to monolingual tasks such as text summarization. This codebase supports unsupervised NMT (implemented on top of XLM).
Credits: the original developers/researchers:
facebookresearch/XLM → microsoft/MASS → this repository
1. Unsupervised MASS (UnMASS)
Unsupervised neural machine translation uses only monolingual data to train the models. During MASS pre-training, the source and target languages are pre-trained in a single model, with the corresponding language embeddings used to differentiate the languages. During fine-tuning, back-translation is used to train the unsupervised translation models. We provide pre-trained and fine-tuned models:
| Languages | Pre-trained Model | Fine-tuned Model | BPE codes | Vocabulary |
|---|---|---|---|---|
| EN - FR | | | | |
| EN - DE | | | | |
| EN - RO | | | | |
2. Setup
```
# create a conda env
conda create -n unmass python=3.7 && conda activate unmass

# for development, install it from the git repo
git clone git@github.com:thammegowda/unmass.git && cd unmass
pip install --editable .

# or, install it from pypi: https://pypi.org/project/unmass/
pip install unmass
```
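To quickly check that the installation worked, you can import the package and print the trainer's help text. This is just a sanity check I suggest, assuming the `unmass.train` entry point used throughout this README:
```
python -c "import unmass; print(unmass.__file__)"
python -m unmass.train --help
```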
Most dependencies are installed automatically by pip. The exception is `apex`, whose installation is a bit more involved, so it has to be installed manually. Before installing apex, make sure that (the snippet after this list shows the corresponding check commands):

- The environment variable `CUDA_HOME` is set and `$CUDA_HOME/bin/nvcc` is a valid path.
- The CUDA toolkit version is consistent: e.g., if `nvcc --version` reports version `10.1`, then `python -c 'import torch; print(torch.version.cuda)'` should report the same version.
- You have a reasonably new version of `gcc`; see `gcc --version`. (In my trial and error, gcc >= 4.9 and gcc <= 8.x worked.)
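For convenience, here are the checks from the list above as shell commands (the `10.1` in the comments is just the example version from above):
```
echo $CUDA_HOME && ls $CUDA_HOME/bin/nvcc              # CUDA_HOME is set, nvcc exists
nvcc --version                                         # CUDA toolkit version, e.g. 10.1
python -c 'import torch; print(torch.version.cuda)'    # should print the same version
gcc --version                                          # gcc >= 4.9 and <= 8.x worked for me
```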
Once you have met the above requirements, do the following:
```
$ git clone https://github.com/NVIDIA/apex
$ cd apex
$ pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
```
You should see a message like `Successfully installed apex-0.1` if the installation succeeded.
Otherwise, you are on your own to fix the installation (and please update this documentation).
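As an extra sanity check (my own suggestion, not an official apex test), verify that apex can be imported from Python:
```
python -c "import apex; print('apex OK:', apex.__file__)"
```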
3. Data Preparation
We use the same BPE codes and vocabulary as XLM. Here we take English-French as an example.
Using XLM's tools and prepared vocabularies:
```
cd MASS
wget https://dl.fbaipublicfiles.com/XLM/codes_enfr
wget https://dl.fbaipublicfiles.com/XLM/vocab_enfr
./get-data-nmt.sh --src en --tgt fr --reload_codes codes_enfr --reload_vocab vocab_enfr
```
Preparing from scratch:
```
unmass-prep --src de --tgt en --data runs/001-ende \
    --mono /Users/tg/work/me/mtdata/data/de-en/train-parts/news_commentary_v14 \
    --para_val /Users/tg/work/me/mtdata/data/de-en/tests/newstest2018_deen
```
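The `--mono` and `--para_val` paths above are from my own setup. As a generic, hedged sketch (the placeholder paths are hypothetical: point `--mono` at your monolingual training text and `--para_val` at a parallel validation set, and double-check the exact expectations with `unmass-prep --help`); the directory given to `--data` is presumably what you later pass to training as `--data_path`:
```
unmass-prep --src de --tgt en --data runs/001-ende \
    --mono path/to/your/monolingual/corpus \
    --para_val path/to/your/parallel/validation/set
```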
4. Pre-training
```
python -m unmass.train --exp_name unmass-enfr \
--data_path ./data/processed/en-fr/ \
--lgs 'en-fr' \
--mass_steps 'en,fr' \
--encoder_only false \
--emb_dim 1024 \
--n_layers 6 \
--n_heads 8 \
--dropout 0.1 \
--attention_dropout 0.1 \
--gelu_activation true \
--tokens_per_batch 3000 \
--optimizer adam_inverse_sqrt,beta1=0.9,beta2=0.98,lr=0.0001 \
--epoch_size 200000 \
--max_epoch 100 \
--eval_bleu true \
--word_mass 0.5 \
--min_len 5
```
During the pre-training process, even without any back-translation, you can observe that the model already achieves some initial BLEU scores:
```
epoch -> 4
valid_fr-en_mt_bleu -> 10.55
valid_en-fr_mt_bleu -> 7.81
test_fr-en_mt_bleu -> 11.72
test_en-fr_mt_bleu -> 8.80
```
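These numbers come from the periodic validation in the training log. Assuming the XLM-style dump layout this codebase inherits (something like `./dumped/<exp_name>/<exp_id>/train.log`; the exact path is an assumption, so check your `--dump_path` and `--exp_name` settings), you can watch them with:
```
# log path is an assumption based on XLM's default dump layout; adjust to your setup
grep -E "valid_(en-fr|fr-en)_mt_bleu" dumped/unmass-enfr/*/train.log | tail
```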
4.1. Fine-tuning
After pre-training, we use back-translation to fine-tune the pre-trained model for unsupervised machine translation:
```
MODEL=mass_enfr_1024.pth
python -m unmass.train --exp_name unmass-enfr-unmt \
--data_path ./data/processed/en-fr/ \
--lgs 'en-fr' \
--bt_steps 'en-fr-en,fr-en-fr' \
--encoder_only false \
--emb_dim 1024 \
--n_layers 6 \
--n_heads 8 \
--dropout 0.1 \
--attention_dropout 0.1 \
--gelu_activation true \
--tokens_per_batch 2000 \
--batch_size 32 \
--bptt 256 \
--optimizer adam_inverse_sqrt,beta1=0.9,beta2=0.98,lr=0.0001 \
--epoch_size 200000 \
--max_epoch 30 \
--eval_bleu true \
--reload_model "$MODEL,$MODEL"
```
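Here `$MODEL` should point at a pre-trained checkpoint: either one of the released pre-trained models or your own checkpoint from the pre-training step above. Passing the same file twice reloads both the encoder and the decoder from it (XLM-style `--reload_model`). A hedged example, assuming the XLM-style dump layout from the pre-training run (the path and checkpoint name are assumptions; adjust to wherever your checkpoint was actually saved):
```
# hypothetical path: XLM-style dump layout from the pre-training run above
MODEL=dumped/unmass-enfr/<exp_id>/checkpoint.pth
```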
We also provide a demo that uses the MASS pre-trained model on the WMT16 En-Ro bilingual dataset, along with pre-trained and fine-tuned models:
| Model | Ro-En BLEU (with BT) |
|:---------:|:----:|
| Baseline | 34.0 |
| XLM | 38.5 |
| [MASS](https://modelrelease.blob.core.windows.net/mass/mass_mt_enro_1024.pth) | 39.1 |
Download the dataset with the commands below:
```
wget https://dl.fbaipublicfiles.com/XLM/codes_enro
wget https://dl.fbaipublicfiles.com/XLM/vocab_enro
./get-data-bilingual-enro-nmt.sh --src en --tgt ro --reload_codes codes_enro --reload_vocab vocab_enro
```
After downloading the MASS pre-trained model from the link above, use the following command to fine-tune it:
```
MODEL=mass_enro_1024.pth
python -m unmass.train \
--exp_name unsupMT_enro \
--data_path ./data/processed/en-ro \
--lgs 'en-ro' \
--bt_steps 'en-ro-en,ro-en-ro' \
--encoder_only false \
--mt_steps 'en-ro,ro-en' \
--emb_dim 1024 \
--n_layers 6 \
--n_heads 8 \
--dropout 0.1 \
--attention_dropout 0.1 \
--gelu_activation true \
--tokens_per_batch 2000 \
--batch_size 32 \
--bptt 256 \
--optimizer adam_inverse_sqrt,beta1=0.9,beta2=0.98,lr=0.0001 \
--epoch_size 200000 \
--max_epoch 50 \
--eval_bleu true \
--reload_model "$MODEL,$MODEL"
```
4.2. Training Details
MASS-base-uncased uses 32x NVIDIA 32GB V100 GPUs and is trained on Wikipedia + BookCorpus (16GB) for 20 epochs in float32, with a simulated batch size of 4096.
5. Other questions
Q1: When I run this program on multiple GPUs or multiple nodes, it reports errors like `ModuleNotFoundError: No module named 'mass'`.
A1: This seems to be a bug in Python's `multiprocessing/spawn.py`. A direct workaround is to move these files into the corresponding folders under fairseq, and do not forget to update the import paths in the code.
6. Reference
If you find MASS useful in your work, you can cite the paper as follows:
```
@inproceedings{song2019mass,
  title={MASS: Masked Sequence to Sequence Pre-training for Language Generation},
  author={Song, Kaitao and Tan, Xu and Qin, Tao and Lu, Jianfeng and Liu, Tie-Yan},
  booktitle={International Conference on Machine Learning},
  pages={5926--5936},
  year={2019}
}
```