by Thamme Gowda · Paper · tags: NMT

Collaborators: Zhao Zhang, Chris A Mattmann, and Jonathan May

Abstract

We present useful tools for machine translation research: MTData, NLCodec, and RTG. We demonstrate their usefulness by creating a multilingual neural machine translation model capable of translating from 500 source languages to English. We make this multilingual model readily downloadable and usable as a service, or as a parent model for transfer learning to even lower-resource languages.

Summary

We announce three tools for machine translation:

MTData

A tool to download machine translation datasets https://github.com/thammegowda/mtdata/
pip install mtdata
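
For example, MTData can list the datasets available for a language pair and download a selection of them. The commands below are a sketch: dataset IDs and some flags differ across mtdata versions, so confirm with mtdata list -h and mtdata get -h, and replace <dataset-id> with IDs taken from the list output.

# list datasets available for a language pair (e.g., German-English)
mtdata list -l deu-eng

# download chosen training and test sets into a directory
mtdata get -l deu-eng -tr <dataset-id> -ts <dataset-id> -o data-dir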

NLCodec

https://isi-nlp.github.io/nlcodec/
pip install nlcodec
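
NLCodec is a vocabulary manager and text-encoding library supporting character, word, and byte-pair-encoding (BPE) schemes, with a scalable storage layer for large datasets. A rough CLI sketch for learning and applying a BPE vocabulary follows; the flags are illustrative and may vary between versions, so confirm with nlcodec -h.

# learn a BPE vocabulary of 8,000 types from training text
nlcodec learn -l bpe -vs 8000 -m bpe.8k.model -i train.txt

# encode text into BPE pieces, then decode it back
nlcodec encode -m bpe.8k.model -i input.txt -o input.bpe.txt
nlcodec decode -m bpe.8k.model -i input.bpe.txt -o input.restored.txt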

RTG

https://isi-nlp.github.io/rtg/
pip install rtg
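
RTG (Reader-Translator-Generator) is a PyTorch-based neural machine translation toolkit built around Transformer models. Training is driven by an experiment directory containing a conf.yml; here is a minimal sketch, assuming an experiment directory named runs/my-expt (see the RTG docs for the conf.yml format).

# prepare data, train, and test, as specified in runs/my-expt/conf.yml
rtg-pipe runs/my-expt

# equivalently, via the module entry point:
# python -m rtg.pipeline runs/my-expt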

Scripts, tables, and charts: https://github.com/thammegowda/006-many-to-eng

Using these tools, we first collected a massive dataset, which is available for download.

We then trained massively multilingual models for many-to-English translation.

We provide a Docker image with the pretrained model, which can be readily used as a translation service:

# pick the latest image from https://hub.docker.com/repository/docker/tgowda/rtg-model
IMAGE=tgowda/rtg-model:500toEng-v1

# To run without using GPU; requires about 5 to 6GB CPU RAM
docker run --rm -i -p 6060:6060 $IMAGE

# Recommended: use GPU (e.g. device=0)
docker run --gpus '"device=0"' --rm -i -p 6060:6060 $IMAGE

If you prefer not to use Docker, follow these steps:

Step 1: Set up a conda environment and install the rtg library.

If conda is not installed on your system, install Miniconda to get started.
conda create -n rtg python=3.7
conda activate rtg
pip install rtg==0.5.0  # install rtg and its dependencies
conda install -c conda-forge uwsgi  # needed to deploy service
Step 2: Download a model and run it.
# Pick the latest version
MODEL=rtg500eng-tfm9L6L768d-bsz720k-stp200k-ens05.tgz
wget http://rtg.isi.edu/many-eng/models/$MODEL

tar xvf $MODEL    # Extract and run
uwsgi --http 127.0.0.1:6060 --module rtg.serve.app:app --pyargv "/path/to/extracted/dir"
# Alternatively, without uWSGI (not recommended)
# rtg-serve /path/to/extracted/dir
# See "rtg-serve -h" to learn optional arguments for --pyargv "<here>" of uWSGI
Step 3: Interact with the REST API.
API=http://localhost:6060/translate
curl $API --data "source=Comment allez-vous?" \
   --data "source=Bonne journée"

# API also accepts input as JSON data
curl -X POST -H "Content-Type: application/json" $API \
  --data '{"source":["Comment allez-vous?", "Bonne journée"]}'
To learn more about the RTG service and how to interact with it, see the RTG docs: https://isi-nlp.github.io/rtg/
Decoding in Batch Mode
# `pip install rtg==0.5.0` should have already installed sacremoses-xt
pip install sacremoses-xt==0.0.44
sacremoses normalize -q -d -p -c tokenize -a -x -p :web: < input.src > input.src.tok

export CUDA_VISIBLE_DEVICES=0   # set GPU device ID
rtg-decode /path/to/model-extract -if input.src.tok -of output.out

# post process; drop <unk>s, detokenize
cut -f1 output.out | sed 's/<unk>//g' | sacremoses detokenize > output.out.detok
Parent-Child Transfer for Low Resource MT

The pretrained model can be adapted to a specific dataset using the parent-child transfer setup.

The learning rate of the child model's trainer is a crucial hyperparameter: too high a learning rate destroys the parent model's weights, while too low a learning rate yields little adaptation to the child dataset. Hence, the learning rate has to be just right; refer to the conf.yml files in https://github.com/thammegowda/006-many-to-eng/tree/master/lowres-xfer
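
A minimal workflow sketch follows, assuming the parent model downloaded in Step 2 above and a child experiment directory whose conf.yml is adapted from the lowres-xfer examples; the paths here are placeholders.

# 1. download and extract the parent model (see "Download a model" above)
# 2. create a child experiment directory and adapt a conf.yml from
#    https://github.com/thammegowda/006-many-to-eng/tree/master/lowres-xfer
mkdir -p runs/child-expt
cp /path/to/lowres-xfer/conf.yml runs/child-expt/conf.yml   # edit data paths, learning rate, etc.

# 3. run the RTG pipeline: prepare data, train the child model, and test
rtg-pipe runs/child-expt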

Citation

@inproceedings{gowda-etal-2021-many,
    title = "Many-to-{E}nglish Machine Translation Tools, Data, and Pretrained Models",
    author = "Gowda, Thamme  and
      Zhang, Zhao  and
      Mattmann, Chris  and
      May, Jonathan",
    booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.acl-demo.37",
    doi = "10.18653/v1/2021.acl-demo.37",
    pages = "306--316",
}

Acknowledgements

Thanks to USC CARC and TACC for providing computing resources. Thanks to Jörg Tiedemann for hosting the dataset at OPUS.