Collaborators: Zhao Zhang, Chris A. Mattmann, and Jonathan May
We present useful tools for machine translation research: MTData, NLCodec, and RTG. We demonstrate their usefulness by creating a multilingual neural machine translation model capable of translating from 500 source languages to English. We make this multilingual model readily downloadable and usable as a service, or as a parent model for transfer-learning to even lower-resource languages.
We announce three tools for machine translation:
MTData: a tool to download machine translation datasets
https://github.com/thammegowda/mtdata/
pip install mtdata
NLCodec: a vocabulary manager and efficient storage layer for encoded MT data
https://isi-nlp.github.io/nlcodec/
pip install nlcodec
RTG (Reader-Translator-Generator): a neural machine translation toolkit based on PyTorch
https://isi-nlp.github.io/rtg/
pip install rtg
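For a quick taste of the first two tools, here is a minimal sketch of listing and downloading a German-English dataset with mtdata, then learning and applying a BPE vocabulary with nlcodec. The dataset IDs and some flag names changed across releases, so treat the exact arguments as illustrative and verify against mtdata --help and nlcodec --help for your installed version:
# list datasets available for a language pair (German-English here)
mtdata list -l deu-eng
# download selected train/test sets into a directory
# (dataset IDs below follow the early naming scheme and are illustrative)
mtdata get -l deu-eng --train news_commentary_v14 --test newstest2019_deen --out de-en-data
# learn an 8k BPE vocabulary from a tokenized text file
nlcodec learn -l bpe -vs 8000 -m bpe.model -i train.txt
# encode text with the learned vocabulary
nlcodec encode -m bpe.model -i input.txt -o input.bpe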
Scripts, tables, and charts: https://github.com/thammegowda/006-many-to-eng
Using these tools, we first collected a massive dataset.
The dataset is also available from OPUS: https://opus.nlpl.eu/MT560.php
We then trained massively multilingual models for many-to-English translation.
Recommended model: rtg500eng-tfm9L6L768d-bsz720k-stp200k-ens05.tgz
We provide a Docker image with the pretrained model, which is readily usable as a translation service:
# pick the latest image from https://hub.docker.com/repository/docker/tgowda/rtg-model
IMAGE=tgowda/rtg-model:500toEng-v1
# To run without using GPU; requires about 5 to 6GB CPU RAM
docker run --rm -i -p 6060:6060 $IMAGE
# Recommended: use GPU (e.g. device=0)
docker run --gpus '"device=0"' --rm -i -p 6060:6060 $IMAGE
If you prefer not to use Docker:
Step 1: Set up a conda environment and install the rtg library.
If conda is missing on your system, install Miniconda to get started.
conda create -n rtg python=3.7
conda activate rtg
pip install rtg==0.5.0 # install rtg and its dependencies
conda install -c conda-forge uwsgi # needed to deploy service
# Step 2: Download and extract a pretrained model; pick the latest version
MODEL=rtg500eng-tfm9L6L768d-bsz720k-stp200k-ens05.tgz
wget http://rtg.isi.edu/many-eng/models/$MODEL
tar xvf $MODEL # Extract and run
uwsgi --http 127.0.0.1:6060 --module rtg.serve.app:app --pyargv "/path/to/extracted/dir"
# Alternatively, without uWSGI (not recommended)
# rtg-serve /path/to/extracted/dir
# See "rtg-serve -h" to learn optional arguments for --pyargv "<here>" of uWSGI
Web interface: a simple web interface is available at http://localhost:6060.
The REST API is available at http://localhost:6060/translate.
An example interaction with the REST API:
API=http://localhost:6060/translate
curl $API --data "source=Comment allez-vous?" \
    --data "source=Bonne journée"
# API also accepts input as JSON data
curl -X POST -H "Content-Type: application/json" $API \
    --data '{"source":["Comment allez-vous?", "Bonne journée"]}'
To learn more about the RTG service and how to interact with it, see the RTG Docs.
For batch decoding of files without the web service, pre-process the input, decode, and then post-process:
# `pip install rtg==0.5.0` should have already installed sacremoses-xt
pip install sacremoses-xt==0.0.44
# pre-process: normalize and tokenize the source text
sacremoses normalize -q -d -p -c tokenize -a -x -p :web: < input.src > input.src.tok
export CUDA_VISIBLE_DEVICES=0 # set GPU device ID
rtg-decode /path/to/model-extract -if input.src.tok -of output.out
# post process; drop <unk>s, detokenize
cut -f1 output.out | sed 's/<unk>//g' | sacremoses detokenize > output.out.detok
The pretrained model can be adapted to a specific dataset using a parent-child transfer setup.
The learning rate of the child model's trainer is a crucial hyperparameter: too high a learning rate destroys the parent model's weights, while too low a learning rate yields little adaptation to the child dataset.
Hence, the learning rate has to be just right; refer to the conf.yml files in https://github.com/thammegowda/006-many-to-eng/tree/master/lowres-xfer
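For concreteness, here is a minimal sketch of a child training run. The directory names are hypothetical; the conf.yml should be copied from one of the language subdirectories in the repository above, with the learning rate tuned inside that file:
# hypothetical layout: copy a child config from the lowres-xfer examples
mkdir -p my-xfer-expt
cp lowres-xfer/somelang/conf.yml my-xfer-expt/conf.yml  # tune the optimizer learning rate in this file
# rtg-pipe runs the full pipeline (data prep, training, testing) for the experiment directory
rtg-pipe my-xfer-expt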
@inproceedings{gowda-etal-2021-many,
    title = "Many-to-{E}nglish Machine Translation Tools, Data, and Pretrained Models",
    author = "Gowda, Thamme and Zhang, Zhao and Mattmann, Chris and May, Jonathan",
    booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.acl-demo.37",
    doi = "10.18653/v1/2021.acl-demo.37",
    pages = "306--316",
}
Thanks to USC CARC and TACC for providing computing resources. Thanks to Jörg Tiedemann for hosting the dataset at OPUS.