While there are more than 7000 languages in the world, most translation research efforts have targeted a few high-resource languages. Commercial translation systems support only one hundred languages or fewer, and do not make these models available for transfer to low-resource languages. In this work, we present useful tools for machine translation research: MTData, NLCodec, and Reader-Translator-Generator (RTG). We demonstrate their usefulness by creating a multilingual NMT model capable of translating from 500 source languages to English. We make this multilingual model readily downloadable and usable as a service, or as a parent model for transfer learning to even lower-resource languages.

Using Pretrained Models

Option 1: Using Docker

Step 1: Download the image and run it.

# pick the latest image from https://hub.docker.com/repository/docker/tgowda/rtg-model
IMAGE=tgowda/rtg-model:500toEng-v1

# To run without a GPU; requires about 5 to 6 GB of CPU RAM
docker run --rm -i -p 6060:6060 $IMAGE

# Recommended: use GPU (e.g. device=0)
docker run --gpus '"device=0"' --rm -i -p 6060:6060 $IMAGE
Warning
Your Docker manager may enforce memory and CPU-core constraints on your host. The above image should be permitted to use at least 6 GB of RAM. To adjust the RAM and CPU cores available to a Docker container, refer to the instructions on the web, e.g. https://stackoverflow.com/a/50770267/1506477

Once the Docker image is running, you can access the translation service at http://localhost:6060/. No data is shared with any cloud service.
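
As a quick sanity check from another terminal, you can send a request to the /translate endpoint described below (the example sentence is arbitrary):

curl http://localhost:6060/translate --data "source=Bonjour le monde"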

Option 2: Without Using Docker

Step 1: Set up a conda environment and install the rtg library.
If conda is missing on your system, install Miniconda to get started.

conda create -n rtg python=3.7
conda activate rtg
pip install rtg==0.5.0  # install rtg and its dependencies
conda install -c conda-forge uwsgi  # needed to deploy service
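
As an optional check that the installation succeeded, confirm that the rtg command-line entry points used below are available:

rtg-serve -h   # prints usage for the translation service
rtg-decode -h  # prints usage for batch decoding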

Step 2: Download a model and run it.

# Pick the latest version
MODEL=rtg500eng-tfm9L6L768d-bsz720k-stp200k-ens05.tgz
# Download
wget http://rtg.isi.edu/many-eng/models/$MODEL
# Extract
tar xvf $MODEL
# Use uWSGI to run it
uwsgi --http 127.0.0.1:6060 --module rtg.serve.app:app --pyargv "/path/to/extracted/dir"
# Alternatively, for quick testing, run without uWSGI (not recommended for production)
# rtg-serve /path/to/extracted/dir
# Also see "rtg-serve -h" to learn the optional arguments that can be passed to uWSGI via --pyargv "<here>"

Interacting with the REST API

API=http://localhost:6060/translate
curl $API --data "source=Comment allez-vous?" \
   --data "source=Bonne journée"

# API also accepts input as JSON data
curl -X POST -H "Content-Type: application/json" $API \
  --data '{"source":["Comment allez-vous?", "Bonne journée"]}'
Note
To learn more about the RTG service and how to interact with it, see the RTG Docs.

Decoding in Batch Mode

# `pip install rtg==0.5.0` should have already installed sacremoses-xt; if not, install it explicitly:
pip install sacremoses-xt==0.0.44
sacremoses normalize -q -d -p -c tokenize -a -x -p :web: < input.src > input.src.tok

export CUDA_VISIBLE_DEVICES=0   # select the GPU device ID; the variable must be exported to take effect
rtg-decode /path/to/model-extract -if input.src.tok -of output.out

# post-process: drop <unk> tokens and detokenize
cut -f1 output.out | sed 's/<unk>//g' | sacremoses detokenize > output.out.detok
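
To decode several files in one go, the same three steps can be wrapped in a loop; a minimal sketch, with placeholder file names:

for f in data/*.src; do
    sacremoses normalize -q -d -p -c tokenize -a -x -p :web: < "$f" > "$f.tok"
    rtg-decode /path/to/model-extract -if "$f.tok" -of "$f.out"
    cut -f1 "$f.out" | sed 's/<unk>//g' | sacremoses detokenize > "$f.detok"
done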

Parent-Child Transfer for Low-Resource MT

Refer to the documentation at RTG Docs.

The learning rate of the child model's trainer is a crucial hyperparameter: too high a learning rate destroys the parent model's weights, while too low a learning rate yields little adaptation to the child dataset. Hence, the learning rate has to be just right; refer to the conf.yml files in https://github.com/thammegowda/006-many-to-eng/tree/master/lowres-xfer (see the sketch below for one way to obtain them).
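
To browse those reference configurations locally, you can clone the repository; the directory name follows from the URL above:

git clone https://github.com/thammegowda/006-many-to-eng.git
ls 006-many-to-eng/lowres-xfer/   # reference conf.yml files for child-model training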

Citation

Please cite the following article to reference this work:

@misc{gowda2021manytoenglish,
      title={Many-to-English Machine Translation Tools, Data, and Pretrained Models},
      author={Thamme Gowda and Zhao Zhang and Chris A Mattmann and Jonathan May},
      year={2021},
      eprint={2104.00290},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}