While there are more than 7000 languages in the world, most translation research efforts have targeted a few high-resource languages. Commercial translation systems support only one hundred languages or fewer, and do not make these models available for transfer to low-resource languages. In this work, we present useful tools for machine translation research: MTData, NLCodec, and Reader-Translator-Generator (RTG). We demonstrate their usefulness by creating a multilingual NMT model capable of translating from 500 source languages to English. We make this multilingual model readily downloadable and usable as a service, or as a parent model for transfer-learning to even lower-resource languages.

Using Pretrained Models

Option 1: Using Docker

Step 1: Download the image and run it.

# pick the latest image from https://hub.docker.com/repository/docker/tgowda/rtg-model
IMAGE=tgowda/rtg-model:500toEng-v1

# Run on CPU only (no GPU); requires about 5 to 6 GB of RAM
docker run --rm -i -p 6060:6060 $IMAGE

# Recommended: use GPU (e.g. device=0)
docker run --gpus '"device=0"' --rm -i -p 6060:6060 $IMAGE
Warning
Docker may enforce memory and CPU core limits on your host. The above image should be permitted to use at least 6 GB of RAM. To adjust the RAM and CPU cores available to Docker, refer to instructions on the web, e.g. stackoverflow.com/a/50770267/1506477

Once the Docker container is running, you can access the translation service at localhost:6060/. No data is shared with any cloud service.
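
The service may need some time to load the model after the container starts, so scripts that call it may want to wait until it responds. The following is a minimal sketch, not part of RTG: `wait_for_service` is a hypothetical helper, and it assumes that the page at localhost:6060/ answers with a success status once the service is ready.

```shell
# Hypothetical helper (not part of RTG): poll a URL until it answers.
# Usage: wait_for_service URL [MAX_TRIES]
# Returns 0 once the URL responds with a success status, 1 otherwise.
wait_for_service() {
  url="$1"
  tries="${2:-30}"
  while [ "$tries" -gt 0 ]; do
    # --fail makes curl exit non-zero on HTTP error statuses (>= 400)
    if curl --silent --fail --output /dev/null "$url"; then
      return 0
    fi
    tries=$((tries - 1))
    sleep 1
  done
  return 1
}
```

For example, `wait_for_service http://localhost:6060/ 60 && echo "service is up"` waits up to about a minute before giving up.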

Option 2: Without using Docker

Step 1: Set up a conda environment and install the rtg library.
If conda is missing on your system, please install Miniconda to get started.

conda create -n rtg python=3.7
conda activate rtg
pip install rtg==0.5.0  # install rtg and its dependencies
conda install -c conda-forge uwsgi  # needed to deploy service

Step 2: Download a model and run it.

# Pick the latest version
MODEL=rtg500eng-tfm9L6L768d-bsz720k-stp200k-ens05.tgz
# Download
wget http://rtg.isi.edu/many-eng/models/$MODEL
# Extract
tar xvf $MODEL
# Use uWSGI to run it
uwsgi --http 127.0.0.1:6060 --module rtg.serve.app:app --pyargv "/path/to/extracted/dir"
# Alternatively, run without uWSGI for quick testing (not recommended)
# rtg-serve /path/to/extracted/dir
# See "rtg-serve -h" for the optional arguments that can be passed via --pyargv "<here>" of uWSGI

Interaction with REST API

API=http://localhost:6060/translate
curl $API --data "source=Comment allez-vous?" \
   --data "source=Bonne journée"

# API also accepts input as JSON data
curl -X POST -H "Content-Type: application/json" $API \
  --data '{"source":["Comment allez-vous?", "Bonne journée"]}'
Note
To learn more about the RTG service and how to interact with it, see the RTG Docs.
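
Text passed with plain `--data` is sent as-is, so a sentence containing `&`, `=`, or `+` can be misread as form syntax. The sketch below sends every non-empty line of standard input as its own `source` field using curl's `--data-urlencode` option. `translate_lines` and the `CURL` override variable are hypothetical names introduced here, and the sketch assumes the service decodes URL-encoded form fields the same way it accepts the plain ones above. The response is printed unmodified.

```shell
# Hypothetical wrapper (not part of RTG): translate each line read from stdin.
# Usage: translate_lines API_URL < sentences.txt
# Set CURL=echo to print the curl arguments instead of sending a request.
translate_lines() {
  api="$1"
  set --   # reuse "$@" as the list of curl arguments
  # the "|| [ -n ... ]" part keeps a final line that lacks a trailing newline
  while IFS= read -r line || [ -n "$line" ]; do
    [ -n "$line" ] || continue   # skip blank lines
    set -- "$@" --data-urlencode "source=$line"
  done
  ${CURL:-curl} --silent "$api" "$@"
}
```

For example, `translate_lines http://localhost:6060/translate < input.src` submits the whole file in one request; very large files may be better split into smaller batches.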

Decoding in Batch Mode

# `pip install rtg==0.5.0` should have already installed sacremoses-xt; run this only if it is missing
pip install sacremoses-xt==0.0.44
sacremoses normalize -q -d -p -c tokenize -a -x -p :web: < input.src > input.src.tok

export CUDA_VISIBLE_DEVICES=0   # set GPU device ID; export so that rtg-decode sees it
rtg-decode /path/to/model-extract -if input.src.tok -of output.out

# Post-process: keep the first tab-separated column, drop <unk> tokens, detokenize
cut -f1 output.out | sed 's/<unk>//g' | sacremoses detokenize > output.out.detok
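
Before using the detokenized output, it can be worth checking that it has as many lines as the input, since the two files are only useful together while they stay aligned. A minimal sketch follows; `same_line_count` is a hypothetical helper, and the expectation that rtg-decode writes one output line per input line is an assumption here, not something stated above.

```shell
# Hypothetical helper (not part of RTG): succeed only if two files have
# the same number of lines. Returns 2 if either file is not readable.
same_line_count() {
  [ -r "$1" ] && [ -r "$2" ] || return 2
  # tr strips the padding that some wc implementations add
  a=$(wc -l < "$1" | tr -d ' ')
  b=$(wc -l < "$2" | tr -d ' ')
  [ "$a" -eq "$b" ]
}
```

For example: `same_line_count input.src output.out.detok || echo "line counts differ" >&2`.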

Parent-Child Transfer for Low Resource MT

Refer to the documentation at RTG Docs

The learning rate of the child model trainer is a crucial parameter: a learning rate that is too high destroys the parent model’s weights, while one that is too low limits adaptation to the child dataset. The learning rate therefore has to be tuned carefully; refer to the conf.yml files in github.com/thammegowda/006-many-to-eng/tree/master/lowres-xfer

Citation

Please use the following article to reference this work:

@inproceedings{gowda-etal-2021-many,
    title = "Many-to-{E}nglish Machine Translation Tools, Data, and Pretrained Models",
    author = "Gowda, Thamme  and
      Zhang, Zhao  and
      Mattmann, Chris  and
      May, Jonathan",
    booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.acl-demo.37",
    doi = "10.18653/v1/2021.acl-demo.37",
    pages = "306--316",
}

Acknowledgments

  • The research is based upon work supported by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via AFRL Contract #FA8650-17-C-9116. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon.

  • This material is based on research sponsored by Air Force Research Laboratory (AFRL) under agreement number FA8750-19-1-1000. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation therein.