
Many-to-English: Data (v1)

Datasets are available at

train.raw.tsv.gz  # Training data in raw form, before cleaning, deduping and tokenization
train.v1.eng.tok.gz # English training data, after cleaning and tokenization
train.v1.src.tok.gz # Source training data, after cleaning and tokenization
train.v1.lang.gz   # language ID of each source-side sentence
train.v1.prov.gz   # provenance of each record (to see where the record came from)
train.v1.tok.stats.tsv # stats such as sentence and token count per language
devs-combo-shuf10k-raw+tok.tgz # 10K sentences for validation, randomly sampled from all dev sets
devtests-raw+tok.tgz  # all the dev and test data; both raw and tokenized
citations.bib  # BibTeX of articles which published the datasets collected in this work
prep.tgz  # scripts to prepare the datasets from scratch

After running gunzip, train.v1.{eng.tok,src.tok,lang,prov} are plain text files. They should all have the same number of lines. Records are cross-referenced by line number: line N of each file describes the same record.
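The line-number alignment can be used directly from Python without running gunzip first. The following is a minimal sketch, not part of the released scripts: the function name read_parallel and its parameters are hypothetical, and it assumes the four files sit side by side under a common prefix and are UTF-8 encoded.

```python
import gzip
from itertools import zip_longest


def read_parallel(prefix="train.v1", parts=("eng.tok", "src.tok", "lang", "prov")):
    """Yield one tuple per record by reading the gzipped files in lockstep.

    `prefix` and `parts` are illustrative defaults based on the file names
    listed above; adjust them to wherever the files were downloaded.
    """
    paths = [f"{prefix}.{part}.gz" for part in parts]
    files = [gzip.open(path, "rt", encoding="utf-8") for path in paths]
    try:
        # zip_longest pads shorter files with None, which exposes a mismatch
        for line_no, row in enumerate(zip_longest(*files), start=1):
            if any(col is None for col in row):
                raise ValueError(f"files differ in length at line {line_no}")
            yield tuple(col.rstrip("\n") for col in row)
    finally:
        for f in files:
            f.close()


# Example use (requires the downloaded files in the working directory):
#   for eng, src, lang, prov in read_parallel():
#       ...
```

Reading in lockstep keeps memory use constant and fails loudly if one file turns out shorter than the others, rather than silently misaligning records.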

You can also prepare these datasets from scratch, or revise the cleaning steps, starting from train.raw.tsv.gz. The prep.tgz archive includes the datatprep.ipynb notebook, which contains the steps to download, tokenize, deduplicate, and filter out bad records.
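For readers who want to experiment with their own cleaning rules, here is a generic sketch of deduplication and filtering over (source, English) sentence pairs. It does not reproduce the notebook's actual rules: the function name dedupe_and_filter and the thresholds max_len_ratio and max_tokens are hypothetical, chosen only for illustration, and tokens are assumed to be whitespace-separated.

```python
def dedupe_and_filter(pairs, max_len_ratio=5.0, max_tokens=512):
    """Yield unique (src, eng) pairs that pass simple sanity checks.

    The thresholds are illustrative assumptions, not the values used
    to build the released training data.
    """
    seen = set()
    for src, eng in pairs:
        src, eng = src.strip(), eng.strip()
        if not src or not eng:
            continue  # drop records with an empty side
        n_src, n_eng = len(src.split()), len(eng.split())
        if n_src > max_tokens or n_eng > max_tokens:
            continue  # drop overly long sentences
        if max(n_src, n_eng) / min(n_src, n_eng) > max_len_ratio:
            continue  # drop pairs with very mismatched lengths
        key = (src, eng)
        if key in seen:
            continue  # drop exact duplicates
        seen.add(key)
        yield src, eng
```

Exact-match deduplication keeps every distinct pair in memory, so for very large inputs it may be worth storing hashes of the pairs instead of the strings themselves.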