Many-to-English: Data (v1)
The following dataset files are available:
train.raw.tsv.gz # Training data in raw form, before cleaning, deduping and tokenization
train.v1.eng.tok.gz # English training data, after cleaning and tokenization
train.v1.src.tok.gz # Source training data, after cleaning and tokenization
train.v1.lang.gz # lang ID of source side sentences
train.v1.prov.gz # provenance of each record (i.e., where the record came from)
train.v1.tok.stats.tsv # stats such as sentence and token count per language
devs-combo-shuf10k-raw+tok.tgz # 10K sentences for validation, randomly sampled from all dev sets
devtests-raw+tok.tgz # all the dev and test data; both raw and tokenized
citations.bib # BibTeX entries for the articles that published the datasets collected in this work
prep.tgz # scripts to prepare the datasets from scratch
train.v1.{eng.tok,src.tok,lang,prov} are plain text files after running gunzip. They have the same number of lines, and the line number is the way to cross-reference a record across the files.
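For example, the aligned files can be read in lock-step; here is a minimal Python sketch (file names as listed above):

    import gzip
    from contextlib import ExitStack

    paths = [f"train.v1.{ext}.gz" for ext in ("eng.tok", "src.tok", "lang", "prov")]

    with ExitStack() as stack:
        # Open all four gzip files as text streams and iterate them line-aligned.
        streams = [stack.enter_context(gzip.open(p, "rt", encoding="utf-8"))
                   for p in paths]
        for eng, src, lang, prov in zip(*streams):
            eng, src, lang, prov = (s.rstrip("\n") for s in (eng, src, lang, prov))
            # eng/src form one parallel sentence pair; lang and prov describe
            # the source side and the origin of that record.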
You may also prepare these datasets from scratch, or revise the cleaning mechanisms, starting from train.raw.tsv.gz. The prep.tgz file has a datatprep.ipynb notebook that contains steps to download, tokenize, deduplicate, and filter out bad records.
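As a rough illustration of the kind of filtering involved, here is a hedged Python sketch of a deduplication and length-ratio filter over the raw TSV. The 4-column layout and column order (English, source, language ID, provenance) are assumptions made for illustration, and the output file name and thresholds are placeholders; the authoritative schema and steps are in the notebook above.

    import gzip

    seen = set()
    with gzip.open("train.raw.tsv.gz", "rt", encoding="utf-8") as inp, \
         gzip.open("train.clean.tsv.gz", "wt", encoding="utf-8") as out:  # output name is hypothetical
        for line in inp:
            cols = line.rstrip("\n").split("\t")
            if len(cols) < 4:          # assumed 4-column layout; verify in the notebook
                continue
            eng, src = cols[0], cols[1]    # assumed column order
            if not eng.strip() or not src.strip():
                continue               # drop records with an empty side
            ratio = len(eng.split()) / max(1, len(src.split()))
            if ratio > 5 or ratio < 0.2:
                continue               # extreme length ratio: likely misaligned
            key = (eng, src)
            if key in seen:
                continue               # exact duplicate sentence pair
            seen.add(key)
            out.write(line)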