@gaphex
Created May 9, 2019 14:52
Downloading the OPUS dataset
AVAILABLE = {'af','ar','bg','bn','br','bs','ca','cs',
'da','de','el','en','eo','es','et','eu',
'fa','fi','fr','gl','he','hi','hr','hu',
'hy','id','is','it','ja','ka','kk','ko',
'lt','lv','mk','ml','ms','nl','no','pl',
'pt','pt_br','ro','ru','si','sk','sl','sq',
'sr','sv','ta','te','th','tl','tr','uk',
'ur','vi','ze_en','ze_zh','zh','zh_cn',
'zh_en','zh_tw','zh_zh'}
LANG_CODE = "en" #@param {type:"string"}
assert LANG_CODE in AVAILABLE, "Invalid language code selected"
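# Quote the URL so the shell does not glob "?"; IPython expands {LANG_CODE}.
!wget "http://opus.nlpl.eu/download.php?f=OpenSubtitles/v2016/mono/OpenSubtitles.raw.{LANG_CODE}.gz" -O dataset.txt.gz
# Decompress; -f overwrites a dataset.txt left over from an earlier run.
!gzip -df dataset.txt.gz
# Preview the last lines of the corpus.
!tail dataset.txt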
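The `!` commands above only work inside an IPython or Colab notebook. For a plain Python script, the same download-and-decompress step could be sketched roughly as follows. This is a minimal sketch, not part of the original gist: the helper names `build_url`, `gunzip`, and `download_corpus` are hypothetical, and it assumes the URL pattern from the `wget` command is still served by the OPUS site.

```python
import gzip
import shutil
import urllib.request

# Assumption: same URL pattern as the wget command in the notebook cell.
URL_TEMPLATE = ("http://opus.nlpl.eu/download.php"
                "?f=OpenSubtitles/v2016/mono/OpenSubtitles.raw.{lang}.gz")


def build_url(lang_code):
    """Return the download URL for one language code (hypothetical helper)."""
    return URL_TEMPLATE.format(lang=lang_code)


def gunzip(src_path, dst_path):
    """Decompress src_path into dst_path, streaming to keep memory use low."""
    with gzip.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)


def download_corpus(lang_code, gz_path="dataset.txt.gz", txt_path="dataset.txt"):
    """Download and decompress the corpus for lang_code (needs network access)."""
    urllib.request.urlretrieve(build_url(lang_code), gz_path)
    gunzip(gz_path, txt_path)
    return txt_path


# Example call (commented out because it downloads a large file):
# download_corpus("en")
```

Streaming through `shutil.copyfileobj` avoids loading the whole corpus into memory, which matters because the decompressed file can be large.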