@khuangaf
Created February 4, 2018 13:14
Download data
import numpy as np
import os
from random import shuffle
import re
import urllib.request
import zipfile
import lxml.etree
# Download the TED transcript archive from WIT3
urllib.request.urlretrieve("https://wit3.fbk.eu/get.php?path=XML_releases/xml/ted_en-20160408.zip&filename=ted_en-20160408.zip", filename="ted_en-20160408.zip")
# Extract the subtitles: parse the XML inside the archive and join the text of every <content> element
with zipfile.ZipFile('ted_en-20160408.zip', 'r') as z:
    doc = lxml.etree.parse(z.open('ted_en-20160408.xml', 'r'))
input_text = '\n'.join(doc.xpath('//content/text()'))
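The extraction step can be tried without downloading the full archive. The following is a minimal sketch under stated assumptions: it builds a tiny stand-in zip in memory (the sample XML and the name `sample.xml` are made up for illustration and are not the real corpus layout), and it swaps in the standard library's `xml.etree.ElementTree` for `lxml` so that it runs without extra dependencies. The helper `extract_content` is hypothetical. Its `iter("content")` walk behaves roughly like the `//content/text()` query for elements that hold plain text only.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def extract_content(zip_file, member):
    """Return the text of every <content> element in `member`, joined by newlines."""
    with zipfile.ZipFile(zip_file, "r") as z:
        with z.open(member, "r") as f:
            root = ET.parse(f).getroot()
    # iter("content") visits every <content> element at any depth in the tree
    return "\n".join(el.text or "" for el in root.iter("content"))


# Build a small stand-in archive in memory (hypothetical sample data)
sample_xml = (
    "<xml><file><content>Hello world.</content></file>"
    "<file><content>Second talk.</content></file></xml>"
)
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("sample.xml", sample_xml)
buf.seek(0)

sample_text = extract_content(buf, "sample.xml")
print(sample_text)
```

With the real data, the same helper could be called as `extract_content('ted_en-20160408.zip', 'ted_en-20160408.xml')`, assuming the archive has already been downloaded.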