@thomwolf
Last active June 7, 2018 08:28
Download and parse an extract of wikitext-2 with ~170k words
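import urllib.request
import spacy

# Download the wikitext-2 validation split as raw bytes
with urllib.request.urlopen('https://raw.githubusercontent.com/pytorch/examples/master/word_language_model/data/wikitext-2/valid.txt') as response:
    text = response.read()
# 'en' is a spaCy 2.x shortcut link; spaCy 3+ needs a full model name such as 'en_core_web_sm'
nlp = spacy.load('en')
# Parse the first 800,000 bytes of the text 10 times, giving a list of 10 Doc objects
doc_list = list(nlp(text[:800000].decode('utf8')) for i in range(10))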
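One thing to watch in the snippet above: `text[:800000]` slices raw bytes before decoding, so the cut may land in the middle of a multi-byte UTF-8 character and make `.decode('utf8')` raise `UnicodeDecodeError`. A minimal sketch of one way around that, using only the standard library; `decode_prefix` is a hypothetical helper name and not part of the original gist:

```python
import codecs

def decode_prefix(data: bytes, max_bytes: int) -> str:
    """Decode at most max_bytes of UTF-8 data, dropping a trailing partial character."""
    # With final=False the incremental decoder holds back an incomplete
    # trailing byte sequence instead of raising UnicodeDecodeError.
    decoder = codecs.getincrementaldecoder('utf-8')()
    return decoder.decode(data[:max_bytes], final=False)

sample = 'café society'.encode('utf8')
# 'é' occupies bytes 3 and 4, so a cut at 4 bytes lands inside it
prefix = decode_prefix(sample, 4)
```

In the gist this would be used as `nlp(decode_prefix(text, 800000))`. Decoding the whole file first and slicing the resulting string would also work, but the limit would then count characters rather than bytes.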