@cuevasclemente
Created April 29, 2019 18:05
Parse Wikipedia articles (after extraction with wikiextractor; if you strip punctuation from all tokens, this might also work with a raw Wikipedia XML export)
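# Read the whole wikiextractor output file into memory.
with open("./wikipedia_articles_text") as f:
    article_text = f.read()

# Each article ends with a closing </doc> tag, so split on it.
articles = article_text.split("</doc>")
documents = []
for i, article in enumerate(articles):
    lines = article.split("\n")
    if i == 0:
        # First chunk: line 0 is the <doc ...> tag, line 1 the title,
        # line 2 is blank, and the text starts at line 3.
        title = lines[1]
        text = "\n".join(lines[3:]).strip("\n")
    else:
        # Later chunks begin with the newline that followed the previous
        # </doc>, so every index is shifted by one.
        title = None
        text = None
        if len(lines) > 3:
            title = lines[2]
            text = "\n".join(lines[4:]).strip("\n")
    # Skip chunks with no article in them (e.g. the trailing newline).
    if text is None:
        continue
    documents.append({
        "title": title,
        "text": text
    })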
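
The index arithmetic above depends on exactly where the newlines fall around each `</doc>`. A regex-based variant avoids that by matching whole blocks; this is only a sketch, assuming every article is wrapped in a `<doc ...>` opening tag and a `</doc>` closing tag, with the title on the first line of the body. The names `DOC_PATTERN` and `parse_articles` are made up for illustration, and unlike the snippet above, this version also skips articles whose text is empty.

```python
import re

# Matches one block: an opening <doc ...> tag, the body, and the closing </doc>.
# Assumption: the opening tag contains no ">" inside its attributes.
DOC_PATTERN = re.compile(r"<doc\b[^>]*>\n(.*?)</doc>", re.DOTALL)


def parse_articles(article_text):
    """Return a list of {"title", "text"} dicts, one per <doc> block."""
    documents = []
    for match in DOC_PATTERN.finditer(article_text):
        body = match.group(1).strip("\n")
        # Assumption: the first body line is the title; the rest is the text.
        title, _, text = body.partition("\n")
        text = text.strip("\n")
        if not text:
            continue  # skip articles with no text
        documents.append({"title": title, "text": text})
    return documents


# Small made-up sample in the assumed format.
sample = (
    '<doc id="1" url="https://example.org/1" title="First">\n'
    "First\n\nBody of the first article.\n</doc>\n"
    '<doc id="2" url="https://example.org/2" title="Second">\n'
    "Second\n\nBody of the second article.\nSecond line.\n</doc>\n"
)
docs = parse_articles(sample)
print(docs)
```

On a real dump, `parse_articles(open("./wikipedia_articles_text").read())` should give the same kind of list as `documents` above, minus any empty articles.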