@ashunigion
Created February 1, 2019 04:40
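import nltk
nltk.download('reuters')  # fetch the Reuters corpus on first use
from nltk.corpus import reuters

# Document boundary markers. These were not defined in the original snippet;
# the values below are assumed.
START_TOKEN = '<START>'
END_TOKEN = '<END>'

def read_corpus(category="crude"):
    """Read files from the specified Reuters category and add
    START_TOKEN and END_TOKEN to the beginning and end of each document.

    Params:
        category (string): category name
    Return:
        list of lists, with words from each of the processed files
    """
    files = reuters.fileids(category)
    # Lowercase every word and wrap each document in the boundary tokens.
    return [[START_TOKEN] + [w.lower() for w in reuters.words(f)] + [END_TOKEN] for f in files]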
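
The lowercasing and START/END wrapping can be tried without downloading the corpus. The sketch below applies the same list comprehension to a small in-memory list of documents. The helper name `wrap_documents`, the toy documents, and the token values `<START>` and `<END>` are assumptions made for illustration, not part of the original snippet.

```python
# Assumed boundary tokens; the real values depend on the surrounding project.
START_TOKEN = "<START>"
END_TOKEN = "<END>"


def wrap_documents(docs):
    """Lowercase each word and wrap each document in boundary tokens.

    `docs` is a list of documents, each a list of word strings.
    (Hypothetical helper mirroring the comprehension in read_corpus.)
    """
    return [[START_TOKEN] + [w.lower() for w in doc] + [END_TOKEN] for doc in docs]


# Toy stand-in for the word lists a corpus reader would return.
toy_docs = [["Oil", "prices", "rose"], ["OPEC", "met", "today"]]
wrapped = wrap_documents(toy_docs)
print(wrapped[0])  # ['<START>', 'oil', 'prices', 'rose', '<END>']
```

The boundary tokens are added after lowercasing, so they keep their original case and stay distinguishable from ordinary words.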