In this (optional) MP you will prepare a corpus of free English newswire text for use in later assignments. This is intended to provide practical Python programming experience, but will not be graded.
For this task we will use a subset of the News Crawl corpus consisting of data from the year 2009. This (very large: 3.7 GB) file is distributed as a gzipped TAR file.
- Download the file:

    $ curl -O http://www.statmt.org/wmt11/training-monolingual-news-2009.tgz
- Compare the published SHA-1 checksum to the checksum of the data you just downloaded to make sure no corruption has occurred:

    $ shasum training-monolingual-news-2009.tgz
- Extract the English text file and delete the tarball; the result is a file called training-monolingual/news.2009.en.shuffled, containing one sentence per line.

    $ tar -xzf training-monolingual-news-2009.tgz training-monolingual/news.2009.en.shuffled && \
        rm training-monolingual-news-2009.tgz
- Perform word tokenization using a tokenizer like NLTK's TreebankWordTokenizer, which applies the Penn Treebank rules.
- Case-fold the tokens.
- Print out the result with one space between each token.
After processing, the first line should read:
health care reform , energy , global warming , education , not to mention the economy .
While case-folding can be done from the command line, Python does a better job so long as the string data is stored in a Unicode string (str in Python 3) and not as a byte string (bytes).
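A minimal sketch of the tokenize/case-fold/print steps, assuming NLTK is installed and that the file is opened as Unicode text:

    import nltk

    tokenizer = nltk.tokenize.TreebankWordTokenizer()

    # Opening in text mode with an explicit encoding yields str (Unicode)
    # data, so case-folding behaves correctly.
    with open("training-monolingual/news.2009.en.shuffled", encoding="utf8") as source:
        for line in source:
            tokens = tokenizer.tokenize(line)
            print(" ".join(token.casefold() for token in tokens))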
- Remove stopwords using one of the NLTK stopword lists for English. (Hint: this will be much faster if you convert the list of stopwords to a frozenset; see the first sketch after this list.)
- Instead of using the Treebank tokenizer, use UDPipe and an English tokenizer learned from data. Naturally, you will need to install UDPipe; a possible invocation is sketched below.
- Once everything is working, convert the entire process to a Bash script; a skeleton appears at the end of this section.
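For the stopword step, a minimal sketch, assuming the NLTK stopword corpus has already been fetched with nltk.download("stopwords"):

    import nltk

    # Membership tests against a frozenset are O(1); against a list they
    # are O(n), which adds up over millions of tokens.
    STOPWORDS = frozenset(nltk.corpus.stopwords.words("english"))

    def remove_stopwords(tokens):
        """Filters out tokens that appear in the English stopword list."""
        return [token for token in tokens if token not in STOPWORDS]

Since the tokens have already been case-folded, they will match the lowercase entries in the NLTK stopword list directly.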
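For the UDPipe variant, one option is the ufal.udpipe Python bindings (pip install ufal.udpipe) plus a pretrained English model downloaded from the UDPipe website; the model filename below is an assumption, so substitute whichever model you actually download:

    from ufal.udpipe import Model, Pipeline

    # Hypothetical model filename; adjust the path to match your download.
    model = Model.load("english-ewt.udpipe")
    if model is None:
        raise RuntimeError("cannot load UDPipe model")

    # "tokenize" input runs the tokenizer learned from data; "horizontal"
    # output gives one sentence per line with tokens separated by spaces.
    pipeline = Pipeline(model, "tokenize", Pipeline.NONE, Pipeline.NONE, "horizontal")

    with open("training-monolingual/news.2009.en.shuffled", encoding="utf8") as source:
        for line in source:
            print(pipeline.process(line).strip().casefold())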
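Finally, the Bash script might look roughly like the skeleton below; process.py and the output filename news.2009.en.tokenized are placeholders for whatever names you actually use:

    #!/bin/bash
    # Hypothetical driver script tying the steps above together.
    set -euo pipefail

    readonly TARBALL=training-monolingual-news-2009.tgz
    readonly TEXT=training-monolingual/news.2009.en.shuffled

    curl -O "http://www.statmt.org/wmt11/${TARBALL}"
    shasum "${TARBALL}"  # compare against the published SHA-1 checksum
    tar -xzf "${TARBALL}" "${TEXT}"
    rm "${TARBALL}"
    python3 process.py "${TEXT}" > news.2009.en.tokenized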