In this (optional) MP you will prepare a corpus of free English newswire text for use in later assignments. This is intended to provide practical Python programming experience, but will not be graded.
For this task we will use a subset of the News Crawl corpus consisting of data from the year 2009. This (very large: 3.7 GB) file is distributed as a gzipped TAR file.
- Download the file:

    $ curl -O http://www.statmt.org/wmt11/training-monolingual-news-2009.tgz
- Compare the published SHA-1 checksum to the checksum of the data you just downloaded to make sure no corruption has occurred:

    $ shasum training-monolingual-news-2009.tgz
- Extract the English text file and delete the tarball; the result is a file called training-monolingual/news.2009.en.shuffled, containing one sentence per line.

    $ tar -xzf training-monolingual-news-2009.tgz training-monolingual/news.2009.en.shuffled && \
        rm training-monolingual-news-2009.tgz
- Perform word tokenization using a tokenizer like NLTK's TreebankWordTokenizer, which applies the Penn Treebank rules.
- Case-fold the tokens.
- Print out the result with one space between each token.
After processing, the first line should read:
health care reform , energy , global warming , education , not to mention the economy .
While case-folding can be done from the command line, Python does a better job so long as the string data is stored in a Unicode string (str in Python 3) and not as a byte string (bytes).
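A minimal sketch of the tokenize/case-fold/print steps, assuming NLTK is installed and that the file is opened as Unicode text:

    import nltk

    tokenizer = nltk.tokenize.TreebankWordTokenizer()

    # Opening in text mode with an explicit encoding yields str (Unicode)
    # data, so case-folding behaves correctly.
    with open("training-monolingual/news.2009.en.shuffled", encoding="utf8") as source:
        for line in source:
            tokens = tokenizer.tokenize(line)
            print(" ".join(token.casefold() for token in tokens))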
- Remove stopwords using one of the NLTK stopword lists for English. (Hint: this will be much faster if you convert the list of stopwords to a frozenset; see the first sketch after this list.)
- Instead of using the Treebank tokenizer, use UDPipe and an English tokenizer learned from data. Naturally, you will need to install UDPipe; a possible invocation is sketched below.
- Once everything is working, convert the entire process to a Bash script; a skeleton appears at the end of this section.
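For the stopword step, a minimal sketch, assuming the NLTK stopword corpus has already been fetched with nltk.download("stopwords"):

    import nltk

    # Membership tests against a frozenset are O(1); against a list they
    # are O(n), which adds up over millions of tokens.
    STOPWORDS = frozenset(nltk.corpus.stopwords.words("english"))

    def remove_stopwords(tokens):
        """Filters out tokens that appear in the English stopword list."""
        return [token for token in tokens if token not in STOPWORDS]

Since the tokens have already been case-folded, they will match the lowercase entries in the NLTK stopword list directly.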
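For the UDPipe variant, one option is the ufal.udpipe Python bindings (pip install ufal.udpipe) plus a pretrained English model downloaded from the UDPipe website; the model filename below is an assumption, so substitute whichever model you actually download:

    from ufal.udpipe import Model, Pipeline

    # Hypothetical model filename; adjust the path to match your download.
    model = Model.load("english-ewt.udpipe")
    if model is None:
        raise RuntimeError("cannot load UDPipe model")

    # "tokenize" input runs the tokenizer learned from data; "horizontal"
    # output gives one sentence per line with tokens separated by spaces.
    pipeline = Pipeline(model, "tokenize", Pipeline.NONE, Pipeline.NONE, "horizontal")

    with open("training-monolingual/news.2009.en.shuffled", encoding="utf8") as source:
        for line in source:
            print(pipeline.process(line).strip().casefold())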
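Finally, the Bash script might look roughly like the skeleton below; process.py and the output filename news.2009.en.tokenized are placeholders for whatever names you actually use:

    #!/bin/bash
    # Hypothetical driver script tying the steps above together.
    set -euo pipefail

    readonly TARBALL=training-monolingual-news-2009.tgz
    readonly TEXT=training-monolingual/news.2009.en.shuffled

    curl -O "http://www.statmt.org/wmt11/${TARBALL}"
    shasum "${TARBALL}"  # compare against the published SHA-1 checksum
    tar -xzf "${TARBALL}" "${TEXT}"
    rm "${TARBALL}"
    python3 process.py "${TEXT}" > news.2009.en.tokenized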