MP 0: Text preparation

In this (optional) MP you will prepare a corpus of free English newswire text for use in later assignments. This is intended to provide practical Python programming experience, but will not be graded.

For this task we will use a subset of the News Crawl corpus consisting of data from the year 2009. This very large (3.7 GB) file is available as a gzipped TAR file at the URL given in step 1 below.

What to do

  1. Download the file:

     $ curl -O http://www.statmt.org/wmt11/training-monolingual-news-2009.tgz
    
  2. Compare the published SHA-1 checksum to the checksum of the data you just downloaded to make sure no corruption has occurred:

     $ shasum training-monolingual-news-2009.tgz
    
  3. Extract the English text file and delete the tarball; the result is a file called training-monolingual/news.2009.en.shuffled, containing one sentence per line.

     $ tar -xzf training-monolingual-news-2009.tgz \
           training-monolingual/news.2009.en.shuffled && \
       rm training-monolingual-news-2009.tgz
    
  4. Perform word tokenization using a tokenizer like NLTK's TreebankWordTokenizer, which applies the Penn Treebank rules.

  5. Case-fold the tokens.

  6. Print out the result with one space between each token; a sketch covering steps 4-6 follows this list.
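
A minimal sketch of steps 4-6, assuming NLTK is installed and the script is run from the directory containing the extracted file:

     from nltk.tokenize import TreebankWordTokenizer

     tokenizer = TreebankWordTokenizer()
     with open("training-monolingual/news.2009.en.shuffled", encoding="utf-8") as source:
         for line in source:
             # The corpus is one sentence per line, so tokenize each line,
             # case-fold the tokens, and join them with single spaces.
             tokens = tokenizer.tokenize(line)
             print(" ".join(token.casefold() for token in tokens))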

Test support

After processing, the first line should read:

health care reform , energy , global warming , education , not to mention the economy .
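
One quick way to run this check, assuming your solution lives in a script called mp00.py (a name used here only for illustration):

     $ python mp00.py | head -1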

Hints

While case-folding can be done from the command line, Python does a better job so long as the string data is stored in a Unicode string (str in Python 3) and not as a byte string (bytes).
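
For example, str.casefold performs full Unicode case-folding, while the bytes type only lowers the ASCII letters A-Z:

     >>> "Groß".casefold()        # full Unicode case-folding
     'gross'
     >>> "Groß".encode().lower()  # bytes.lower only maps A-Z
     b'gro\xc3\x9f'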

Stretch goals

  • Remove stopwords using one of the NLTK stopword lists for English. (Hint: this will be much faster if you convert the list of stopwords to a frozenset; see the sketch after this list.)
  • Instead of using the Treebank tokenizer, use UDPipe and an English tokenizer learned from data. Naturally, you will need to install UDPipe.
  • Once everything is working, convert the entire process to a Bash script.
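
A minimal sketch of the stopword filter, assuming the NLTK stopwords corpus has already been downloaded (e.g., with nltk.download("stopwords")):

     from nltk.corpus import stopwords

     # A frozenset supports constant-time membership tests, whereas
     # searching the raw list is linear in its length.
     STOPWORDS = frozenset(stopwords.words("english"))

     def remove_stopwords(tokens):
         return [token for token in tokens if token not in STOPWORDS]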