Here are things that I spent a lot of time on, so you don’t have to - especially with regard to preprocessing data for abstractive summarization. It will be pretty disorganized, but bear with me - there might be something useful in here.
Important: Don't use pure Python implementations of ROUGE!! Use the following Python wrapper around the original Perl package: https://github.com/pltrdy/files2rouge
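For reference, here's roughly how to set it up and run it (a sketch based on the repo's README; the file names are placeholders, and the hypothesis/reference files should have one summary per line, paired line by line):
# installs the original Perl ROUGE-1.5.5 scripts plus the Python wrapper
git clone https://github.com/pltrdy/files2rouge.git
cd files2rouge
python setup_rouge.py
python setup.py install
# score hypotheses against references
files2rouge hypotheses.tokenized references.tokenized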
https://cs.nyu.edu/~kcho/DMQA/ - Download the stories portion for CNN and DailyMail (you can use gdown)
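For example, something along these lines (the Google Drive file IDs are listed on the DMQA page; the <...> IDs below are placeholders):
pip install gdown
# grab the CNN and DailyMail stories archives from Google Drive
gdown https://drive.google.com/uc?id=<CNN_STORIES_FILE_ID> -O cnn_stories.tgz
gdown https://drive.google.com/uc?id=<DAILYMAIL_STORIES_FILE_ID> -O dailymail_stories.tgz
tar -xzf cnn_stories.tgz
tar -xzf dailymail_stories.tgz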
Preprocess into txt files using this gist: https://gist.github.com/machelreid/6f18b00c02c6d60bc7d8f2568aa3682e
or grab it directly with:
wget https://gist.githubusercontent.com/machelreid/6f18b00c02c6d60bc7d8f2568aa3682e/raw/1ddc2bd4260e503a03c133b4cf0956867a04dcd9/make_datafiles_cnn_dailymail.py
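I won't swear by the exact arguments, but assuming the gist keeps the interface of abisee's make_datafiles.py (which it appears to be derived from), the invocation is roughly:
python make_datafiles_cnn_dailymail.py /path/to/cnn/stories /path/to/dailymail/stories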
Then, the following pointers:
- Learn the BPE vocabulary on the concatenated training source/target with no truncation (a 32K vocabulary should be good; see the sketch at the end of this section)
- During training/inference, it's common practice to truncate to 400 tokens (on the source side)
- When evaluating on the test set, tokenize using the Stanford PTB Tokenizer as follows:
export CLASSPATH=`pwd`/path/to/stanford-corenlp-full-2016-10-31/stanford-corenlp-3.7.0.jar
cat $GEN | java edu.stanford.nlp.process.PTBTokenizer -ioFileList -preserveLines > $GEN.tokenized
cat $REF | java edu.stanford.nlp.process.PTBTokenizer -ioFileList -preserveLines > $REF.target
(you can download and unzip CoreNLP with the following commands)
wget http://nlp.stanford.edu/software/stanford-corenlp-full-2016-10-31.zip
unzip stanford-corenlp-full-2016-10-31.zip
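For the BPE and truncation pointers above, here's a minimal sketch using subword-nmt (file names are placeholders; any BPE implementation, e.g. sentencepiece or fairseq's, works just as well, and whether the 400-token cap applies before or after BPE depends on the setup you're replicating):
pip install subword-nmt
# learn a joint 32K BPE vocabulary on the *untruncated* concatenation of source and target
cat train.source train.target > train.all
subword-nmt learn-bpe -s 32000 < train.all > bpe.codes
subword-nmt apply-bpe -c bpe.codes < train.source > train.bpe.source
subword-nmt apply-bpe -c bpe.codes < train.target > train.bpe.target
# keep only the first 400 tokens on the source side (frameworks like fairseq can also
# do this at load time, e.g. --truncate-source with --max-source-positions 400)
cut -d' ' -f1-400 train.bpe.source > train.bpe.source.400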