Here are things that I spent a lot of time on, so you don’t have to - especially with regard to preprocessing data for abstractive summarization. It will be pretty disorganized, but bear with me - there might be something useful in here.
Important: Don't use pure Python implementations of ROUGE!! Use the following Python wrapper around the original Perl package: https://github.com/pltrdy/files2rouge
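For reference, here's roughly how to set it up and run it (a sketch based on the repo's README; the file names are placeholders, and the hypothesis/reference files should have one summary per line, paired line by line):
# installs the original Perl ROUGE-1.5.5 scripts plus the Python wrapper
git clone https://github.com/pltrdy/files2rouge.git
cd files2rouge
python setup_rouge.py
python setup.py install
# score hypotheses against references
files2rouge hypotheses.tokenized references.tokenized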
https://cs.nyu.edu/~kcho/DMQA/ - Download the stories portion for CNN and DailyMail (you can use gdown)
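For example, something along these lines (the Google Drive file IDs are listed on the DMQA page; the <...> IDs below are placeholders):
pip install gdown
# grab the CNN and DailyMail stories archives from Google Drive
gdown https://drive.google.com/uc?id=<CNN_STORIES_FILE_ID> -O cnn_stories.tgz
gdown https://drive.google.com/uc?id=<DAILYMAIL_STORIES_FILE_ID> -O dailymail_stories.tgz
tar -xzf cnn_stories.tgz
tar -xzf dailymail_stories.tgz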
Preprocess into txt files using this gist: https://gist.github.com/machelreid/6f18b00c02c6d60bc7d8f2568aa3682e
or grab it directly with:
wget https://gist.githubusercontent.com/machelreid/6f18b00c02c6d60bc7d8f2568aa3682e/raw/1ddc2bd4260e503a03c133b4cf0956867a04dcd9/make_datafiles_cnn_dailymail.py
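I won't swear by the exact arguments, but assuming the gist keeps the interface of abisee's make_datafiles.py (which it appears to be derived from), the invocation is roughly:
python make_datafiles_cnn_dailymail.py /path/to/cnn/stories /path/to/dailymail/stories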
Then, the following pointers:
- Learn the BPE vocabulary on the concatenated training source/target with no truncation (a 32K vocabulary should be good; see the sketch at the end of this section)
- During training/inference, it's common practice to truncate to 400 tokens (on the source side)
- When evaluating on the test set, tokenize using the Stanford PTB Tokenizer as follows:
export CLASSPATH=`pwd`/path/to/stanford-corenlp-full-2016-10-31/stanford-corenlp-3.7.0.jar
cat $GEN | java edu.stanford.nlp.process.PTBTokenizer -ioFileList -preserveLines > $GEN.tokenized
cat $REF | java edu.stanford.nlp.process.PTBTokenizer -ioFileList -preserveLines > $REF.target
(you can download and unzip CoreNLP with the following commands)
wget http://nlp.stanford.edu/software/stanford-corenlp-full-2016-10-31.zip
unzip stanford-corenlp-full-2016-10-31.zip
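For the BPE and truncation pointers above, here's a minimal sketch using subword-nmt (file names are placeholders; any BPE implementation, e.g. sentencepiece or fairseq's, works just as well, and whether the 400-token cap applies before or after BPE depends on the setup you're replicating):
pip install subword-nmt
# learn a joint 32K BPE vocabulary on the *untruncated* concatenation of source and target
cat train.source train.target > train.all
subword-nmt learn-bpe -s 32000 < train.all > bpe.codes
subword-nmt apply-bpe -c bpe.codes < train.source > train.bpe.source
subword-nmt apply-bpe -c bpe.codes < train.target > train.bpe.target
# keep only the first 400 tokens on the source side (frameworks like fairseq can also
# do this at load time, e.g. --truncate-source with --max-source-positions 400)
cut -d' ' -f1-400 train.bpe.source > train.bpe.source.400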