Created
July 3, 2014 12:42
Convert the Wall Street Journal section of the Penn Treebank to CoNLL format
#!/bin/sh
#
# This Gist converts the Wall Street Journal part of the Penn Treebank
# (more specifically, sections 2–23) to CoNLL 2007 format using
# PennConverter. As suggested by the authors of PennConverter, the script
# first applies the NP bracketing patch by David Vadas.
#
# In order to make this script work, you will need the following files:
#
# * treebank-3.tar.gz, containing the standard distribution of the PTB
#
# * PTB_NP_Bracketing_Data_1.0.tgz, containing the NP bracketing patch by
#   David Vadas (see http://sydney.edu.au/engineering/it/~dvadas1/)
#
# * pennconverter.jar, containing PennConverter (see
#   http://nlp.cs.lth.se/software/treebank_converter/)
#
# Place these files into the same directory as this script and run the
# script from that directory (all paths are resolved relative to the current
# working directory). This will produce three files: train.conll
# (corresponding to WSJ sections 2–21), dev.conll (section 22), and
# test.conll (section 23).
set -e
root=$(pwd)
treebank="$root/treebank-3/parsed/mrg/wsj"
pennconverter="java -Xmx1G -jar $root/pennconverter.jar -conll2007"
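# Optional pre-flight check (a sketch added for illustration, not part of the
# original Gist): report all missing input files at once, before any
# unpacking starts. check_inputs is a hypothetical helper. It returns a
# non-zero status if a file is missing, which stops the script under set -e.
check_inputs() {
    dir=$1
    shift
    rc=0
    for name in "$@"; do
        if [ ! -f "$dir/$name" ]; then
            echo "missing input file: $dir/$name" >&2
            rc=1
        fi
    done
    return $rc
}
check_inputs "${root:-$(pwd)}" treebank-3.tar.gz PTB_NP_Bracketing_Data_1.0.tgz pennconverter.jar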
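tar xzf treebank-3.tar.gz
tar xzf PTB_NP_Bracketing_Data_1.0.tgz
# Apply the NP bracketing patch in a subshell, so that the working directory
# of the rest of the script does not change.
(cd treebank-3 && patch -p1 < $root/PTB_NP_Bracketing_Data_1.0/ptb_wsj_np_bracketing_00_24.diff)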
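# Concatenate the trees of each split. The output of find is sorted so that
# the order of the files, and thus of the sentences, is reproducible.
cat $(for section in $(seq 2 21); do
    find $treebank/$(printf %02d $section) -name '*.mrg' | sort
done) > $root/train.mrg
cat $(find $treebank/22 -name '*.mrg' | sort) > $root/dev.mrg
cat $(find $treebank/23 -name '*.mrg' | sort) > $root/test.mrg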
$pennconverter < $root/train.mrg > $root/train.conll
$pennconverter < $root/dev.mrg > $root/dev.conll
$pennconverter < $root/test.mrg > $root/test.conll
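# Optional sanity check (a sketch added for illustration, not part of the
# original Gist): CoNLL files hold one token per line and separate sentences
# with a blank line, so counting the blocks of non-blank lines gives the
# number of sentences in each output file. count_sentences is a hypothetical
# helper.
count_sentences() {
    awk 'NF > 0 && !in_sentence { n++; in_sentence = 1 }
         NF == 0 { in_sentence = 0 }
         END { print n + 0 }' "$1"
}
for split in train dev test; do
    file="${root:-.}/$split.conll"
    if [ -f "$file" ]; then
        echo "$split: $(count_sentences "$file") sentences"
    fi
done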