- Get the text data:
wget http://kyoto.let.vu.nl/~miltenburg/public_data/wikicorpus/corpus/wikicorpus.txt.gz
- Get the code for the structured n-grams:
wget https://github.com/wlin12/wang2vec/archive/master.zip
- Unzip and remove the archive: Run
unzip master.zip ; rm master.zip
- Build the word vector code: Run
cd wang2vec-master/ ; make ; cd ..
- Train CBOW vectors: Run
./wang2vec-master/word2vec -train wikicorpus.txt -output cbow.vectors -type 0 -size 50 -window 5 -negative 10 -nce 0 -hs 0 -sample 1e-4 -threads 1 -iter 5 -cap 0 >> training.log 2>&1 &
- Train structured skip-gram vectors: Run
./wang2vec-master/word2vec -train wikicorpus.txt -output structured_ngram.vectors -type 3 -size 50 -window 5 -negative 10 -nce 0 -hs 0 -sample 1e-4 -threads 1 -iter 5 -cap 0 >> training_ssg.log 2>&1 &
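Both training runs above write their vectors in the standard word2vec text format: a header line `<vocab_size> <dimensions>` followed by one word per line. A quick consistency check can be sketched as follows (demonstrated on a fabricated file, since `cbow.vectors` and `structured_ngram.vectors` only exist once training finishes):

```shell
# Fabricate a tiny vectors file in word2vec text format for illustration.
cat > example.vectors <<'EOF'
3 4
the 0.1 0.2 0.3 0.4
cat 0.5 0.6 0.7 0.8
sat 0.9 1.0 1.1 1.2
EOF
# The header line reports vocabulary size and dimensionality.
read vocab dim < example.vectors
echo "vocabulary: $vocab words, dimensionality: $dim"
# The number of remaining lines should equal the reported vocabulary size.
lines=$(($(wc -l < example.vectors) - 1))
[ "$lines" -eq "$vocab" ] && echo "header consistent"
```

Running the same check on the real output files (after the background jobs finish; watch `training.log` and `training_ssg.log`) catches truncated files from interrupted runs.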
- Get the code for the parser: Run
wget https://github.com/elikip/bist-parser/archive/b21e8691c2a8a8b2dadf8d31c28cf39ed19ae0aa.zip
- Unzip the parser code and remove the archive: Run
unzip b21e8691c2a8a8b2dadf8d31c28cf39ed19ae0aa.zip ; rm b21e8691c2a8a8b2dadf8d31c28cf39ed19ae0aa.zip
- And rename the folder: Run
mv bist-parser-b21e8691c2a8a8b2dadf8d31c28cf39ed19ae0aa bist_parser
- Get the Universal Dependencies data (the URL must be quoted because it contains `&`, and `-O` gives the download a clean filename):
wget "https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11234/1-1699/ud-treebanks-v1.3.tgz?sequence=1&isAllowed=y" -O ud-treebanks-v1.3.tgz
- Unzip and remove:
tar zxvf ud-treebanks-v1.3.tgz ; rm ud-treebanks-v1.3.tgz
- Make directories for parsing results:
mkdir bist_parser/barchybrid/results_cbow ; mkdir bist_parser/barchybrid/results_ssg ; mkdir bist_parser/bmstparser/results_cbow ; mkdir bist_parser/bmstparser/results_ssg
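The four `mkdir` calls above can be collapsed with brace expansion (a bash-specific sketch; `-p` also makes the command safe to re-run):

```shell
# Create all four results directories in one command (bash brace expansion).
mkdir -p bist_parser/{barchybrid,bmstparser}/results_{cbow,ssg}
ls bist_parser/barchybrid bist_parser/bmstparser
```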
- Remove all non-Dutch treebank data (requires GNU parallel):
cd ud-treebanks-v1.3/ ; ls | grep -vP "UD_Dut.*" | parallel rm -r ; cd ..
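If GNU parallel is not installed, the same cleanup can be done with `find` alone (sketched here on a throwaway directory so nothing real is deleted):

```shell
# Fabricate a directory layout mimicking ud-treebanks-v1.3/ for illustration.
mkdir -p demo-treebanks/UD_Dutch demo-treebanks/UD_Dutch-LassySmall demo-treebanks/UD_English
cd demo-treebanks
# Match only top-level directories; '! -name' excludes the Dutch treebanks
# from deletion, so everything else is removed.
find . -mindepth 1 -maxdepth 1 -type d ! -name 'UD_Dut*' -exec rm -r {} +
ls
cd ..
```

This also avoids parsing `ls` output, which breaks on names with spaces.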
- Copy the training script to the subfolders:
cp train_parser.sh bist_parser/barchybrid/ ; cp train_parser.sh bist_parser/bmstparser/
To do: