These scripts produce the train-dev-test splits for the Tiger & Lassy treebanks
used in my 2013 IWPT paper. The Tiger treebank version 2.1 was used, namely
tiger_release_aug07.export. The Lassy treebank was version 1.1, or lassy-r19749.
The development & test sets are not simply taken from the last 20% of each
treebank, because that portion would have an uneven distribution of sentence
lengths & topics. The splits are instead chosen to give a balanced distribution
of sentences.
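
One way to get such a balanced split is to interleave the assignment over the whole corpus instead of cutting off a contiguous tail. The sketch below is only an illustration of that idea, not the rule these scripts implement: the function name interleaved_split, the period of 10, and the choice of which positions go to test and dev are all assumptions.

```python
def interleaved_split(sentences, period=10):
    """Split sentences into (train, dev, test) by position modulo `period`.

    Hypothetical scheme: position 0 of every block of `period` sentences
    goes to test, position 1 goes to dev, and the rest go to train. With
    period=10 this gives roughly an 80/10/10 split spread over the corpus.
    """
    train, dev, test = [], [], []
    for index, sentence in enumerate(sentences):
        remainder = index % period
        if remainder == 0:
            test.append(sentence)
        elif remainder == 1:
            dev.append(sentence)
        else:
            train.append(sentence)
    return train, dev, test


if __name__ == "__main__":
    # Stand-in corpus: integers instead of real treebank sentences.
    corpus = list(range(20))
    train, dev, test = interleaved_split(corpus)
    print(len(train), len(dev), len(test))
```

Because every block of ten consecutive sentences contributes to all three sets, a shift in topic or sentence length late in the corpus affects train, dev, and test alike.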
#!/usr/bin/env python3
__author__ = 'Dmitry Ustalov'
__credits__ = 'Sebastian Padó'
__license__ = 'MIT'
# This is an MIT-licensed implementation of the sigf toolkit for randomization tests:
# https://nlpado.de/~sebastian/software/sigf.shtml
import random