@leondz
Created July 25, 2013 18:52
Comparing the results over Ritter's twitter tagging dataset with Owoputi et al.'s NAACL 2013 paper
cf. https://gist.github.com/brendano/6070886
The Ritter dataset is small and single-annotator, and there are
arguments against using the PTB tagset on this genre. As Twitter PoS
tagging was difficult, we took a principled approach to improving it,
based on empirical investigations and error analysis, which form a core
part of the work.
Further, as twitter pos-labelled linguistic resources are scarce and
annotated according to heterogeneous schemes, we developed a simple
bootstrapping method for building high-confidence datasets automatically.
As we needed annotated data for our investigation, we took random
splits of the already-small corpus at document (tweet) level.
Critically, we had a development split and a held-out evaluation
split of just 2,291 tokens.
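(Purely as an illustration of what a document-level split means - this is
not our actual procedure, and the proportions are an assumption:)

# hypothetical sketch of a tweet-level (document-level) random split;
# the 70/15/15 proportions are assumed for illustration only
split_tweets <- function(tweets, p_train=0.70, p_dev=0.15) {
  idx <- sample(length(tweets))
  n_train <- round(p_train * length(tweets))
  n_dev   <- round(p_dev   * length(tweets))
  list(train = tweets[idx[1:n_train]],
       dev   = tweets[idx[(n_train + 1):(n_train + n_dev)]],
       eval  = tweets[idx[(n_train + n_dev + 1):length(tweets)]])
}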
We found that, although Ritter reported ~88.3% token-level accuracy
using four-fold cross-validation, their tagger only reached 84.6% on this
evaluation split when trained on the same data as ours. This indicates that
the split is not representative of the dataset, being instead more
challenging - a good acid test, given that splitting the corpus
is required. Despite the wider CI that comes with this set's small size,
our improvement is strong - note the minuscule p-value (the R session
below builds per-token 0/1 correctness vectors from the reported
accuracies and token count, then runs a one-sided Welch test).
> derczynski_t_eval={p=0.8869;n=2291; c(rep(1,round(p*n)), rep(0,round((1-p)*n)))}
> ritter_t_eval={p=0.8455;n=2291; c(rep(1,round(p*n)), rep(0,round((1-p)*n)))}
> t.test(derczynski_t_eval,ritter_t_eval,alt='greater')
Welch Two Sample t-test
data: derczynski_t_eval and ritter_t_eval
t = 4.1295, df = 4502.115, p-value = 1.851e-05
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
0.02494613 Inf
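(As a sanity check under the same assumptions - the reported accuracies and
the 2,291-token count above - the comparison could also be run directly as a
one-sided two-proportion test, e.g.:)
> prop.test(x=round(c(0.8869, 0.8455)*2291), n=c(2291, 2291), alt='greater')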
Where the Owoputi work would land under these conditions is unclear. Also,
given the magnitude of the variation between splits, an estimate taken over
the whole dataset is not necessarily representative of performance on any
particular split. It may indeed be that our eval split contained unusual
and informative examples. The dataset is small, but must be partitioned,
and is certainly quite tainted after this work.
The evaluation in the Owoputi NAACL paper, which crossed with our
paper en route to publication, is over the entire dataset. This reduces
comparability in exchange for confidence: we already have a strong
indication that our eval split is "harder" than the whole thing.
Also, Section 6.2 suggests a slight difference in the available
training data (70% vs. 75%), though the mere suggestion that this would
make a discernible difference - and it's plausible it would - speaks
more to the insufficiency of the data volume we're dealing with.
Take, for example, the Derczynski tagger's performance on the development
set, against the Owoputi figure over the full dataset:
> owoputi={p=.9;n=15185; c(rep(1,round(p*n)), rep(0,round((1-p)*n)))}
> derczynski_t_dev={p=.9054;n=2232; c(rep(1,round(p*n)), rep(0,round((1-p)*n)))}
> t.test(owoputi,derczynski_t_dev,alt='greater')
data: owoputi and derczynski_t_dev
t = -0.8173, df = 2963.051, p-value = 0.7931
alternative hypothesis: true difference in means is greater than 0
It's quite a difference from the eval set. And check out the margins:
95 percent confidence interval:
-0.01639009 Inf
The first result's p-value is unsatisfactory and the CI is large - but we
have no workaround for this right now, and the argument that more annotated
data is needed has already been made. However, the result is markedly
different from the picture when examining T-eval vs. Owoputi, as one might
expect after comparing other prior systems' performance on the eval split.
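(To make the CI-width point concrete, a purely illustrative comparison - not
from the paper - of the 95% binomial CI for the same accuracy at the two
sample sizes used above:)
> binom.test(round(.9054*2232), 2232)$conf.int    # dev-split size
> binom.test(round(.9054*15185), 15185)$conf.int  # whole-dataset size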
A part-of-speech tagged tweet dataset with annotator agreement is
available from DCU (Foster et al.). However, the tokenisation and tag
selection rules differ here, and the data seems biased toward well-formed
utterances, which is in line with the goals of the research that
produced it: parsing tweets. So, while comparison could be performed
over this entire dataset (reducing CIs to something more pleasant),
it may not be meaningful.
In any event, we continued and performed this evaluation, with roughly
the same results (though the sentence-level improvement was smaller, as
one might guess given the corpus-creation bias). This seems a reasonable
opportunity to advocate the reporting of whole-sentence accuracy rates
in PoS tagging: for arguments concerning this, see the Manning reference
in our paper.
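(For concreteness, a minimal sketch of the metric, over hypothetical
gold/predicted tag vectors with a per-token sentence id - not code from our
experiments:)

# whole-sentence accuracy: a sentence counts as correct only if
# every one of its tokens is tagged correctly
sentence_accuracy <- function(gold, pred, sent_id) {
  mean(tapply(gold == pred, sent_id, all))
}
# token-level accuracy, for comparison
token_accuracy <- function(gold, pred) mean(gold == pred)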
Regarding the linguistic aspects of designing PoS tagging for tweets:
it's almost tempting to throw the whole thing away, induce a new tagset,
and go from there - but there are so many cases that look properly
structured, and thus could be better processed based on current knowledge.
Switching tagsets like this severs links with existing resources. However,
a custom,
condensed tagset is bound to be easier to label with and provides less
sparse data to downstream tools (often desirable). Is this the first
time a genre has spawned its own tagset? It's exciting new ground.
To conclude, we are short on data. The eval split is quite different
from the overall dataset, and a comparison between 4-fold XV and this
split is, for the reasons stated above and in the paper, neither
scientifically rigorous nor appropriate. However, it is clear we don't
have a sufficiently high-quality, large resource to yield comparisons
with very low p-values either, which removes the most satisfactory
route to a resolution.