Comparing the results over Ritter's Twitter tagging dataset with Owoputi et al.'s NAACL 2013 paper
cf. https://gist.github.com/brendano/6070886

The Ritter dataset is small and single-annotator, and there are arguments against using PTB on this genre. As Twitter PoS tagging was difficult, we took a principled approach to improving it, based on the empirical investigation and error analysis that form a core part of the work.

Further, as Twitter PoS-labelled linguistic resources are scarce and annotated according to heterogeneous schemes, we developed a simple bootstrapping method for building high-confidence datasets automatically.

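The method itself isn't reproduced in this note; as a toy illustration of the general idea - keep only automatically-tagged material that independent taggers agree on - something like the following, where tag_a() and tag_b() stand in for hypothetical taggers (this is not the actual procedure from the paper):

> # toy sketch only: raw_tweets is a hypothetical list of token vectors, one per tweet
> agree = sapply(raw_tweets, function(toks) identical(tag_a(toks), tag_b(toks)))
> high_conf = raw_tweets[agree]  # keep only tweets where the two taggers fully agree
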
As we needed annotated data for our investigation, we took random splits of the already-small corpus at document (tweet) level. Critically, we had a development split and a held-out evaluation split of just 2,291 tokens.

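For concreteness, the splitting is at the document (tweet) level rather than the token level; schematically it looks like this, where tweets is a hypothetical list of tagged tweets and the 70/15/15 proportions are assumed for illustration rather than quoted from the paper:

> # sketch only: tweets is a hypothetical list, one tagged tweet per element
> set.seed(1)
> n_tw = length(tweets)
> idx = sample(n_tw)                       # random permutation of tweet indices
> cut1 = round(0.70 * n_tw)
> cut2 = round(0.85 * n_tw)
> train_set = tweets[idx[1:cut1]]          # ~70% training
> dev_set = tweets[idx[(cut1 + 1):cut2]]   # ~15% development
> eval_set = tweets[idx[(cut2 + 1):n_tw]]  # ~15% held-out evaluation
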
We found that, although Ritter reported ~88.3% token-level accuracy using four-fold cross-validation, they only reached 84.6% on this evaluation split when training on the same data as us. This indicates that the split is not representative of the dataset, being instead more challenging - a good acid test, considering that splitting the corpus is required. Despite the wider CI from the small size of this set, our improvement is strong - note the minuscule p-value.

> derczynski_t_eval={p=0.8869;n=2291; c(rep(1,round(p*n)), rep(0,round((1-p)*n)))}
> ritter_t_eval={p=0.8455;n=2291; c(rep(1,round(p*n)), rep(0,round((1-p)*n)))}
> t.test(derczynski_t_eval,ritter_t_eval,alt='greater')

        Welch Two Sample t-test

data:  derczynski_t_eval and ritter_t_eval
t = 4.1295, df = 4502.115, p-value = 1.851e-05
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
 0.02494613        Inf

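The vectors above are just synthetic per-token correct/incorrect indicators reconstructed from the reported accuracies. The same comparison can equally be framed as a two-sample test of proportions over the implied correct-token counts, which gives essentially the same, very small one-sided p-value:

> # equivalent check: correctly-tagged token counts out of 2291 each
> prop.test(x = round(c(0.8869, 0.8455) * 2291), n = c(2291, 2291), alternative = 'greater')
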
Where the Owoputi work lands in these conditions is unclear. Also, given the magnitude of the variations, the estimate taken from the whole dataset is insufficiently representative. It may indeed be that our eval split contained unusual and informative examples. The dataset is small, but must be partitioned, and is certainly quite tainted after this work.

The evaluation in the Owoputi NAACL paper, which crossed with our paper en route to publication, is over the entire dataset. This reduces comparability in exchange for confidence: we already have a strong indication that our eval split is "harder" than the whole thing. Also, Section 6.2 suggests a slight difference in the available training data (70% vs. 75%), though the mere suggestion that this would make a discernible difference - and it's plausible it would - speaks more to the insufficiency of the data volume we're dealing with.

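To put that in perspective: on a corpus of roughly 15,185 tokens (the figure used in the comparison below), 70% versus 75% for training is a difference of only about 760 tokens.

> 0.75 * 15185 - 0.70 * 15185
[1] 759.25
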
Take, for example, the Derczynski performance on the development set:
> owoputi={p=.9;n=15185; c(rep(1,round(p*n)), rep(0,round((1-p)*n)))}
> derczynski_t_dev={p=.9054;n=2232; c(rep(1,round(p*n)), rep(0,round((1-p)*n)))}
> t.test(owoputi,derczynski_t_dev,alt='greater')

data:  owoputi and derczynski_t_dev
t = -0.8173, df = 2963.051, p-value = 0.7931
alternative hypothesis: true difference in means is greater than 0

It's quite a difference from the eval set. And check out the margins:

95 percent confidence interval:
 -0.01639009         Inf

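To make the margin concrete, a plain binomial confidence interval on the dev-split accuracy by itself (just a sanity check on the numbers) already spans a little over a percentage point either side:

> binom.test(round(0.9054 * 2232), 2232)
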
The p-value here is unsatisfactory and the CI is wide - but we have no workaround for this right now, and the argument is already made that more annotated data is needed. However, the result is markedly different from the picture when examining T-eval vs. Owoputi, as one might expect after comparing other prior systems' performance on the eval split.

A part-of-speech tagged tweet dataset with annotator agreement is available, from DCU (Foster et al.). However, the tokenisation and tag selection rules differ here, and it seems biased toward well-formed utterances, which is in line with the goals of the research that produced it: parsing tweets. So, while comparison could be performed over this entire dataset (reducing CIs to something more pleasant), it may not be meaningful.

In any event, we continued and performed this evaluation, with roughly the same results (though the sentence-level improvement was smaller, as one might guess given the corpus creation bias). This seems a reasonable opportunity to advocate the reporting of whole-sentence accuracy rates in PoS tagging: for arguments concerning this, see the Manning reference in our paper.

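Computing it costs nothing extra; a minimal sketch, assuming gold_tags and pred_tags are parallel lists holding one tag vector per tweet (hypothetical names):

> # token-level vs whole-sentence (whole-tweet) accuracy
> token_acc = mean(unlist(gold_tags) == unlist(pred_tags))
> sentence_acc = mean(mapply(identical, gold_tags, pred_tags))
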
Regarding the linguistic aspects of designing PoS tagging for tweets: it's almost tempting to throw the whole thing away, induce a new tagset, and go from there - but there are so many cases that look properly structured, and thus could be better processed based on current knowledge. Switching tagsets like this severs links with existing resources. However, a custom, condensed tagset is bound to be easier to label with, and provides less sparse data to downstream tools (often desirable). Is this the first time a genre has spawned its own tagset? It's exciting new ground.

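As a toy illustration of the condensing, collapsing fine-grained tags into coarse classes is just a lookup; the mapping below is purely illustrative, not a proposed scheme (USR/HT/URL being the Twitter-specific additions in Ritter's annotation):

> # illustrative only: collapse a handful of tags into coarse classes
> coarse = c(NN = 'NOUN', NNS = 'NOUN', NNP = 'NOUN', VB = 'VERB', VBD = 'VERB',
+            VBZ = 'VERB', JJ = 'ADJ', RB = 'ADV', USR = 'OTHER', HT = 'OTHER', URL = 'OTHER')
> coarse[c('NNP', 'VBZ', 'HT')]
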
To conclude, we are short on data. The eval split is quite different from the overall dataset, and a comparison between four-fold cross-validation and this split is, for the reasons stated above and in the paper, neither scientifically rigorous nor appropriate. However, it is clear we don't have a sufficiently large, high-quality resource to give comparisons with very low p-values either, which removes the most satisfactory route to a resolution.