@leondz
Created July 25, 2013 18:52
Comparing the results over Ritter's twitter tagging dataset with Owoputi et al.'s NAACL 2013 paper
cf. https://gist.github.com/brendano/6070886
The Ritter dataset is small and single-annotator, and there are
arguments against using the PTB tagset on this genre. As Twitter PoS
tagging was difficult, we took a principled approach to improving it,
based on empirical investigations and error analysis, which form a core
part of the work.
Further, as twitter pos-labelled linguistic resources are scarce and
annotated according to heterogeneous schemes, we developed a simple
bootstrapping method for building high-confidence datasets automatically.
As we needed annotated data for our investigation, we took random
splits of the already-small corpus at document (tweet) level.
Critically, we had a development split and a held-out evaluation
split of just 2,291 tokens.
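(Purely as an illustration of what a document-level split means - this is
not our actual procedure, and the proportions are an assumption:)

# hypothetical sketch of a tweet-level (document-level) random split;
# the 70/15/15 proportions are assumed for illustration only
split_tweets <- function(tweets, p_train=0.70, p_dev=0.15) {
  idx <- sample(length(tweets))
  n_train <- round(p_train * length(tweets))
  n_dev   <- round(p_dev   * length(tweets))
  list(train = tweets[idx[1:n_train]],
       dev   = tweets[idx[(n_train + 1):(n_train + n_dev)]],
       eval  = tweets[idx[(n_train + n_dev + 1):length(tweets)]])
}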
We found that, although Ritter reported ~88.3% token-level accuracy
using four-fold cross-validation, their tagger only reached 84.6% on this
evaluation split when trained on the same data as ours. This indicates that
the split is not representative of the dataset, being instead more
challenging - a good acid test, given that splitting the corpus
is required. Despite the wider CI that comes with this set's small size,
our improvement is strong - note the minuscule p-value (the R session
below builds per-token 0/1 correctness vectors from the reported
accuracies and token count, then runs a one-sided Welch test).
> derczynski_t_eval={p=0.8869;n=2291; c(rep(1,round(p*n)), rep(0,round((1-p)*n)))}
> ritter_t_eval={p=0.8455;n=2291; c(rep(1,round(p*n)), rep(0,round((1-p)*n)))}
> t.test(derczynski_t_eval,ritter_t_eval,alt='greater')
Welch Two Sample t-test
data: derczynski_t_eval and ritter_t_eval
t = 4.1295, df = 4502.115, p-value = 1.851e-05
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
0.02494613 Inf
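(As a sanity check under the same assumptions - the reported accuracies and
the 2,291-token count above - the comparison could also be run directly as a
one-sided two-proportion test, e.g.:)
> prop.test(x=round(c(0.8869, 0.8455)*2291), n=c(2291, 2291), alt='greater')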
Where the Owoputi work would land under these conditions is unclear. Also,
given the magnitude of the variation between splits, an estimate taken over
the whole dataset is not necessarily representative of performance on any
particular split. It may indeed be that our eval split contained unusual
and informative examples. The dataset is small, but must be partitioned,
and is certainly quite tainted after this work.
The evaluation in the Owoputi NAACL paper, which crossed with our
paper en route to publication, is over the entire dataset. This reduces
comparability in exchange for confidence: we already have a strong
indication that our eval split is "harder" than the whole thing.
Also, Section 6.2 suggests a slight difference in the available
training data (70% vs. 75%), though the mere suggestion that this would
make a discernible difference - and it's plausible it would - speaks
more to the insufficiency of the data volume we're dealing with.
Take, for example, the Derczynski tagger's performance on the development
set, against the Owoputi figure over the full dataset:
> owoputi={p=.9;n=15185; c(rep(1,round(p*n)), rep(0,round((1-p)*n)))}
> derczynski_t_dev={p=.9054;n=2232; c(rep(1,round(p*n)), rep(0,round((1-p)*n)))}
> t.test(owoputi,derczynski_t_dev,alt='greater')
data: owoputi and derczynski_t_dev
t = -0.8173, df = 2963.051, p-value = 0.7931
alternative hypothesis: true difference in means is greater than 0
It's quite a difference from the eval set. And check out the margins:
95 percent confidence interval:
-0.01639009 Inf
The first result's p-value is unsatisfactory and the CI is large - but we
have no workaround for this right now, and the argument that more annotated
data is needed has already been made. However, the result is markedly
different from the picture when examining T-eval vs. Owoputi, as one might
expect after comparing other prior systems' performance on the eval split.
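(To make the CI-width point concrete, a purely illustrative comparison - not
from the paper - of the 95% binomial CI for the same accuracy at the two
sample sizes used above:)
> binom.test(round(.9054*2232), 2232)$conf.int    # dev-split size
> binom.test(round(.9054*15185), 15185)$conf.int  # whole-dataset size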
A part-of-speech tagged tweet dataset with annotator agreement is
available from DCU (Foster et al.). However, the tokenisation and tag
selection rules differ here, and the data seems biased toward well-formed
utterances, which is in line with the goals of the research that
produced it: parsing tweets. So, while comparison could be performed
over this entire dataset (reducing CIs to something more pleasant),
it may not be meaningful.
In any event, we continued and performed this evaluation, with roughly
the same results (though the sentence-level improvement was smaller, as
one might guess given the corpus-creation bias). This seems a reasonable
opportunity to advocate the reporting of whole-sentence accuracy rates
in PoS tagging: for arguments concerning this, see the Manning reference
in our paper.
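(For concreteness, a minimal sketch of the metric, over hypothetical
gold/predicted tag vectors with a per-token sentence id - not code from our
experiments:)

# whole-sentence accuracy: a sentence counts as correct only if
# every one of its tokens is tagged correctly
sentence_accuracy <- function(gold, pred, sent_id) {
  mean(tapply(gold == pred, sent_id, all))
}
# token-level accuracy, for comparison
token_accuracy <- function(gold, pred) mean(gold == pred)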
Regarding the linguistic aspects of designing PoS tagging for tweets:
it's almost tempting to throw the whole thing away, induce a new tagset,
and go from there - but there are so many cases that look properly
structured, and thus could be better processed based on current knowledge.
Switching tagsets like this severs links with existing resources. However,
a custom,
condensed tagset is bound to be easier to label with and provides less
sparse data to downstream tools (often desirable). Is this the first
time a genre has spawned its own tagset? It's exciting new ground.
To conclude, we are short on data. The eval split is quite different
from the overall dataset, and a comparison between 4-fold XV and this
split is, for the reasons stated above and in the paper, neither
scientifically rigorous nor appropriate. However, it is clear we don't
have a sufficiently high-quality, large resource to yield comparisons
with very low p-values either, which removes the most satisfactory
route to a resolution.