Created
August 25, 2011 09:33
mt coverage on Wikipedia corpus (split into 4 parts)
/Users/sagiemaoz/Projects/gsoc/apertium-mt-he [git::master *] [sagiemaoz@sagiem-mac] [12:31]
$ cat ../maltese/wikipedia/xaa | sh dev/coverage.sh mt-he.automorf.bin
Number of tokenised words in the corpus: 37016
Number of known words in the corpus: 29006
Coverage: 78.4 %
Top unknown words in the corpus:
41 ^Renju/*Renju$
37 ^Unit/*Unit$
30 ^Kavallieri/*Kavallieri$
23 ^de/*de$
22 ^Vassalli/*Vassalli$
22 ^Imperu/*Imperu$
17 ^sehem/*sehem$
16 ^naħa/*naħa$
15 ^Ġove/*Ġove$
15 ^għalkemm/*għalkemm$
Tokens needed to get 80 %: 606.8 . Corresponding wordlist in /tmp/corpus-stat-needed.txt
/Users/sagiemaoz/Projects/gsoc/apertium-mt-he [git::master *] [sagiemaoz@sagiem-mac] [12:32]
$ cat ../maltese/wikipedia/xab | sh dev/coverage.sh mt-he.automorf.bin
Number of tokenised words in the corpus: 34658
Number of known words in the corpus: 26542
Coverage: 76.6 %
Top unknown words in the corpus:
134 ^Opossum/*Opossum$
123 ^Mammalia/*Mammalia$
122 ^Theria/*Theria$
118 ^opossumi/*opossumi$
118 ^Didelphimorphia/*Didelphimorphia$
118 ^Didelphidae/*Didelphidae$
69 ^didelfimorfju/*didelfimorfju$
69 ^didelfidu/*didelfidu$
66 ^ispeċi/*ispeċi$
63 ^Plesiadapiformes/*Plesiadapiformes$
Tokens needed to get 80 %: 1184.4 . Corresponding wordlist in /tmp/corpus-stat-needed.txt
/Users/sagiemaoz/Projects/gsoc/apertium-mt-he [git::master *] [sagiemaoz@sagiem-mac] [12:32]
$ cat ../maltese/wikipedia/xac | sh dev/coverage.sh mt-he.automorf.bin
Number of tokenised words in the corpus: 37181
Number of known words in the corpus: 29593
Coverage: 79.6 %
Top unknown words in the corpus:
50 ^skorja/*skorja$
35 ^Cup/*Cup$
27 ^FA/*FA$
25 ^de/*de$
25 ^Maja/*Maja$
24 ^tmiem/*tmiem$
20 ^reġa/*reġa$
20 ^naħa/*naħa$
19 ^ġebel/*ġebel$
19 ^City/*City$
Tokens needed to get 80 %: 151.8 . Corresponding wordlist in /tmp/corpus-stat-needed.txt
/Users/sagiemaoz/Projects/gsoc/apertium-mt-he [git::master *] [sagiemaoz@sagiem-mac] [12:33]
$ cat ../maltese/wikipedia/xad | sh dev/coverage.sh mt-he.automorf.bin
Number of tokenised words in the corpus: 37453
Number of known words in the corpus: 29372
Coverage: 78.4 %
Top unknown words in the corpus:
46 ^skorja/*skorja$
43 ^album/*album$
39 ^tiċċelebra/*tiċċelebra$
30 ^naħal/*naħal$
27 ^de/*de$
22 ^età/*età$
21 ^tmiem/*tmiem$
21 ^Real/*Real$
17 ^ossiġenu/*ossiġenu$
17 ^għadu/*għadu$
Tokens needed to get 80 %: 590.4 . Corresponding wordlist in /tmp/corpus-stat-needed.txt
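
The figures in the four runs above can be cross-checked with a little arithmetic. The sketch below is not taken from dev/coverage.sh; it assumes that coverage is known / total and that "Tokens needed to get 80 %" is 0.8 * total - known, which is what the logged numbers are consistent with. The names RUNS, coverage, tokens_needed and summarise are hypothetical helpers introduced only for this illustration. It also combines the four parts into one overall figure.

```python
# Figures copied from the four runs in the log: (part, tokenised words, known words).
RUNS = [
    ("xaa", 37016, 29006),
    ("xab", 34658, 26542),
    ("xac", 37181, 29593),
    ("xad", 37453, 29372),
]

TARGET = 0.80  # the 80 % threshold mentioned in the log


def coverage(total, known):
    """Percentage of tokens that received an analysis."""
    return 100.0 * known / total


def tokens_needed(total, known, target=TARGET):
    """Additional known tokens required to reach the target.

    Assumed formula (inferred from the logged numbers, not read from coverage.sh).
    """
    return target * total - known


def summarise(runs):
    """Return (name, coverage %, tokens needed) per part, plus an 'all' row."""
    rows = [
        (name, round(coverage(t, k), 1), round(tokens_needed(t, k), 1))
        for name, t, k in runs
    ]
    total = sum(t for _, t, _ in runs)
    known = sum(k for _, _, k in runs)
    rows.append(
        ("all", round(coverage(total, known), 1), round(tokens_needed(total, known), 1))
    )
    return rows


if __name__ == "__main__":
    for name, cov, need in summarise(RUNS):
        print(f"{name}: coverage {cov} %, tokens needed for 80 %: {need}")
```

Under these assumptions the per-part rows reproduce the logged values (for example 78.4 % and 606.8 for xaa), and the combined corpus of 146308 tokens comes out at roughly 78.3 % coverage.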