@n0nick
Created August 25, 2011 09:33
mt coverage on wikipedia corpus (split to 4 parts)
/Users/sagiemaoz/Projects/gsoc/apertium-mt-he [git::master *] [sagiemaoz@sagiem-mac] [12:31]
$ cat ../maltese/wikipedia/xaa | sh dev/coverage.sh mt-he.automorf.bin
Number of tokenised words in the corpus: 37016
Number of known words in the corpus: 29006
Coverage: 78.4 %
Top unknown words in the corpus:
41 ^Renju/*Renju$
37 ^Unit/*Unit$
30 ^Kavallieri/*Kavallieri$
23 ^de/*de$
22 ^Vassalli/*Vassalli$
22 ^Imperu/*Imperu$
17 ^sehem/*sehem$
16 ^naħa/*naħa$
15 ^Ġove/*Ġove$
15 ^għalkemm/*għalkemm$
Tokens needed to get 80 %: 606.8 . Corresponding wordlist in /tmp/corpus-stat-needed.txt
/Users/sagiemaoz/Projects/gsoc/apertium-mt-he [git::master *] [sagiemaoz@sagiem-mac] [12:32]
$ cat ../maltese/wikipedia/xab | sh dev/coverage.sh mt-he.automorf.bin
Number of tokenised words in the corpus: 34658
Number of known words in the corpus: 26542
Coverage: 76.6 %
Top unknown words in the corpus:
134 ^Opossum/*Opossum$
123 ^Mammalia/*Mammalia$
122 ^Theria/*Theria$
118 ^opossumi/*opossumi$
118 ^Didelphimorphia/*Didelphimorphia$
118 ^Didelphidae/*Didelphidae$
69 ^didelfimorfju/*didelfimorfju$
69 ^didelfidu/*didelfidu$
66 ^ispeċi/*ispeċi$
63 ^Plesiadapiformes/*Plesiadapiformes$
Tokens needed to get 80 %: 1184.4 . Corresponding wordlist in /tmp/corpus-stat-needed.txt
/Users/sagiemaoz/Projects/gsoc/apertium-mt-he [git::master *] [sagiemaoz@sagiem-mac] [12:32]
$ cat ../maltese/wikipedia/xac | sh dev/coverage.sh mt-he.automorf.bin
Number of tokenised words in the corpus: 37181
Number of known words in the corpus: 29593
Coverage: 79.6 %
Top unknown words in the corpus:
50 ^skorja/*skorja$
35 ^Cup/*Cup$
27 ^FA/*FA$
25 ^de/*de$
25 ^Maja/*Maja$
24 ^tmiem/*tmiem$
20 ^reġa/*reġa$
20 ^naħa/*naħa$
19 ^ġebel/*ġebel$
19 ^City/*City$
Tokens needed to get 80 %: 151.8 . Corresponding wordlist in /tmp/corpus-stat-needed.txt
/Users/sagiemaoz/Projects/gsoc/apertium-mt-he [git::master *] [sagiemaoz@sagiem-mac] [12:33]
$ cat ../maltese/wikipedia/xad | sh dev/coverage.sh mt-he.automorf.bin
Number of tokenised words in the corpus: 37453
Number of known words in the corpus: 29372
Coverage: 78.4 %
Top unknown words in the corpus:
46 ^skorja/*skorja$
43 ^album/*album$
39 ^tiċċelebra/*tiċċelebra$
30 ^naħal/*naħal$
27 ^de/*de$
22 ^età/*età$
21 ^tmiem/*tmiem$
21 ^Real/*Real$
17 ^ossiġenu/*ossiġenu$
17 ^għadu/*għadu$
Tokens needed to get 80 %: 590.4 . Corresponding wordlist in /tmp/corpus-stat-needed.txt
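
The per-part figures above can be combined into one number for the whole corpus. The sketch below is not part of dev/coverage.sh. It assumes coverage is known / tokenised and that "Tokens needed to get 80 %" is 0.80 * tokenised - known, which reproduces all four runs in the log but is inferred from the numbers rather than read from the script. The names PARTS, coverage, tokens_needed and summarise are hypothetical.

```python
# (tokenised words, known words) for each corpus part, copied from the log.
PARTS = {
    "xaa": (37016, 29006),
    "xab": (34658, 26542),
    "xac": (37181, 29593),
    "xad": (37453, 29372),
}

TARGET = 0.80  # the 80 % threshold that the script reports against


def coverage(total, known):
    """Coverage as a percentage of tokenised words."""
    return 100.0 * known / total


def tokens_needed(total, known, target=TARGET):
    """Extra known tokens required to reach the target (assumed formula)."""
    return max(0.0, target * total - known)


def summarise(parts):
    """Return {name: (coverage %, tokens needed)}, plus an 'all' row."""
    rows = {
        name: (coverage(t, k), tokens_needed(t, k))
        for name, (t, k) in parts.items()
    }
    total = sum(t for t, _ in parts.values())
    known = sum(k for _, k in parts.values())
    rows["all"] = (coverage(total, known), tokens_needed(total, known))
    return rows


if __name__ == "__main__":
    for name, (cov, needed) in summarise(PARTS).items():
        print(f"{name}: coverage {cov:.1f} %, tokens needed for 80 %: {needed:.1f}")
```

Under these assumptions the four parts together hold 146308 tokenised words, of which 114513 are known. The "all" row is weighted by part size, so it is not the plain mean of the four percentages.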