Created
July 8, 2014 17:27
$ python -m gensim.scripts.make_wiki ~/data/wiki/simplewiki-20140623-pages-articles.xml.bz2 simplewiki_en
2014-07-08 18:44:22,009 : INFO : running /Volumes/work/workspace/gensim/trunk/gensim/scripts/make_wiki.py /Users/kofola/data/wiki/simplewiki-20140623-pages-articles.xml.bz2 simplewiki_en
2014-07-08 18:44:22,162 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2014-07-08 18:44:48,429 : INFO : adding document #10000 to Dictionary(116699 unique tokens: [u'fawn', u'refreshable', u'idaira', u'clottey', u'gavar']...)
2014-07-08 18:45:05,198 : INFO : adding document #20000 to Dictionary(159070 unique tokens: [u'fawn', u'biennials', u'\u03c9\u0431\u0440\u0430\u0434\u043e\u0432\u0430\u043d\u043d\u0430\u0467', u'refreshable', u'grandniece']...)
2014-07-08 18:45:19,946 : INFO : adding document #30000 to Dictionary(198077 unique tokens: [u'biennials', u'idaira', u'clottey', u'gavar', u'experimeter']...)
2014-07-08 18:45:37,237 : INFO : adding document #40000 to Dictionary(232401 unique tokens: [u'biennials', u'sowela', u'mdbg', u'clottes', u'idaira']...)
2014-07-08 18:45:53,758 : INFO : adding document #50000 to Dictionary(261720 unique tokens: [u'biennials', u'sowela', u'mdbg', u'clottes', u'vang']...)
2014-07-08 18:46:12,792 : INFO : adding document #60000 to Dictionary(288641 unique tokens: [u'biennials', u'sowela', u'mdbg', u'clottes', u'klatki']...)
2014-07-08 18:46:33,571 : INFO : adding document #70000 to Dictionary(326692 unique tokens: [u'biennials', u'sowela', u'mdbg', u'clottes', u'klatki']...)
2014-07-08 18:46:51,268 : INFO : adding document #80000 to Dictionary(358238 unique tokens: [u'biennials', u'sowela', u'mdbg', u'biysk', u'sermersheim']...)
2014-07-08 18:47:08,034 : INFO : adding document #90000 to Dictionary(391235 unique tokens: [u'biennials', u'sowela', u'mdbg', u'biysk', u'sermersheim']...)
2014-07-08 18:47:19,986 : INFO : adding document #100000 to Dictionary(403563 unique tokens: [u'biennials', u'sowela', u'mdbg', u'biysk', u'sermersheim']...)
2014-07-08 18:47:32,656 : INFO : adding document #110000 to Dictionary(417230 unique tokens: [u'biennials', u'sowela', u'mdbg', u'biysk', u'sermersheim']...)
2014-07-08 18:47:34,601 : INFO : finished iterating over Wikipedia corpus of 111516 documents with 18341931 positions (total 193436 articles, 19530345 positions before pruning articles shorter than 50 words)
2014-07-08 18:47:34,601 : INFO : built Dictionary(419205 unique tokens: [u'biennials', u'sowela', u'mdbg', u'biysk', u'sermersheim']...) from 111516 documents (total 18341931 corpus positions)
2014-07-08 18:47:34,847 : INFO : keeping 28139 tokens which were in no less than 20 and no more than 11151 (=10.0%) documents
2014-07-08 18:47:35,165 : INFO : resulting dictionary: Dictionary(28139 unique tokens: [u'fawn', u'schlegel', u'sonja', u'woods', u'spiders']...)
2014-07-08 18:47:35,166 : INFO : storing corpus in Matrix Market format to simplewiki_en_bow.mm
2014-07-08 18:47:35,168 : INFO : saving sparse matrix to simplewiki_en_bow.mm
2014-07-08 18:47:35,326 : INFO : PROGRESS: saving document #0
2014-07-08 18:48:10,303 : INFO : PROGRESS: saving document #10000
2014-07-08 18:48:33,232 : INFO : PROGRESS: saving document #20000
2014-07-08 18:48:53,415 : INFO : PROGRESS: saving document #30000
2014-07-08 18:49:16,726 : INFO : PROGRESS: saving document #40000
2014-07-08 18:49:40,372 : INFO : PROGRESS: saving document #50000
2014-07-08 18:50:01,413 : INFO : PROGRESS: saving document #60000
2014-07-08 18:50:20,544 : INFO : PROGRESS: saving document #70000
2014-07-08 18:50:35,360 : INFO : PROGRESS: saving document #80000
2014-07-08 18:50:50,749 : INFO : PROGRESS: saving document #90000
2014-07-08 18:51:03,581 : INFO : PROGRESS: saving document #100000
2014-07-08 18:51:17,333 : INFO : PROGRESS: saving document #110000
2014-07-08 18:51:19,489 : INFO : finished iterating over Wikipedia corpus of 111516 documents with 18341931 positions (total 193436 articles, 19530345 positions before pruning articles shorter than 50 words)
2014-07-08 18:51:19,489 : INFO : saved 111516x28139 matrix, density=0.200% (6284690/3137948724)
2014-07-08 18:51:19,490 : INFO : saving MmCorpus index to simplewiki_en_bow.mm.index
2014-07-08 18:51:19,517 : INFO : saving dictionary mapping to simplewiki_en_wordids.txt.bz2
2014-07-08 18:51:20,147 : INFO : loaded corpus index from simplewiki_en_bow.mm.index
2014-07-08 18:51:20,147 : INFO : initializing corpus reader from simplewiki_en_bow.mm
2014-07-08 18:51:20,147 : INFO : accepted corpus with 111516 documents, 28139 features, 6284690 non-zero entries
2014-07-08 18:51:20,147 : INFO : collecting document frequencies
2014-07-08 18:51:20,153 : INFO : PROGRESS: processing document #0
2014-07-08 18:51:29,844 : INFO : PROGRESS: processing document #10000
2014-07-08 18:51:35,779 : INFO : PROGRESS: processing document #20000
2014-07-08 18:51:40,750 : INFO : PROGRESS: processing document #30000
2014-07-08 18:51:45,936 : INFO : PROGRESS: processing document #40000
2014-07-08 18:51:50,114 : INFO : PROGRESS: processing document #50000
2014-07-08 18:51:55,111 : INFO : PROGRESS: processing document #60000
2014-07-08 18:52:00,033 : INFO : PROGRESS: processing document #70000
2014-07-08 18:52:03,664 : INFO : PROGRESS: processing document #80000
2014-07-08 18:52:07,419 : INFO : PROGRESS: processing document #90000
2014-07-08 18:52:10,415 : INFO : PROGRESS: processing document #100000
2014-07-08 18:52:13,681 : INFO : PROGRESS: processing document #110000
2014-07-08 18:52:14,228 : INFO : calculating IDF weights for 111516 documents and 28138 features (6284690 matrix non-zeros)
2014-07-08 18:52:14,256 : INFO : storing corpus in Matrix Market format to simplewiki_en_tfidf.mm
2014-07-08 18:52:14,256 : INFO : saving sparse matrix to simplewiki_en_tfidf.mm
2014-07-08 18:52:14,264 : INFO : PROGRESS: saving document #0
2014-07-08 18:52:35,928 : INFO : PROGRESS: saving document #10000
2014-07-08 18:52:49,482 : INFO : PROGRESS: saving document #20000
2014-07-08 18:53:00,824 : INFO : PROGRESS: saving document #30000
2014-07-08 18:53:12,513 : INFO : PROGRESS: saving document #40000
2014-07-08 18:53:21,943 : INFO : PROGRESS: saving document #50000
2014-07-08 18:53:33,094 : INFO : PROGRESS: saving document #60000
2014-07-08 18:53:44,313 : INFO : PROGRESS: saving document #70000
2014-07-08 18:53:52,553 : INFO : PROGRESS: saving document #80000
2014-07-08 18:54:01,061 : INFO : PROGRESS: saving document #90000
2014-07-08 18:54:07,902 : INFO : PROGRESS: saving document #100000
2014-07-08 18:54:15,407 : INFO : PROGRESS: saving document #110000
2014-07-08 18:54:16,649 : INFO : saved 111516x28139 matrix, density=0.200% (6284690/3137948724)
2014-07-08 18:54:16,650 : INFO : saving MmCorpus index to simplewiki_en_tfidf.mm.index
2014-07-08 18:54:16,675 : INFO : finished running make_wiki.py
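
The matrix density and the document-frequency cutoff reported in the log can be re-derived from the counts it prints. The snippet below is only a sanity-check sketch, not part of `make_wiki.py`: the variable names are made up for illustration, and the `0.10` fraction is an assumption read off the "(=10.0%)" in the log rather than taken from the script's source.

```python
# Counts copied from the log lines above.
num_docs = 111516       # documents kept after pruning short articles
num_terms = 28139       # tokens kept after dictionary filtering
num_nonzero = 6284690   # non-zero entries in the saved sparse matrix

# Density = non-zero cells / all cells in the document-term matrix.
cells = num_docs * num_terms
density = num_nonzero / cells
print(f"{num_docs}x{num_terms} matrix, {cells} cells, density={density:.3%}")

# Average number of distinct kept tokens per document.
avg_terms_per_doc = num_nonzero / num_docs
print(f"average distinct tokens per document: {avg_terms_per_doc:.1f}")

# Upper document-frequency cutoff, assuming the filter keeps tokens that
# appear in at most 10% of documents (hypothetical parameter name).
no_above = 0.10
max_docs = int(no_above * num_docs)
print(f"upper document-frequency cutoff: {max_docs}")
```

With these inputs the cell count and the cutoff agree with the `6284690/3137948724` and `11151` figures in the log.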