Radim Řehůřek piskvorky

$ easy_install gensim
Searching for gensim
Reading https://pypi.python.org/simple/gensim/
Best match: gensim 0.10.0rc1
Downloading https://pypi.python.org/packages/source/g/gensim/gensim-0.10.0rc1.tar.gz#md5=6bb7cad2ab922dbbcb8ffb0d876f83c7
Processing gensim-0.10.0rc1.tar.gz
Writing /var/folders/wy/80_9ndgx1pv2x5xgvyk0tq5r0000gn/T/easy_install-B4mVYJ/gensim-0.10.0rc1/setup.cfg
Running gensim-0.10.0rc1/setup.py -q bdist_egg --dist-dir /var/folders/wy/80_9ndgx1pv2x5xgvyk0tq5r0000gn/T/easy_install-B4mVYJ/gensim-0.10.0rc1/egg-dist-tmp-Zwfuwg
warning: no files found matching '*.sh' under directory '.'
no previously-included directories found matching 'docs/src*'
$ pip -v -v -v install --pre gensim
Downloading/unpacking gensim
Getting page https://pypi.python.org/simple/gensim/
URLs to search for versions for gensim:
* https://pypi.python.org/simple/gensim/
Analyzing links from page https://pypi.python.org/simple/gensim/
Skipping link https://pypi.python.org/packages/2.5/g/gensim/gensim-0.2-py2.5.egg#md5=6cd22bc391fb8e7620b6d5aa0b316a5a (from https://pypi.python.org/simple/gensim/); unknown archive format: .egg
Skipping link https://pypi.python.org/packages/2.5/g/gensim/gensim-0.3.0-py2.5.egg#md5=a2d0ef0fb9b4a6d7224ec102ddfb6670 (from https://pypi.python.org/simple/gensim/); unknown archive format: .egg
Skipping link https://pypi.python.org/packages/2.5/g/gensim/gensim-0.4-py2.5.egg#md5=c82cbd35bf6b686dd93048ed6c80ab70 (from https://pypi.python.org/simple/gensim/); unknown archive format: .egg
Skipping link https://pypi.python.org/packages/2.5/g/gensim/gensim-0.4.1-py2.5.egg#md5=fbfb31e1da91fc9249e59f42f3030431 (from https://pypi.python.org/sim
$ pip freeze
Bottleneck==0.7.0
CherryPy==3.2.4
Cython==0.19.1
Jinja2==2.6
Markdown==2.3.1
NearPy==0.1.2
Pattern==2.6
PyTrie==0.2
PyYAML==3.10
import numpy
# load a 10k x 10k array (800MB) previously stored with
# numpy.save('/tmp/x.npy', numpy.random.rand(10000, 10000))
shared = [numpy.load('/tmp/x.npy', mmap_mode='r') for _ in range(10)]
# touch all elements in all 10 copies of the same mmap'ed array
print([x.sum() for x in shared])
# ...resident/real mem spikes at 10x 800MB, nothing shared?
from cpython cimport PyCObject_AsVoidPtr
from scipy.linalg.blas import fblas
from libc.math cimport fabs
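# function-pointer type for a BLAS sdot that returns single precision (float)
ctypedef float (*sdot_ptr) (const int *N, const float *X, const int *incX, const float *Y, const int *incY) nogil
cdef sdot_ptr sdot=<sdot_ptr>PyCObject_AsVoidPtr(fblas.sdot._cpointer)
# same arguments, but the return type is declared as double
ctypedef double (*dsdot_ptr) (const int *N, const float *X, const int *incX, const float *Y, const int *incY) nogil
# note: this takes the same fblas.sdot pointer as above and casts it to the double-returning type
cdef dsdot_ptr dsdot=<dsdot_ptr>PyCObject_AsVoidPtr(fblas.sdot._cpointer)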
how to count "self-citations"? =>
for each article A:
    for each reference B cited by article A:
        if the authors of B and the authors of A have a non-empty intersection (= at least one author appears in both A and B), add B to the "set of self-citations of article A"
---
then, when displaying article A, we can also show its number of self-citations = the size of the "set of self-citations of A"
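The self-citation rule above could be sketched in Python roughly as follows. This is only an illustration: the `articles` dict layout (article id mapped to `authors` and `references` lists), the function name `self_citations`, and the sample data are all assumptions made up for the example, not part of the original note. References that point outside the collection are skipped, since their authors are unknown.

```python
def self_citations(articles):
    """Map each article id to the set of its references sharing an author with it.

    `articles` is a hypothetical dict: id -> {"authors": [...], "references": [...]}.
    """
    result = {}
    for art_id, article in articles.items():
        authors_a = set(article["authors"])
        cited = set()
        for ref_id in article["references"]:
            ref = articles.get(ref_id)
            if ref is None:
                continue  # reference not in the collection: authors unknown, skip it
            if authors_a & set(ref["authors"]):  # non-empty author intersection
                cited.add(ref_id)
        result[art_id] = cited
    return result


# made-up sample data, for illustration only
articles = {
    "A": {"authors": ["Novak", "Svoboda"], "references": ["B", "C", "X"]},
    "B": {"authors": ["Svoboda", "Dvorak"], "references": []},
    "C": {"authors": ["Cerny"], "references": ["B"]},
}

# number of self-citations per article = size of its self-citation set
counts = {art_id: len(refs) for art_id, refs in self_citations(articles).items()}
print(counts)
```

Keeping the full set per article, rather than only a counter, means the count shown next to an article is just the set's size, and the self-cited references themselves remain available if they need to be listed.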
{
"from": 0,
"size": 100,
"query": {
"bool": {
"must_not": {
"terms": {
"prefix1": [
"a",
"b",
>>> mm2 = gensim.corpora.MmCorpus(bz2.BZ2File('./enwiki_bow.mm.bz2'))
INFO : initializing corpus reader from <bz2.BZ2File object at 0x1168d988>
INFO : accepted corpus with 3533010 documents, 50000 features, 525892746 non-zero entries
>>> hdp = gensim.models.HdpModel(corpus=mm, id2word=id2word, outputdir='/net/sojka-local/xrehurek/wiki/', chunksize=2048)

... some 18 hours later (single core):