Created May 25, 2015 11:45
Creating a co-occurrence matrix in Python using scipy.sparse.coo_matrix
import codecs
from scipy import sparse

def create_cooccurrence_matrix(filename, tokenizer, window_size):
    # Maps each token to its row/column index in the matrix.
    vocabulary = {}
    data = []
    row = []
    col = []
    # The input file is read as UTF-8, one sentence per line.
    with codecs.open(filename, "r", "utf-8") as f:
        for sentence in f:
            sentence = sentence.strip()
            tokens = [token for token in tokenizer(sentence) if token != u""]
            for pos, token in enumerate(tokens):
                i = vocabulary.setdefault(token, len(vocabulary))
                # Context window: up to window_size tokens on each side of pos.
                start = max(0, pos - window_size)
                end = min(len(tokens), pos + window_size + 1)
                for pos2 in range(start, end):
                    if pos2 == pos:
                        continue
                    j = vocabulary.setdefault(tokens[pos2], len(vocabulary))
                    data.append(1.)
                    row.append(i)
                    col.append(j)
    # Pass the shape explicitly so the matrix is always vocabulary-sized and square.
    n = len(vocabulary)
    cooccurrence_matrix = sparse.coo_matrix((data, (row, col)), shape=(n, n))
    return vocabulary, cooccurrence_matrix
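The function appends a separate 1.0 for every co-occurrence it sees, so the same (row, col) pair can appear many times in the triplet lists. The sketch below, with made-up indices that are not from the gist, shows how scipy's COO format treats such duplicates: they are stored as-is and summed only when the matrix is converted to another format.

```python
from scipy import sparse

# Illustrative triplets: the pair (0, 1) is recorded twice.
data = [1.0, 1.0, 1.0]
row = [0, 0, 1]
col = [1, 1, 0]

coo = sparse.coo_matrix((data, (row, col)), shape=(2, 2))

# Converting to CSR or to a dense array sums the duplicate entries.
csr = coo.tocsr()
dense = coo.toarray()

print(coo.nnz)  # stored entries in COO, duplicates included
print(csr.nnz)  # stored entries after duplicates are merged
print(dense)
```

So the values become actual counts only after a conversion such as `tocsr()` or `toarray()`.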
Hi!
Could you please show how you use this function in code?
Thank you!
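One possible way to call it, as a minimal sketch rather than the author's own usage. It assumes Python 3, so the function is restated with `range` and the built-in `open`. The whitespace tokenizer and the two-line corpus are hypothetical stand-ins. The file is expected to hold one sentence per line.

```python
import os
import tempfile

from scipy import sparse


def create_cooccurrence_matrix(filename, tokenizer, window_size):
    # Restated from the gist for Python 3 (range, built-in open, explicit shape).
    vocabulary = {}
    data, row, col = [], [], []
    with open(filename, "r", encoding="utf-8") as f:
        for sentence in f:
            tokens = [t for t in tokenizer(sentence.strip()) if t != ""]
            for pos, token in enumerate(tokens):
                i = vocabulary.setdefault(token, len(vocabulary))
                start = max(0, pos - window_size)
                end = min(len(tokens), pos + window_size + 1)
                for pos2 in range(start, end):
                    if pos2 == pos:
                        continue
                    j = vocabulary.setdefault(tokens[pos2], len(vocabulary))
                    data.append(1.0)
                    row.append(i)
                    col.append(j)
    n = len(vocabulary)
    return vocabulary, sparse.coo_matrix((data, (row, col)), shape=(n, n))


def whitespace_tokenizer(sentence):
    # Hypothetical tokenizer for this sketch: split on whitespace.
    return sentence.split()


# Write a tiny corpus to a temporary file, one sentence per line.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("the cat sat\nthe cat ran\n")
    corpus_path = tmp.name

vocabulary, matrix = create_cooccurrence_matrix(
    corpus_path, whitespace_tokenizer, window_size=1
)
os.remove(corpus_path)

# Convert to CSR so repeated pairs are summed and entries can be indexed.
counts = matrix.tocsr()
the, cat, sat = vocabulary["the"], vocabulary["cat"], vocabulary["sat"]
print(counts[the, cat])  # "the" and "cat" are adjacent in both sentences
print(counts[cat, sat])  # adjacent in the first sentence only
print(counts[the, sat])  # never within one token of each other
```

`vocabulary` maps each token to its row and column index, so a count is looked up as `counts[vocabulary[a], vocabulary[b]]`. A larger `window_size` counts tokens that are farther apart.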