@StrikingLoo
Created October 25, 2019 04:11
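from scipy.sparse import dok_matrix

# corpus_words, distinct_words and word_idx_dict are not defined in this snippet and must already exist.
k = 2  # number of consecutive words per state; adjustable

# Every run of k consecutive words that has a following word, joined by spaces.
sets_of_k_words = [' '.join(corpus_words[i:i+k]) for i in range(len(corpus_words) - k)]

distinct_sets_of_k_words = list(set(sets_of_k_words))
sets_count = len(distinct_sets_of_k_words)
k_words_idx_dict = {word: i for i, word in enumerate(distinct_sets_of_k_words)}

# Rows: distinct k-word sequences; columns: distinct words.
next_after_k_words_matrix = dok_matrix((sets_count, len(distinct_words)))

# Count how often each word follows each k-word sequence (corpus_words[i+k] always exists here).
for i, word in enumerate(sets_of_k_words):
    word_sequence_idx = k_words_idx_dict[word]
    next_word_idx = word_idx_dict[corpus_words[i+k]]
    next_after_k_words_matrix[word_sequence_idx, next_word_idx] += 1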
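
The snippet above relies on names defined elsewhere, so it cannot be run on its own. As a rough, self-contained sketch of the same idea (counting which word follows each run of k words, then sampling from those counts), the example below uses a made-up toy corpus and keeps the counts in a `collections.Counter` per k-word sequence instead of a SciPy `dok_matrix`, so it needs only the standard library. The names `next_word_counts`, `sample_next_word`, and `generate` are hypothetical and not part of the gist.

```python
import random
from collections import Counter, defaultdict

# Hypothetical toy corpus; the gist assumes corpus_words already exists.
corpus_words = "the cat sat on the mat and the cat sat on the rug".split()

k = 2  # number of consecutive words per state

# Count which word follows each run of k consecutive words.
# This plays the role of next_after_k_words_matrix, stored as nested counters.
next_word_counts = defaultdict(Counter)
for i in range(len(corpus_words) - k):
    k_words = ' '.join(corpus_words[i:i + k])
    next_word_counts[k_words][corpus_words[i + k]] += 1


def sample_next_word(k_words, rng=random):
    """Draw a follower of k_words with probability proportional to its count."""
    counts = next_word_counts[k_words]
    words = list(counts)
    weights = [counts[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]


def generate(seed, length=8, rng=random):
    """Extend a k-word seed by repeatedly sampling from the counts."""
    words = seed.split()
    for _ in range(length):
        state = ' '.join(words[-k:])
        if state not in next_word_counts:  # state was never seen with a follower
            break
        words.append(sample_next_word(state, rng))
    return ' '.join(words)


print(generate("the cat", length=6))
```

With the sparse matrix from the gist, the equivalent sampling step would read one row, normalize it to probabilities, and draw a column index; the nested counters here trade that layout for simplicity.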