@ravishchawla
Last active June 27, 2018 18:59
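# Assumes `sentences` (list of str) and `review_cleans` (list of token lists) are defined earlier
import numpy as np
import matplotlib.pyplot as plt
from keras.preprocessing.text import Tokenizer  # TensorFlow 2.x: tensorflow.keras.preprocessing.text

# Fit a Keras Tokenizer on the sentences
tokenizer = Tokenizer()
tokenizer.fit_on_texts(sentences)
# Sequences differ in length, so store them in an object array
text_sequences = np.array(tokenizer.texts_to_sequences(sentences), dtype=object)
# sequence_dict maps word -> integer index; word_dict is the reverse map (index -> word)
sequence_dict = tokenizer.word_index
word_dict = {num: word for word, num in sequence_dict.items()}

# Encode each cleaned review (raises KeyError for a token the tokenizer never saw)
reviews_encoded = []
for review in review_cleans:
    reviews_encoded.append([sequence_dict[x] for x in review])

# Plot a histogram of review lengths
lengths = [len(x) for x in reviews_encoded]
with plt.xkcd():
    plt.hist(lengths, bins=range(100))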
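
The word-to-index mapping and the encoding step above can be tried without Keras. The following is a minimal sketch, not the Tokenizer's actual implementation: it assumes reviews are already split into tokens, and it mimics only the convention that `word_index` starts at 1 and ranks more frequent words first. The names `build_word_index`, `encode`, `decode` and the toy `reviews` data are hypothetical, introduced purely for illustration.

```python
from collections import Counter

def build_word_index(token_lists):
    """Map each word to an integer, starting at 1, most frequent word first."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def encode(tokens, word_index):
    """Replace tokens with their indices, skipping unseen tokens (an assumption;
    the snippet above would raise KeyError instead)."""
    return [word_index[t] for t in tokens if t in word_index]

def decode(ids, index_word):
    """Map indices back to words using the reversed dictionary."""
    return [index_word[i] for i in ids]

# Toy data standing in for `review_cleans`
reviews = [["great", "movie"], ["bad", "movie"], ["great", "great", "cast"]]

word_index = build_word_index(reviews)                # plays the role of sequence_dict
index_word = {i: w for w, i in word_index.items()}    # plays the role of word_dict
encoded = [encode(r, word_index) for r in reviews]    # plays the role of reviews_encoded
lengths = [len(e) for e in encoded]                   # input to the histogram
```

Decoding with `index_word` should return the original tokens, which is a quick way to check that the two dictionaries are true inverses before plotting the lengths.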