Last active
June 27, 2018 18:59
import matplotlib.pyplot as plt
import numpy as np
from keras.preprocessing.text import Tokenizer

# Use a Keras Tokenizer and fit it on the sentences
tokenizer = Tokenizer()
tokenizer.fit_on_texts(sentences)
# Sequences differ in length, so store them in an object array
text_sequences = np.array(tokenizer.texts_to_sequences(sentences), dtype=object)
# sequence_dict maps word -> encoding; word_dict is the inverse (encoding -> word)
sequence_dict = tokenizer.word_index
word_dict = {num: word for word, num in sequence_dict.items()}
# Generate encoded reviews (a token missing from sequence_dict raises KeyError)
reviews_encoded = []
for review in review_cleans:
    reviews_encoded.append([sequence_dict[x] for x in review])
# Plot a histogram of review lengths
lengths = [len(x) for x in reviews_encoded]
with plt.xkcd():
    plt.hist(lengths, bins=range(100))