@DanielDaCosta, created May 6, 2020
Keras tokenizer for Medium
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split

# df is assumed to be a pandas DataFrame with a 'message' text column
# Keep only the vocabulary_size most frequent words; rarer words are
# dropped by texts_to_sequences (the full word_index is still built)
vocabulary_size = 20000
tokenizer = Tokenizer(num_words=vocabulary_size)

# Build the word -> integer index from the corpus, then map each
# message to its sequence of integer ids
tokenizer.fit_on_texts(df['message'])
sequences = tokenizer.texts_to_sequences(df['message'])
# Pad/truncate every sequence to the same length MAXLEN (shorter
# sequences are zero-padded, longer ones truncated; both 'pre' by default)
MAXLEN = 50
X = pad_sequences(sequences, maxlen=MAXLEN)
# output_columns_all is assumed to be the list of label column names
y = df[output_columns_all]

# Hold out 15% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=42)