@rohithteja
Created August 23, 2021 17:59
LSTM vectorization
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from sklearn.model_selection import train_test_split

maxlen = 100          # maximum sequence length after padding
embedding_dim = 100   # embedding size used later by the model

# df is assumed to be a pandas DataFrame with "text" and "sentiment" columns
x = df.text.values
y = df.sentiment.astype("category").cat.codes.values

# train / validation / test split (80 / 10 / 10), stratified on the labels
x_train, xtest, y_train, ytest = train_test_split(x, y, stratify=y,
                                                  test_size=0.20,
                                                  random_state=42)
x_val, x_test, y_val, y_test = train_test_split(xtest, ytest,
                                                stratify=ytest,
                                                test_size=0.5,
                                                random_state=42)

# one-hot encode the integer class labels
y_train = to_categorical(y_train)
y_val = to_categorical(y_val)
y_test = to_categorical(y_test)

# tokenizing and padding: keep the 5,000 most frequent words
tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(df.text.values)
X_train = tokenizer.texts_to_sequences(x_train)
X_val = tokenizer.texts_to_sequences(x_val)
X_test = tokenizer.texts_to_sequences(x_test)
# note: word_index holds every word seen in fit_on_texts, not just the top 5,000
vocab_size = len(tokenizer.word_index) + 1

# pad (or truncate) every sequence to length maxlen, padding at the front
X_train = pad_sequences(X_train, padding='pre', maxlen=maxlen)
X_val = pad_sequences(X_val, padding='pre', maxlen=maxlen)
X_test = pad_sequences(X_test, padding='pre', maxlen=maxlen)
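
The padded sequences and one-hot labels produced above feed directly into an Embedding + LSTM classifier. The sketch below is only an assumption about how such a model might be wired up, reusing vocab_size, embedding_dim, and maxlen from the snippet; the layer sizes, epochs, and batch size are illustrative and not taken from the gist.

from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

num_classes = y_train.shape[1]  # number of sentiment classes after to_categorical

model = Sequential([
    Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=maxlen),
    LSTM(64),                                  # illustrative hidden size
    Dense(num_classes, activation='softmax'),  # one probability per class
])
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

model.fit(X_train, y_train,
          validation_data=(X_val, y_val),
          epochs=5, batch_size=64)             # illustrative training settings
model.evaluate(X_test, y_test)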