Irfan Danish irdanish11

@irdanish11
irdanish11 / preprocessing.py
Last active October 16, 2019 08:14
Removes redundant data, cleans the data, and drops sentences that are too short or too long.
import time

# Remove duplicate descriptions, keeping the first occurrence of each
start_time = time.time()
unique_data = []
for i in range(len(data)):
    if data['description'][i] not in unique_data:
        unique_data.append(data['description'][i])
    # Report progress every 5000 rows
    if i % 5000 == 0:
        print('{0} lines have been processed'.format(i))
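The membership test above scans a list on every row, so the loop is quadratic in the number of rows. Below is a minimal sketch of the same order-preserving de-duplication using a set, combined with the length filter the description mentions. The function name `dedupe_and_filter` and the `min_words` / `max_words` thresholds are hypothetical and not taken from the gist.

```python
def dedupe_and_filter(lines, min_words=3, max_words=50):
    """Keep the first occurrence of each line whose word count is in range.

    min_words and max_words are illustrative thresholds (assumptions).
    """
    seen = set()   # set membership is O(1) on average
    kept = []
    for line in lines:
        if line in seen:
            continue
        seen.add(line)
        n_words = len(line.split())
        if min_words <= n_words <= max_words:
            kept.append(line)
    return kept


sample = [
    "a red wine with soft tannins",
    "too short",
    "a red wine with soft tannins",
]
print(dedupe_and_filter(sample))
```

The set only tracks what has been seen; the list keeps the original order of the surviving lines.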
@irdanish11
irdanish11 / corpus_read.py
Last active October 16, 2019 08:25
Reading and splitting the data into tokens.
def read_file(filepath):
    # Read the whole corpus into a single string
    with open(filepath) as f:
        str_text = f.read()
    return str_text

text = read_file('NameofYourFile.txt')
# Split on single spaces to get the list of tokens
tokens = text.split(" ")
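Splitting on a single space behaves differently from splitting on any whitespace, which matters if the corpus contains newlines or repeated spaces. A small illustration on a made-up string (the sample text is an assumption, not data from the gist):

```python
text = "the quick  brown\nfox"

# split(" ") splits on single spaces only: the double space yields an empty
# token and the newline stays glued between the neighbouring words.
by_space = text.split(" ")

# split() with no argument splits on any run of whitespace.
by_whitespace = text.split()

print(by_space)
print(by_whitespace)
```

Whether `split()` is preferable depends on how the corpus was cleaned beforehand; if every line break was already replaced by a single space, both calls give the same tokens.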
@irdanish11
irdanish11 / sequences.py
Created October 16, 2019 08:27
Converts the whole text into sequences of four words, then counts the unique words and how many times each word appears in the corpus.
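train_len = 3+1  # three input words plus one target word
text_sequences = []
# Slide a window of train_len tokens over the corpus; the +1 keeps the final window
for i in range(train_len, len(tokens) + 1):
    seq = tokens[i-train_len:i]
    text_sequences.append(seq)

# Count the unique words and their occurrences
sequences = {}
count = 1
for i in range(len(tokens)):
    if tokens[i] not in sequences: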
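The windowing and counting steps described above can be sketched compactly as follows. This is an illustration on a toy sentence, not the gist's full code: `make_windows` is a hypothetical helper, `collections.Counter` stands in for the hand-written counting loop, and the range ends at `len(tokens) + 1` so that the final window is included.

```python
from collections import Counter


def make_windows(tokens, train_len=4):
    """Return every run of train_len consecutive tokens, in order."""
    return [tokens[i - train_len:i] for i in range(train_len, len(tokens) + 1)]


tokens = "the cat sat on the mat".split()
windows = make_windows(tokens)
counts = Counter(tokens)   # word -> number of occurrences

print(windows)
print(len(counts), counts["the"])
```

A corpus of n tokens yields n - train_len + 1 windows, so the six-token sentence gives three.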
@irdanish11
irdanish11 / tokenize.py
Created October 16, 2019 08:30
Maps each word in the text sequences to a unique integer, then converts the list of sequences into a NumPy array.
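# Imports assume standalone Keras 2.x; under TensorFlow use tensorflow.keras instead
import numpy as np
from keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer()
tokenizer.fit_on_texts(text_sequences)
sequences = tokenizer.texts_to_sequences(text_sequences)

# Collecting some information
vocabulary_size = len(tokenizer.word_counts)

# Copy the integer sequences into a (num_sequences, train_len) array
n_sequences = np.empty([len(sequences), train_len], dtype='int32')
for i in range(len(sequences)):
    n_sequences[i] = sequences[i]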
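To make the word-to-integer mapping concrete, here is a pure-Python sketch of the idea: more frequent words get smaller indices, and numbering starts at 1 so that 0 stays free. It only imitates the general behaviour of the Keras `Tokenizer` (no lowercasing or punctuation filtering, and tie-breaking may differ); `fit_word_index` and `encode` are hypothetical helper names.

```python
from collections import Counter


def fit_word_index(sequences):
    """Assign indices 1, 2, 3, ... to words in order of descending frequency."""
    counts = Counter(word for seq in sequences for word in seq)
    # most_common() sorts by count; ties keep first-seen order
    return {word: idx for idx, (word, _) in enumerate(counts.most_common(), start=1)}


def encode(sequences, word_index):
    """Replace every word with its integer index."""
    return [[word_index[w] for w in seq] for seq in sequences]


text_sequences = [
    ["the", "cat", "sat", "on"],
    ["cat", "sat", "on", "the"],
    ["sat", "on", "the", "mat"],
]
word_index = fit_word_index(text_sequences)
encoded = encode(text_sequences, word_index)

print(word_index)
print(encoded)
```

Because every row has the same length, the encoded list is rectangular and can be copied into a fixed-shape integer array as the gist does.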
@irdanish11
irdanish11 / split_data
Created October 16, 2019 08:35
Splits the sequences into inputs and output labels for the model. Since each sequence has four words, the first three are the input and the fourth is the label the model learns to predict. The labels are then converted into one-hot vectors, i.e. vectors of 0s with a single 1.
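from keras.utils import to_categorical

train_inputs = n_sequences[:, :-1]   # first three words of each sequence
train_targets = n_sequences[:, -1]   # fourth word is the label
# Tokenizer indices start at 1, so one extra class covers the unused index 0
train_targets = to_categorical(train_targets, num_classes=vocabulary_size+1)
seq_len = train_inputs.shape[1]
train_inputs.shape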
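The split and the one-hot step can be illustrated without NumPy or Keras. The sketch below uses plain lists and toy indices (the `rows` values and the helper names are assumptions); it also shows why the number of classes is `vocabulary_size + 1`: word indices run from 1 to `vocabulary_size`, so the largest index needs a vector with one extra slot.

```python
def split_inputs_targets(rows):
    """First len-1 items of each row are inputs; the last item is the target."""
    inputs = [row[:-1] for row in rows]
    targets = [row[-1] for row in rows]
    return inputs, targets


def one_hot(index, num_classes):
    """Vector of zeros with a single 1 at position index."""
    vec = [0] * num_classes
    vec[index] = 1
    return vec


rows = [[1, 4, 2, 3], [4, 2, 3, 1], [2, 3, 1, 5]]   # toy encoded sequences
vocabulary_size = 5

inputs, targets = split_inputs_targets(rows)
labels = [one_hot(t, vocabulary_size + 1) for t in targets]

print(inputs)
print(labels)
```

With only `vocabulary_size` classes, the target index 5 in the last row would fall outside the vector.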
@irdanish11
irdanish11 / model.py
Created October 16, 2019 08:39
Creates the LSTM model with an embedding layer, two stacked LSTM layers, and two fully connected layers. The Adam optimizer is used with a learning rate of 0.001, and categorical_crossentropy is used to compute the loss.
# Imports assume standalone Keras 2.x; under TensorFlow use tensorflow.keras instead
from keras import optimizers
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

def create_model(vocabulary_size, seq_len):
    model = Sequential()
    model.add(Embedding(vocabulary_size, seq_len, input_length=seq_len))
    model.add(LSTM(50, return_sequences=True))
    model.add(LSTM(50))
    model.add(Dense(50, activation='relu'))
    model.add(Dense(vocabulary_size, activation='softmax'))
    opt_adam = optimizers.Adam(lr=0.001)
    # You can simply pass 'adam' as the optimizer in compile(); its default learning rate is 0.001.
    # Here the Adam optimizer is built from the optimizers module so the learning rate can be changed.
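As a sanity check on the architecture, the trainable parameter count of each layer can be worked out by hand using the standard formulas for embedding, LSTM, and dense layers. The sketch below assumes a hypothetical vocabulary of 1000 classes; the layer sizes (embedding width equal to `seq_len`, 50 LSTM units, 50 dense units) follow the code above.

```python
def embedding_params(vocab, dim):
    # one dim-sized vector per vocabulary entry
    return vocab * dim


def lstm_params(input_dim, units):
    # four gates, each with input weights, recurrent weights, and a bias
    return 4 * ((input_dim + units) * units + units)


def dense_params(inputs, units):
    # weight matrix plus one bias per unit
    return inputs * units + units


vocab = 1000   # hypothetical value of vocabulary_size + 1
seq_len = 3

layers = {
    "embedding": embedding_params(vocab, seq_len),
    "lstm_1": lstm_params(seq_len, 50),
    "lstm_2": lstm_params(50, 50),
    "dense_relu": dense_params(50, 50),
    "dense_softmax": dense_params(50, vocab),
}
total = sum(layers.values())

for name, n in layers.items():
    print(name, n)
print("total", total)
```

Under these assumptions the softmax layer dominates the count, since it grows linearly with the vocabulary while the LSTM layers do not.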
@irdanish11
irdanish11 / train.py
Created October 16, 2019 08:42
Calls create_model to build the graph. A checkpoint callback is created that checks the loss after every epoch and stores the best weights seen so far. The model is then trained for 500 epochs, and the tokenizer object is saved for later use.
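from pickle import dump  # assumed: dump is pickle.dump
from keras.callbacks import ModelCheckpoint

model = create_model(vocabulary_size+1, seq_len)
path = './checkpoints/word_pred_Model4.h5'
# Save the model only when the training loss improves
checkpoint = ModelCheckpoint(path, monitor='loss', verbose=1, save_best_only=True, mode='min')
model.fit(train_inputs, train_targets, batch_size=128, epochs=500, verbose=1, callbacks=[checkpoint])
# Persist the fitted tokenizer so prediction reuses the same word-to-index mapping
with open('tokenizer_Model4', 'wb') as f:
    dump(tokenizer, f)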
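The effect of `save_best_only=True` with `mode='min'` can be pictured with a few lines of plain Python: the checkpoint file is rewritten only in epochs where the monitored loss is strictly lower than the best value seen so far. This is a simplified model of that rule, not Keras code; `best_epochs` and the loss values are made up for illustration.

```python
def best_epochs(losses):
    """Return the 1-based epochs at which a min-mode, best-only checkpoint would save."""
    best = float("inf")
    saved = []
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:          # strict improvement only
            best = loss
            saved.append(epoch)
    return saved


losses = [2.31, 1.90, 1.95, 1.70, 1.70]   # hypothetical per-epoch training loss
print(best_epochs(losses))
```

Because the monitored quantity here is the training loss rather than a validation loss, the saved weights are the ones that fit the training data best, which is not necessarily the point of best generalisation.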
@irdanish11
irdanish11 / Word_Prediction.py
Created October 16, 2019 12:20
Predicts one word at each time step and returns a single sentence.
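from pickle import load  # assumed: load is pickle.load
import numpy as np
from keras.models import load_model
from keras.preprocessing.sequence import pad_sequences

model = load_model('word_pred_Model4.h5')
tokenizer = load(open('tokenizer_Model4','rb'))
seq_len = 3

def gen_text(model, tokenizer, seq_len, seed_text, num_gen_words):
    output_text = []
    input_text = seed_text
    for i in range(num_gen_words):
        # Encode the running text and keep only the last seq_len word indices
        encoded_text = tokenizer.texts_to_sequences([input_text])[0]
        pad_encoded = pad_sequences([encoded_text], maxlen=seq_len, truncating='pre')
        # argmax over the softmax output; equivalent to the older model.predict_classes
        pred_word_ind = np.argmax(model.predict(pad_encoded, verbose=0), axis=-1)[0]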
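The preview of `gen_text` stops partway through the loop. The sketch below shows how such a generation loop is commonly completed, with no Keras dependency: `pad_pre` imitates left-padding and left-truncation to `seq_len`, and `toy_predict` is a stand-in for the trained model. All names and the toy vocabulary are assumptions for illustration, not the author's remaining code.

```python
def pad_pre(seq, maxlen):
    """Keep the last maxlen items and left-pad with zeros to length maxlen."""
    seq = seq[-maxlen:]
    return [0] * (maxlen - len(seq)) + seq


def generate(predict_next, word_index, seq_len, seed_text, num_gen_words):
    """Repeatedly predict the next word and feed it back into the input."""
    index_word = {i: w for w, i in word_index.items()}
    words = seed_text.split()
    output = []
    for _ in range(num_gen_words):
        encoded = [word_index[w] for w in words if w in word_index]
        window = pad_pre(encoded, seq_len)
        next_word = index_word[predict_next(window)]
        output.append(next_word)
        words.append(next_word)   # the prediction becomes part of the next input
    return " ".join(output)


word_index = {"the": 1, "cat": 2, "sat": 3, "on": 4, "mat": 5}


def toy_predict(window):
    # Stand-in for the model: returns the index after the last one, wrapping to 1
    return window[-1] % len(word_index) + 1


print(generate(toy_predict, word_index, 3, "the cat", 4))
```

With a real model, `predict_next` would wrap the call that turns the padded window into the index of the most probable next word.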

import numpy as np
import tensorflow as tf

tf.compat.v1.disable_eager_execution() # need to disable eager in TF2.x
x = tf.compat.v1.placeholder(tf.float32, shape=(1024, 1024))
y = tf.matmul(x, x)
with tf.compat.v1.Session() as sess:
    rand_array = np.random.rand(1024, 1024)
    print(sess.run(y, feed_dict={x: rand_array})) # Will succeed.