irdanish11’s gists

irdanish11 / preprocessing.py

Last active October 16, 2019 08:14

Removes redundant data, cleans the data, and drops sentences that are too short or too long.

	import time

	# Remove redundant (duplicate) lines, keeping the first occurrence of each.
	# `data` is assumed to be a pandas DataFrame with a 'description' column.
	start_time = time.time()
	unique_data = []
	for i in range(len(data)):
	    if data['description'][i] not in unique_data:
	        unique_data.append(data['description'][i])
	    if i % 5000 == 0:
	        print('{0} lines have been processed'.format(i))
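
The loop above tests membership in a list, which costs time proportional to the list's length on every iteration. A minimal sketch of an order-preserving alternative follows, together with the length filter mentioned in the description. The function names and the `min_words`/`max_words` bounds are illustrative assumptions, not values taken from the gist.

```python
def remove_duplicates(lines):
    """Return lines with duplicates dropped, keeping first occurrences in order."""
    # dict preserves insertion order (Python 3.7+), so this deduplicates in O(n).
    return list(dict.fromkeys(lines))


def filter_by_length(lines, min_words=3, max_words=50):
    """Keep lines whose whitespace-separated word count is within the bounds.

    min_words and max_words are hypothetical thresholds chosen for illustration.
    """
    return [ln for ln in lines if min_words <= len(ln.split()) <= max_words]


descriptions = [
    "a crisp white wine",
    "too short",
    "a crisp white wine",
    "ripe fruit with firm tannins",
]
unique_data = remove_duplicates(descriptions)
cleaned = filter_by_length(unique_data)
print(cleaned)
```

With a DataFrame, the same idea would start from the column converted to a list, for example `data['description'].tolist()`.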

irdanish11 / corpus_read.py

Last active October 16, 2019 08:25

Reading and splitting the data into tokens.

	def read_file(filepath):
	    with open(filepath) as f:
	        str_text = f.read()
	    return str_text

	text = read_file('NameofYourFile.txt')
	tokens = text.split(" ")
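
Note that `text.split(" ")` splits on single spaces only, so newlines stay attached to words and consecutive spaces produce empty tokens. A small sketch of the difference, using a made-up sample string; whether the gist's input file needs this handling is an assumption.

```python
sample = "the quick  brown\nfox"

# Splitting on a literal space keeps the newline inside a token
# and produces an empty string for the doubled space.
by_space = sample.split(" ")

# Calling split() with no argument splits on any run of whitespace.
by_whitespace = sample.split()

print(by_space)
print(by_whitespace)
```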

irdanish11 / sequences.py

Created October 16, 2019 08:27

Converts the whole text into overlapping sequences of four words, then counts the unique words and how many times each word appears in the corpus.

	train_len = 3+1
	text_sequences = []
	# The +1 on the upper bound includes the window that ends on the last token.
	for i in range(train_len,len(tokens)+1):
	    seq = tokens[i-train_len:i]
	    text_sequences.append(seq)

	sequences = {}
	count = 1
	for i in range(len(tokens)):
	    if tokens[i] not in sequences:
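
The two steps in the description, building four-word windows and counting words, can be sketched as follows. `make_windows` is a hypothetical helper name, and `collections.Counter` is swapped in for the manual dictionary loop, whose body is not shown in the listing.

```python
from collections import Counter


def make_windows(tokens, train_len):
    """Return every run of train_len consecutive tokens."""
    # The +1 makes the final window end on the last token.
    return [tokens[i:i + train_len] for i in range(len(tokens) - train_len + 1)]


tokens = "the cat sat on the mat".split()
text_sequences = make_windows(tokens, train_len=4)
word_counts = Counter(tokens)

print(text_sequences)
print(len(word_counts), word_counts["the"])
```

Six tokens with a window of four give three sequences; `len(word_counts)` is the number of unique words.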

irdanish11 / tokenize.py

Created October 16, 2019 08:30

Maps each word in the text sequences to a unique integer, then converts the list of sequences into a NumPy array.

	import numpy as np
	from keras.preprocessing.text import Tokenizer

	tokenizer = Tokenizer()
	tokenizer.fit_on_texts(text_sequences)
	sequences = tokenizer.texts_to_sequences(text_sequences)

	#Collecting some information
	vocabulary_size = len(tokenizer.word_counts)

	# Copy the list of integer sequences into a 2-D int32 array, one row per sequence.
	n_sequences = np.empty([len(sequences),train_len], dtype='int32')
	for i in range(len(sequences)):
	    n_sequences[i] = sequences[i]
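
Keras's `Tokenizer` assigns integer indices starting at 1, gives more frequent words smaller indices, and leaves 0 unused so it can serve as padding. The sketch below imitates that mapping in plain Python to make the idea concrete. `build_word_index` and `encode` are hypothetical helpers, not part of Keras, and they omit the text normalization options the real class offers.

```python
from collections import Counter


def build_word_index(sequences):
    """Map each word to an integer, most frequent word first, starting at 1."""
    counts = Counter(word for seq in sequences for word in seq)
    return {word: idx for idx, (word, _) in enumerate(counts.most_common(), start=1)}


def encode(sequences, word_index):
    """Replace every word with its integer index."""
    return [[word_index[word] for word in seq] for seq in sequences]


text_sequences = [["the", "cat", "sat", "on"], ["cat", "sat", "on", "the"]]
word_index = build_word_index(text_sequences)
encoded = encode(text_sequences, word_index)

print(word_index)
print(encoded)
```

Because no word ever receives index 0, later steps size the output layer as the vocabulary size plus one.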

irdanish11 / split_data

Created October 16, 2019 08:35

Splits the sequences into inputs and output labels for the model. Because each sequence has four words, the first three words are the input and the fourth is the label the model learns to predict. The labels are then converted into one-hot vectors, i.e. vectors of 0s with a single 1 at the label's index.

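	from keras.utils import to_categorical

	train_inputs = n_sequences[:,:-1]
	train_targets = n_sequences[:,-1]

	# Word indices start at 1, so one extra class covers the unused index 0.
	train_targets = to_categorical(train_targets, num_classes=vocabulary_size+1)
	seq_len = train_inputs.shape[1]
	train_inputs.shape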
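
The split and one-hot step can be illustrated without NumPy or Keras. The sketch below uses plain lists and made-up index values. `one_hot` is a hypothetical helper standing in for `to_categorical`, and the number of classes is the vocabulary size plus one because index 0 is never assigned to a word.

```python
def one_hot(index, num_classes):
    """Return a list of zeros with a single 1 at position index."""
    vector = [0] * num_classes
    vector[index] = 1
    return vector


# Two encoded four-word sequences; the values are invented for illustration.
n_sequences = [[1, 2, 3, 4], [2, 3, 4, 1]]
vocabulary_size = 4

train_inputs = [seq[:-1] for seq in n_sequences]   # first three words
train_labels = [seq[-1] for seq in n_sequences]    # fourth word
train_targets = [one_hot(label, vocabulary_size + 1) for label in train_labels]

print(train_inputs)
print(train_targets)
```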

irdanish11 / model.py

Created October 16, 2019 08:39

Creates the LSTM model: an embedding layer, two stacked LSTM layers, and two fully connected layers. The Adam optimizer is used with a learning rate of 0.001, and categorical_crossentropy computes the loss.

	from keras import optimizers
	from keras.layers import Dense, Embedding, LSTM
	from keras.models import Sequential

	def create_model(vocabulary_size, seq_len):
	    model = Sequential()
	    model.add(Embedding(vocabulary_size, seq_len,input_length=seq_len))
	    model.add(LSTM(50,return_sequences=True))
	    model.add(LSTM(50))
	    model.add(Dense(50,activation='relu'))
	    model.add(Dense(vocabulary_size,activation='softmax'))
	    # Older Keras releases spell this optimizers.adam(lr=0.001).
	    opt_adam = optimizers.Adam(learning_rate=0.001)
	    #You can simply pass 'adam' to optimizer in compile method. Default learning rate 0.001
	    #But here we are using adam optimizer from optimizer class to change the LR.
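
The final `softmax` layer turns the network's raw scores into a probability for every word index, and `categorical_crossentropy` penalizes the model by the negative log of the probability it gave to the correct word. A minimal plain-Python sketch of both calculations for a single example follows. The score values are invented and the function names are illustrative, not the Keras implementations.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() from overflowing.
    shifted = [s - max(scores) for s in scores]
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_crossentropy(one_hot_target, probabilities):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(p) for t, p in zip(one_hot_target, probabilities))


scores = [2.0, 1.0, 0.1]   # hypothetical outputs for a three-word vocabulary
target = [1, 0, 0]         # the first word is the correct next word

probs = softmax(scores)
loss = categorical_crossentropy(target, probs)
print(probs, loss)
```

The loss shrinks toward 0 as the probability of the correct word approaches 1.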

irdanish11 / train.py

Created October 16, 2019 08:42

Calls create_model to build the network. A checkpoint callback saves the model whenever the training loss reaches a new minimum. The model is then trained for 500 epochs, and the tokenizer object is saved for later use.

	from pickle import dump
	from keras.callbacks import ModelCheckpoint

	model = create_model(vocabulary_size+1,seq_len)
	path = './checkpoints/word_pred_Model4.h5'
	# Save the model each time the training loss reaches a new minimum.
	checkpoint = ModelCheckpoint(path, monitor='loss', verbose=1, save_best_only=True, mode='min')
	model.fit(train_inputs,train_targets,batch_size=128,epochs=500,verbose=1,callbacks=[checkpoint])
	with open('tokenizer_Model4','wb') as f:
	    dump(tokenizer,f)
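
Saving the tokenizer with `pickle` lets the prediction script reuse the exact word-to-index mapping from training. A small sketch of the round trip follows, using a plain dictionary as a stand-in for the tokenizer and a temporary directory so no files are left behind. Both are assumptions made for illustration.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted tokenizer; any picklable object works the same way.
word_index = {"the": 1, "cat": 2, "sat": 3}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "tokenizer_demo")

    # The with-blocks close the file handles even if pickling fails.
    with open(path, "wb") as f:
        pickle.dump(word_index, f)

    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == word_index)
```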

irdanish11 / Word_Prediction.py

Created October 16, 2019 12:20

Predicts one word per time step and returns the generated words as a single sentence.

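	import numpy as np
	from pickle import load
	from keras.models import load_model
	from keras.preprocessing.sequence import pad_sequences

	# Same path the training script passed to ModelCheckpoint.
	model = load_model('./checkpoints/word_pred_Model4.h5')
	with open('tokenizer_Model4','rb') as f:
	    tokenizer = load(f)
	seq_len = 3
	def gen_text(model, tokenizer, seq_len, seed_text, num_gen_words):
	    output_text = []
	    input_text = seed_text
	    for i in range(num_gen_words):
	        encoded_text = tokenizer.texts_to_sequences([input_text])[0]
	        # Keep the last seq_len indices, left-padding with 0 if the seed is shorter.
	        pad_encoded = pad_sequences([encoded_text], maxlen=seq_len,truncating='pre')
	        # predict_classes was removed from recent Keras releases; argmax over the softmax output is equivalent.
	        pred_word_ind = np.argmax(model.predict(pad_encoded,verbose=0), axis=-1)[0]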
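
A generation loop of this kind repeatedly encodes the most recent words, predicts the next index, maps it back to a word, and appends that word to the input. The sketch below shows that cycle in plain Python. It uses a hypothetical `predict_next` function in place of the trained model and a hand-written vocabulary, so it illustrates the general pattern and is not the remainder of the gist's function.

```python
def pad_pre(indices, seq_len):
    """Keep the last seq_len indices, left-padding with 0 when there are fewer."""
    trimmed = indices[-seq_len:]
    return [0] * (seq_len - len(trimmed)) + trimmed


def gen_text(predict_next, word_index, seq_len, seed_text, num_gen_words):
    """Generate num_gen_words words, feeding each prediction back as input."""
    index_word = {idx: word for word, idx in word_index.items()}
    output_text = []
    input_text = seed_text
    for _ in range(num_gen_words):
        # Unknown words are skipped, as a fitted tokenizer without an OOV token would do.
        encoded = [word_index[w] for w in input_text.split() if w in word_index]
        padded = pad_pre(encoded, seq_len)
        pred_word = index_word[predict_next(padded)]
        input_text += " " + pred_word
        output_text.append(pred_word)
    return " ".join(output_text)


word_index = {"the": 1, "cat": 2, "sat": 3, "down": 4}


def predict_next(padded):
    """Toy stand-in for the model: predict the index after the last one, wrapping around."""
    return padded[-1] % len(word_index) + 1


print(gen_text(predict_next, word_index, seq_len=3, seed_text="the cat", num_gen_words=3))
```

With a real model, `predict_next` would return the argmax of the softmax output for the padded input.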

irdanish11 / tf_compat.py

Last active June 22, 2020 02:36

	import numpy as np
	import tensorflow as tf

	tf.compat.v1.disable_eager_execution() # need to disable eager in TF2.x
	x = tf.compat.v1.placeholder(tf.float32, shape=(1024, 1024))
	y = tf.matmul(x, x)

	with tf.compat.v1.Session() as sess:
	    rand_array = np.random.rand(1024, 1024)
	    print(sess.run(y, feed_dict={x: rand_array})) # Will succeed.

Irfan Danish irdanish11