Created
March 15, 2019 09:12
How to encode and pad texts for machine learning using Keras
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import hashing_trick

def preprocessing(questions, questions_max_length, vocabulary_size):
    """
    Statelessly preprocess the text questions: hash-encode each word to an
    integer index and pad each question to the maximal question length.
    The hash space is sized to about 1.3x the vocabulary size to reduce
    collisions. Padding zeros are appended at the end of each question
    ("post" padding).
    @param questions: the text questions as a list of strings
    @param questions_max_length: the (global) maximal length of a question in words
    @param vocabulary_size: the (global) number of known words
    @return: the padded and encoded questions as an integer matrix
    """
    encoded_questions = [hashing_trick(question, round(vocabulary_size * 1.3), hash_function="md5")
                         for question in questions]
    padded_questions = pad_sequences(encoded_questions, maxlen=questions_max_length, padding="post")
    return padded_questions
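To illustrate what the two Keras calls do without requiring TensorFlow, here is a minimal pure-Python sketch. The word-to-index formula mirrors Keras' `hashing_trick` with `hash_function="md5"` (index in `[1, n)` so that 0 stays reserved for padding); `pad_post` appends zeros like `padding="post"`. The function names, the example questions, and the vocabulary size 100 are illustrative assumptions, and unlike Keras' default `truncating="pre"`, this sketch truncates overlong questions from the end.

```python
import hashlib

def hash_encode(text, n):
    # Map each lowercased word to an integer in [1, n) via md5,
    # mirroring keras' hashing_trick(text, n, hash_function="md5").
    words = text.lower().split()
    return [int(hashlib.md5(w.encode()).hexdigest(), 16) % (n - 1) + 1
            for w in words]

def pad_post(sequences, maxlen):
    # Truncate to maxlen and append zeros at the end ("post" padding).
    # Note: truncation here is from the end, for simplicity.
    return [s[:maxlen] + [0] * (maxlen - len(s[:maxlen])) for s in sequences]

# Hypothetical usage with vocabulary_size = 100, so the hash space
# is round(100 * 1.3) = 130, and questions_max_length = 6.
questions = ["where is the ball", "what color is it"]
encoded = [hash_encode(q, round(100 * 1.3)) for q in questions]
padded = pad_post(encoded, 6)  # each row now has exactly 6 entries
```

Reserving index 0 for padding matters because downstream embedding layers are typically configured to mask or ignore index 0.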