Created
July 8, 2020 08:47
# 'Raw Data'
ds = ['I am writing articles on medium.', 'Medium is good as a platform.']
# pass 1 - lowercase, strip punctuation
ds = [x.lower().replace('.', '') for x in ds]
# ['i am writing articles on medium', 'medium is good as a platform']
# pass 2 - tokenize
ds_tok = [x.split(' ') for x in ds]
## intermediate step - build vocab
flat_ds = [item for sublist in ds_tok for item in sublist]
vocab = list(dict.fromkeys(flat_ds))  # distinct tokens, in first-seen order
# enumerate the distinct tokens (not flat_ds) so the indices are contiguous
vocab_map = {tok: idx for idx, tok in enumerate(vocab)}  # i = 0, am = 1, ...
# pass 3 - produce indices to use for embeddings
ds_for_train = [[vocab_map[x] for x in sublist] for sublist in ds_tok]
# [[0, 1, 2, 3, 4, 5], [5, 6, 7, 8, 9, 10]]
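The three passes above can also be wrapped in small functions so that the same vocabulary is reused on unseen text. The following is only a minimal sketch under stated assumptions: the helper names `preprocess`, `build_vocab`, and `encode`, and the extra unknown-token index `UNK`, are hypothetical and do not appear in the original snippet. It also strips all ASCII punctuation rather than only `'.'`.

```python
import string

# Hypothetical helpers, not part of the original snippet.

def preprocess(text):
    """Passes 1 and 2: lowercase, strip ASCII punctuation, split on whitespace."""
    table = str.maketrans('', '', string.punctuation)
    return text.lower().translate(table).split()

def build_vocab(tokenized):
    """Map each distinct token to a contiguous index, in first-seen order."""
    vocab_map = {}
    for sentence in tokenized:
        for tok in sentence:
            # setdefault only inserts when the token has not been seen yet
            vocab_map.setdefault(tok, len(vocab_map))
    return vocab_map

def encode(tokens, vocab_map, unk_index):
    """Pass 3: replace tokens with indices; unseen tokens get unk_index."""
    return [vocab_map.get(tok, unk_index) for tok in tokens]

corpus = ['I am writing articles on medium.', 'Medium is good as a platform.']
tokenized = [preprocess(s) for s in corpus]
vocab_map = build_vocab(tokenized)
UNK = len(vocab_map)  # assumption: one extra index reserved for unknown tokens

encoded = [encode(t, vocab_map, UNK) for t in tokenized]
print(encoded)  # [[0, 1, 2, 3, 4, 5], [5, 6, 7, 8, 9, 10]]

# 'great' was never seen while building the vocab, so it maps to UNK (11)
new = encode(preprocess('Medium is a great platform!'), vocab_map, UNK)
print(new)  # [5, 6, 9, 11, 10]
```

Reserving an index for unknown tokens means an embedding table used with these indices would need `len(vocab_map) + 1` rows.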