@ntakouris
Created July 8, 2020 08:47
# 'Raw Data'
ds = ['I am writing articles on medium.', 'Medium is good as a platform.']
# pass 1 - lowercase, strip punctuation
ds = [x.lower().replace('.', '') for x in ds]
# ['i am writing articles on medium', 'medium is good as a platform']
# pass 2 - tokenize
ds_tok = [x.split(' ') for x in ds]
## intermediate step - build vocab
flat_ds = [item for sublist in ds_tok for item in sublist]
vocab = list(dict.fromkeys(flat_ds)) # distinct tokens, in first-seen order
vocab_map = {tok: i for i, tok in enumerate(vocab)} # i = 0, am = 1, ...
# pass 3 - produce indices to use for embeddings
ds_for_train = [[vocab_map[x] for x in sublist] for sublist in ds_tok]
# [[0, 1, 2, 3, 4, 5], [5, 6, 7, 8, 9, 10]]
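The same three passes can be wrapped in small reusable functions. This is a minimal sketch, not part of the original gist: the helper names (`preprocess`, `build_vocab`, `encode`) and the reserved `<unk>` token at index 0 are assumptions added for illustration, so that tokens missing from the vocabulary still map to a valid embedding row.

```python
import string


def preprocess(text):
    """Lowercase, strip ASCII punctuation, and split on whitespace."""
    cleaned = text.lower().translate(str.maketrans('', '', string.punctuation))
    return cleaned.split()


def build_vocab(tokenized, unk_token='<unk>'):
    """Map each distinct token to a contiguous index.

    Index 0 is reserved for unk_token (an assumption of this sketch).
    """
    vocab = {unk_token: 0}
    for sentence in tokenized:
        for tok in sentence:
            # setdefault only inserts when the token has not been seen yet
            vocab.setdefault(tok, len(vocab))
    return vocab


def encode(tokens, vocab, unk_token='<unk>'):
    """Turn tokens into indices, falling back to the unk index."""
    return [vocab.get(tok, vocab[unk_token]) for tok in tokens]


raw = ['I am writing articles on medium.', 'Medium is good as a platform.']
tokenized = [preprocess(x) for x in raw]
vocab = build_vocab(tokenized)
encoded = [encode(sentence, vocab) for sentence in tokenized]
# encoded == [[1, 2, 3, 4, 5, 6], [6, 7, 8, 9, 10, 11]]

unseen = encode(preprocess('Medium is great!'), vocab)
# unseen == [6, 7, 0]  ('great' is not in the vocabulary)
```

With this layout an embedding table needs `len(vocab)` rows (12 here), because the indices are contiguous and start at 0.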