@analyticsindiamagazine
Last active December 2, 2019 03:26
#TESTING
max_seq_length = 128  # maximum sequence length in tokens, including [CLS] and [SEP]

# An example of tokenizing a single story
s1 = train['STORY'].iloc[0]
stokens1 = tokenizer.tokenize(s1)
stokens1 = ["[CLS]"] + stokens1 + ["[SEP]"]  # add BERT's special tokens

input_ids1 = get_ids(stokens1, tokenizer, max_seq_length)
input_masks1 = get_masks(stokens1, max_seq_length)
input_segments1 = get_segments(stokens1, max_seq_length)

print("IDS # len:", len(input_ids1), " ::: ", input_ids1)
print("MASKS # len:", len(input_masks1), " ::: ", input_masks1)
print("SEGMENTS # len:", len(input_segments1), " ::: ", input_segments1)
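The helpers `get_ids`, `get_masks`, and `get_segments` are assumed to be defined elsewhere in this gist. A minimal sketch of what they typically do for BERT inputs (the padding convention with zeros is an assumption; only `tokenizer.convert_tokens_to_ids` is taken from the standard BERT tokenizer API):

```python
def get_ids(tokens, tokenizer, max_seq_length):
    """Map tokens to vocabulary ids, zero-padded to max_seq_length."""
    token_ids = tokenizer.convert_tokens_to_ids(tokens)
    return token_ids + [0] * (max_seq_length - len(token_ids))

def get_masks(tokens, max_seq_length):
    """Attention mask: 1 for real tokens, 0 for padding positions."""
    return [1] * len(tokens) + [0] * (max_seq_length - len(tokens))

def get_segments(tokens, max_seq_length):
    """Segment ids: 0 up to and including the first [SEP], 1 after it
    (the second segment only appears for sentence-pair inputs)."""
    segments = []
    current_segment = 0
    for token in tokens:
        segments.append(current_segment)
        if token == "[SEP]":
            current_segment = 1
    return segments + [0] * (max_seq_length - len(tokens))
```

For a single-sentence input like the one above, all three lists have length `max_seq_length`, the mask is 1 only over the real tokens, and the segment ids are all 0.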