@Shivampanwar
Created September 12, 2019 08:08
Tokenizes and converts data into Bert format
# Imports assumed by this snippet: pad_sequences from Keras, tqdm_notebook for
# progress bars, and BertTokenizer from the transformers library
# (pytorch-pretrained-bert at the time this gist was written).
from keras.preprocessing.sequence import pad_sequences
from tqdm import tqdm_notebook
from transformers import BertTokenizer

# Lowercase the reviews to match the uncased BERT vocabulary.
train_df.review = train_df.review.str.lower()
sentences = train_df.review.values

# BERT expects a [CLS] token at the start and a [SEP] token at the end of each sentence.
sentences = ["[CLS] " + sentence + " [SEP]" for sentence in sentences]
labels = train_df.sentiment.values

# Load the BERT tokenizer and split each sentence into WordPiece tokens.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
tokenized_texts = [tokenizer.tokenize(sent) for sent in sentences]
print("Tokenize the first sentence:")
print(tokenized_texts[0])

# Map each token to its id in the BERT vocabulary.
input_ids = []
for i in tqdm_notebook(range(len(tokenized_texts))):
    input_ids.append(tokenizer.convert_tokens_to_ids(tokenized_texts[i]))

# Truncate and right-pad every sequence to a fixed length of 256 ids.
MAX_LEN = 256
input_ids = pad_sequences(input_ids, maxlen=MAX_LEN, dtype="long",
                          truncating="post", padding="post")

# Create attention masks: 1 for real tokens, 0 for padding positions.
attention_masks = []
for seq in input_ids:
    seq_mask = [float(i > 0) for i in seq]
    attention_masks.append(seq_mask)
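The padding-and-mask step above can be sketched in plain Python, without the Keras or transformers dependencies, to show what `pad_sequences` followed by the mask loop produces for one sequence (`pad_and_mask` is a hypothetical helper, not part of the gist; the ids are illustrative):

```python
def pad_and_mask(token_ids, max_len):
    # "post" truncation: keep the first max_len ids.
    truncated = token_ids[:max_len]
    # "post" padding: right-pad with zeros up to max_len.
    padded = truncated + [0] * (max_len - len(truncated))
    # Attention mask: 1.0 where there is a real token id, 0.0 over padding.
    mask = [float(t > 0) for t in padded]
    return padded, mask

# A short sequence ([CLS] ... [SEP] in BERT ids) padded to length 6.
ids, mask = pad_and_mask([101, 2023, 3185, 102], 6)
print(ids)   # [101, 2023, 3185, 102, 0, 0]
print(mask)  # [1.0, 1.0, 1.0, 1.0, 0.0, 0.0]
```

This mirrors the real pipeline's behavior: a sequence longer than `max_len` is cut from the end, and the mask lets BERT's attention ignore the zero padding.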