Getting BERT Working!
dockerpredict.sh:

#!/bin/sh
# use this to get predictions on a test.tsv located in BERT_DATA_DIR
export OUTCOME={classification_task_name}
export NOTEBOOK=/notebooks # don't change me
docker run --runtime=nvidia -it --rm \
-v $(pwd):$NOTEBOOK/ \
-e "BERT_BASE_DIR=$NOTEBOOK/uncased_L-12_H-768_A-12" \
-e "BERT_DATA_DIR=$NOTEBOOK/data/$OUTCOME/" \
-e "BERT_OUTPUT_DIR=$NOTEBOOK/bert_output/$OUTCOME/" \
-e "TRAINED_CLASSIFIER=/notebooks/bert_output/$OUTCOME/{model_checkpoint}" \
-e "NVIDIA_VISIBLE_DEVICES=0" \
tensorflow/tensorflow:1.11.0-gpu-py3 bash predict.sh
dockertrain.sh:

#!/bin/sh
# use this script for training with some flexibility on where models get stored
export OUTCOME={classification_task_name}
export NOTEBOOK=/notebooks # don't change me
docker run --runtime=nvidia -it --rm \
-v $(pwd):$NOTEBOOK/ \
-e "BERT_BASE_DIR=$NOTEBOOK/uncased_L-12_H-768_A-12" \
-e "BERT_DATA_DIR=$NOTEBOOK/data/$OUTCOME/" \
-e "BERT_OUTPUT_DIR=$NOTEBOOK/bert_output/$OUTCOME/" \
-e "NVIDIA_VISIBLE_DEVICES=3" \
tensorflow/tensorflow:1.11.0-gpu-py3 bash train.sh
"""This script will print out the top 50 predicted values
from the test set.
Meant to be run from the command line like:
$ python get_results.py {outcome}
"""
import fire
import pandas as pd

def get_results(outcome):
    # BERT writes test_results.tsv with one tab-separated probability column per class and no header
    scores = pd.read_csv(f"./bert_output/{outcome}/test_results.tsv", sep="\t", header=None)
    # test.tsv keeps its header row ('id' and 'sentence')
    texts = pd.read_csv(f'./data/{outcome}/test.tsv', sep="\t")
    d = pd.concat([texts, scores], axis=1)
    # sort by the class-1 probability (score column 1) and print the top 50
    print(d.sort_values(1, ascending=False)[['sentence', 1]].values[:50])

if __name__ == '__main__':
    fire.Fire(get_results)

Getting BERT Working

This is mostly cribbed from this article: https://medium.com/@javedqadruddin/a-guide-to-simple-text-classification-with-bert-4d4b33f5491b

Getting the Right Docker Container

If you've got a GPU: docker pull tensorflow/tensorflow:1.11.0-gpu-py3.

If testing locally without a GPU: docker pull tensorflow/tensorflow:1.11.0-py3 (and replace the image name accordingly in the commands below).

Cloning the BERT Repository

Create a new project folder and clone the BERT repo into it: git clone https://github.com/google-research/bert.git

Getting the Pre-trained Weights

This assumes you're using the uncased base model, but you can use the others as well.

wget https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip
unzip uncased_L-12_H-768_A-12.zip

Unzipping creates an uncased_L-12_H-768_A-12/ directory with vocab.txt, bert_config.json, and the bert_model.ckpt checkpoint files that the environment variables in the scripts above point to.

Processing the Data

Create a data directory inside your project folder, with a subfolder per classification task (the scripts above expect data/{classification_task_name}/). BERT requires data in a specific format for train.tsv and dev.tsv, and a separate format for test.tsv (predictions).

Format of train and dev from the aforementioned article:

  • Column 1: an ID for the row (can be just a count, or even just the same number or letter for every row if you don’t care to keep track of each individual example),
  • Column 2: the label for the row as an int. These are the classification labels that your classifier aims to predict.
  • Column 3: a column of all the same letter. This is a throw-away column that you need to include because the BERT model expects it.
  • Column 4: the text examples you want to classify.
1    0    a    an example of text that should fit in class 0
2    1    a    an example of text that should fit in class 1
3    0    a    another class 0 example
4    2    a    a class 2 example
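
As a concrete sketch, here's one way to build that layout with pandas (df, text, and label are hypothetical names standing in for your own raw data):

import pandas as pd

# hypothetical raw data: one text column and one integer label column
df = pd.DataFrame({"text": ["an example of text", "another example"], "label": [0, 1]})

train = pd.DataFrame({
    "id": range(1, len(df) + 1),  # column 1: a row ID
    "label": df["label"],         # column 2: the int class label
    "alpha": "a",                 # column 3: the throw-away column
    "text": df["text"],           # column 4: the text to classify
})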

For test / prediction:

  • Column 1: an ID for each example, similar to column 1 in the train and dev files, and
  • Column 2: the text you want to classify. Also, test.tsv should have a header line (whereas train and dev should not).
id  sentence
1   my first test example
2   another test example. Yay this is fun!
3   yet another test example
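
A matching sketch for the test frame (same hypothetical df as above); note the real column names, since this file is exported with a header:

test = pd.DataFrame({
    "id": range(1, len(df) + 1),
    "sentence": df["text"],
})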

pandas can get your data into this format, and you can use the following to export it as a .tsv:

df.to_csv('data/{outcome}/{train/dev/test}.tsv', sep='\t', index=False, header=False)
# if you are creating test.tsv, set header=True instead of False

Helpful Hints:

  • If you're using the uncased model, be sure to .lower() your strings
  • Also be sure to replace newlines in your strings. Newlines can mess up the tsv export.
  • A good way to do all the preprocessing at once is: df[x_column].str.lower().str.replace(r"\s", " ", regex=True).
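
Applied to the hypothetical frames sketched above, that cleanup is one line each:

train["text"] = train["text"].str.lower().str.replace(r"\s", " ", regex=True)
test["sentence"] = test["sentence"].str.lower().str.replace(r"\s", " ", regex=True)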

Training the Model

Use the associated dockertrain.sh file as a template. Be sure train.sh and predict.sh (their contents are at the end of this document) are in the project folder, since the whole folder gets mounted inside the container.

Making Predictions

This will make predictions on the test.tsv file in the data directory, and output them as test_results.tsv to your BERT_OUTPUT_DIR. The output has one tab-separated probability column per class and no header.

You'll have to manually look up what the last checkpoint is for the specific model you trained, and set this as the TRAINED_CLASSIFIER environment variable. This variable doesn't point at a specific file, but at the prefix of a checkpoint name -- something like model.ckpt-208, not the .index or .meta files.
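
If you'd rather not dig through the output directory by hand, here's a small untested sketch that finds the newest checkpoint prefix, assuming checkpoints land in bert_output/{outcome}/ as in the scripts above:

import glob
import re

def latest_checkpoint(outcome):
    # each checkpoint is a set of files sharing a model.ckpt-{step} prefix;
    # there's exactly one .index file per checkpoint, so enumerate those
    paths = glob.glob(f"./bert_output/{outcome}/model.ckpt-*.index")
    steps = [int(re.search(r"ckpt-(\d+)\.index$", p).group(1)) for p in paths]
    return f"model.ckpt-{max(steps)}"

print(latest_checkpoint("my_task"))  # e.g. model.ckpt-208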

Use the associated dockerpredict.sh file as a template for the command.

Viewing Predictions

The get_results.py file (shown at the top of this document) will concatenate your original test.tsv with the test_results.tsv from BERT and print the 50 examples with the highest class-1 scores. Feel free to adjust this script as necessary. It requires pandas and fire, and can be run with:

python get_results.py {outcome}

Where outcome is whatever you've named your classification task.
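
get_results.py only ranks by the class-1 score; if your task has more than two classes, a small variation (same assumptions about file locations) attaches the highest-scoring class to each row instead:

import pandas as pd

outcome = "my_task"  # hypothetical task name
scores = pd.read_csv(f"./bert_output/{outcome}/test_results.tsv", sep="\t", header=None)
texts = pd.read_csv(f"./data/{outcome}/test.tsv", sep="\t")

# each score column corresponds to one class, so the argmax across columns is the prediction
texts["predicted"] = scores.values.argmax(axis=1)
print(texts.head())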

Helpful Resources

BERT Repo: https://github.com/google-research/bert (the README there also lists the other pre-trained models you can swap in)

predict.sh:

python bert/run_classifier.py \
--task_name=cola \
--do_predict=true \
--data_dir=$BERT_DATA_DIR \
--vocab_file=$BERT_BASE_DIR/vocab.txt \
--bert_config_file=$BERT_BASE_DIR/bert_config.json \
--init_checkpoint=$TRAINED_CLASSIFIER \
--max_seq_length=128 \
--output_dir=$BERT_OUTPUT_DIR
train.sh:

python bert/run_classifier.py \
--task_name=cola \
--do_train=true \
--do_eval=true \
--data_dir=$BERT_DATA_DIR \
--vocab_file=$BERT_BASE_DIR/vocab.txt \
--bert_config_file=$BERT_BASE_DIR/bert_config.json \
--init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
--max_seq_length=128 \
--train_batch_size=32 \
--learning_rate=2e-5 \
--num_train_epochs=3.0 \
--output_dir=$BERT_OUTPUT_DIR