This is mostly cribbed from this article: https://medium.com/@javedqadruddin/a-guide-to-simple-text-classification-with-bert-4d4b33f5491b
If you've got a GPU:

```bash
docker pull tensorflow/tensorflow:1.11.0-gpu-py3
```

If testing locally, pull the CPU image instead (and replace the image name accordingly in the commands below):

```bash
docker pull tensorflow/tensorflow:1.11.0-py3
```
Create a new project folder and clone the BERT repo there:

```bash
git clone https://github.com/google-research/bert.git
```

This guide assumes you're using the base-uncased model, but you can use the others as well (see the list at the end of this document):

```bash
wget https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip
unzip uncased_L-12_H-768_A-12.zip
```
Create a `data` directory inside your project folder. BERT requires data in a specific format for `train.tsv` and `dev.tsv`, and a separate format for `test.tsv` (predictions).
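From the project root:

```bash
mkdir data
```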
Format of the train and dev files, from the aforementioned article:
- Column 1: an ID for the row (can be just a count, or even the same number or letter for every row if you don't care to keep track of each individual example).
- Column 2: the label for the row, as an int. These are the classification labels your classifier aims to predict.
- Column 3: a column of all the same letter. This is a throwaway column that you need to include because the BERT code expects it.
- Column 4: the text examples you want to classify.
For example (tab-separated, no header):

```
1	0	a	an example of text that should fit in class 0
2	1	a	an example of text that should fit in class 1
3	0	a	another class 0 example
4	2	a	a class 2 example
```
For test / prediction:
- Column 1: an ID for each example, similar to column 1 in the train and dev files.
- Column 2: the text you want to classify.

Note that `test.tsv` should have a header line (whereas train and dev should not):
```
id	sentence
1	my first test example
2	another test example. Yay this is fun!
3	yet another test example
```
Pandas can get your data into this format, and you can use the following to export it as a `.tsv`:

```python
df.to_csv('data/{train/dev/test}.tsv', sep='\t', index=False, header=False)
# if you are creating test.tsv, set header=True instead of False
```
Helpful hints:
- If you're using the uncased model, be sure to `.lower()` your strings.
- Also be sure to replace newlines in your strings; newlines can mess up the `.tsv` export.
- A good way to do all the preprocessing in one pass:

```python
# lowercase, then collapse all whitespace (including newlines) to single spaces
df[x_column] = df[x_column].str.lower().str.replace(r"\s+", " ", regex=True)
```
Use the associated `dockertrain.sh` file as a template for the training command. Be sure you're mounting the `train.sh` and `predict.sh` files inside the container.
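For reference, here's a minimal sketch of what the training invocation inside the container might look like. The `/workspace` paths, the hyperparameters, and `--task_name=cola` (the CoLA processor matches the four-column format above) are assumptions to adapt to your setup, not the contents of the actual script:

```bash
# Sketch only -- paths and hyperparameters are assumptions; adjust to your setup.
export BERT_BASE_DIR=/workspace/uncased_L-12_H-768_A-12
export BERT_DATA_DIR=/workspace/data
export BERT_OUTPUT_DIR=/workspace/output

python /workspace/bert/run_classifier.py \
  --task_name=cola \
  --do_train=true \
  --do_eval=true \
  --data_dir=$BERT_DATA_DIR \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
  --max_seq_length=128 \
  --train_batch_size=32 \
  --learning_rate=2e-5 \
  --num_train_epochs=3.0 \
  --output_dir=$BERT_OUTPUT_DIR
```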
The prediction step will run on the `test.tsv` file in the data directory and output the predictions as `test_results.tsv` to your `BERT_OUTPUT_DIR`.
You'll have to manually look up the last checkpoint for the specific model you trained, and set it as the `TRAINED_CLASSIFIER` environment variable. This variable doesn't point at a specific file, but at the prefix of a checkpoint name, something like `model.ckpt-208`, not the `.index` or `.meta` files.
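One way to find it: TensorFlow writes a plain-text `checkpoint` file in the output directory whose first line names the latest checkpoint prefix (the `208` step number below is just an example):

```bash
# The first line looks like: model_checkpoint_path: "model.ckpt-208"
head -1 $BERT_OUTPUT_DIR/checkpoint

# Set the prefix (directory + checkpoint name, no extension) accordingly:
export TRAINED_CLASSIFIER=$BERT_OUTPUT_DIR/model.ckpt-208
```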
Use the associated `dockerpredict.sh` file as a template for the command.
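As with training, here's a rough sketch of the prediction invocation inside the container (same assumed `/workspace` paths as the training sketch above, not the actual script contents):

```bash
# Sketch only -- assumes the same /workspace layout as the training sketch.
python /workspace/bert/run_classifier.py \
  --task_name=cola \
  --do_predict=true \
  --data_dir=$BERT_DATA_DIR \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --init_checkpoint=$TRAINED_CLASSIFIER \
  --max_seq_length=128 \
  --output_dir=$BERT_OUTPUT_DIR
```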
The `get_results.py` file will concatenate your original `test.tsv` with the `test_results.tsv` from BERT and print out the top 50 highest-scoring predictions. Feel free to adjust this script as necessary. It requires `pandas` and `fire`, and can be run with:

```bash
python get_results.py {outcome}
```

where `{outcome}` is whatever you've named your classification task.
Other Models:
- BERT-Base, Uncased (used in this document)
- BERT-Large, Uncased
- BERT-Base, Cased
- BERT-Large, Cased
BERT Repo: https://github.com/google-research/bert