Last active
June 20, 2021 12:54
{
 "cells": [
  {
   "source": [
    "Get a list of paths to every `.txt` file in our *oscar_it* directory, searching subdirectories recursively."
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pathlib import Path\n",
    "\n",
    "# Recursively collect every .txt file under the oscar_it directory\n",
    "paths = [str(x) for x in Path('../../data/text/oscar_it').glob('**/*.txt')]"
   ]
  },
  {
   "source": [
    "Now we move on to training the tokenizer. We use a byte-level byte-pair encoding (BPE) tokenizer. It builds the vocabulary from an alphabet of single bytes, so every word can be decomposed into known tokens and nothing has to fall back to an unknown token."
   ],
   "cell_type": "markdown",
   "metadata": {}
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "from tokenizers import ByteLevelBPETokenizer\n",
    "\n",
    "tokenizer = ByteLevelBPETokenizer()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Train on the first five files only; pass `paths` to use the whole corpus\n",
    "tokenizer.train(files=paths[:5], vocab_size=30_522, min_frequency=2,\n",
    "                special_tokens=['<s>', '<pad>', '</s>', '<unk>', '<mask>'])"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "ML",
   "language": "python",
   "name": "ml"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
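The notebook says a byte-level BPE vocabulary starts from single bytes, so any word can be decomposed into tokens. The sketch below illustrates that idea in plain Python; it is a toy, not how the `tokenizers` library is implemented. The names `train_toy_byte_bpe`, `merge_pair` and `encode` are hypothetical helpers introduced here, and the toy skips the pre-tokenization and byte-to-character mapping that a real byte-level tokenizer applies. Its `min_frequency` argument only loosely mirrors the parameter of the same name passed to `tokenizer.train` in the notebook.

```python
from collections import Counter


def merge_pair(seq, left, right):
    """Replace each adjacent (left, right) pair in seq with the joined token."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and seq[i] == left and seq[i + 1] == right:
            out.append(left + right)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def train_toy_byte_bpe(texts, num_merges, min_frequency=2):
    """Learn up to num_merges merge rules, starting from single UTF-8 bytes."""
    # Every text starts as a sequence of one-byte tokens.
    sequences = [[bytes([b]) for b in t.encode("utf-8")] for t in texts]
    merges = []
    for _ in range(num_merges):
        # Count adjacent token pairs across the whole toy corpus.
        pair_counts = Counter()
        for seq in sequences:
            pair_counts.update(zip(seq, seq[1:]))
        if not pair_counts:
            break
        (left, right), count = pair_counts.most_common(1)[0]
        # Stop once the most frequent pair is too rare to be worth merging.
        if count < min_frequency:
            break
        merges.append((left, right))
        sequences = [merge_pair(seq, left, right) for seq in sequences]
    return merges


def encode(text, merges):
    """Split text into bytes, then apply the learned merges in order."""
    seq = [bytes([b]) for b in text.encode("utf-8")]
    for left, right in merges:
        seq = merge_pair(seq, left, right)
    return seq


if __name__ == "__main__":
    merges = train_toy_byte_bpe(["ciao ciao ciao", "città"], num_merges=10)
    # Text never seen in training still encodes: unmerged bytes stay as
    # single-byte tokens, so no unknown-token fallback is needed.
    print(encode("perché 🙂", merges))
```

Because the base alphabet holds all 256 byte values, joining the output tokens always reproduces the UTF-8 bytes of the input, whatever the text contains.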