@thsno02 · Last active March 22, 2023
Load LLaMATokenizer

Related Issues

Prerequisites

ENV

conda activate llama
conda install python=3.9
pip install sentencepiece
pip install torch
pip install fairscale
pip install fire
pip install git+https://github.com/huggingface/transformers.git
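
To verify the environment before going further, a quick sanity check (this assumes the installs above succeeded; LlamaTokenizer was only on the transformers main branch at the time, hence the git install):

import sentencepiece
import torch
import transformers

print(transformers.__version__)
print(hasattr(transformers, "LlamaTokenizer"))  # should print True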

FILES

Folder Structure:

.
├── llama
│   ├── __init__.py
│   ├── config.json
│   ├── generation.py
│   ├── generation_config.json
│   ├── model.py
│   ├── special_tokens_map.json
│   ├── tokenizer.model
│   └── tokenizer.py
└── token.ipynb
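
Before loading, it can help to confirm the tokenizer files are where from_pretrained will look. A minimal sketch; the ./llama path is assumed from the tree above:

from pathlib import Path

model_path = Path("./llama")  # assumed from the folder structure above
for name in ["tokenizer.model", "special_tokens_map.json", "config.json"]:
    print(name, (model_path / name).exists())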

Load Tokenizer

import transformers

device = "cpu"  # not needed for tokenization; kept for loading the model later
model_path = ""  # set to the folder that contains tokenizer.model, e.g. "./llama"
tokenizer = transformers.LlamaTokenizer.from_pretrained(model_path)
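
A quick smoke test on mixed English/Chinese input (the sample string is illustrative; the exact IDs depend on the vocab and tokenizer config):

text = "Hello, 世界"
print(tokenizer.tokenize(text))  # subword pieces
print(tokenizer.encode(text))    # token IDs (a BOS token may be prepended by default)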

Count Chinese Characters

import re

# CJK Unified Ideographs (basic block): covers the common Chinese characters
pattern = re.compile(r'[\u4e00-\u9fa5]')

def contains_chinese(text):
    # Check whether the string contains any Chinese characters
    return pattern.search(text) is not None

tokens = list(tokenizer.get_vocab().keys())
zh_tokens = []
cnt = 0

for t in tokens:
    if contains_chinese(t):
        cnt += 1
        zh_tokens.append(t)

print(cnt)
>>> 700
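
For context, the LLaMA vocabulary has 32,000 tokens, so roughly 2% of it contains Chinese characters. A short follow-up using tokens, cnt, and zh_tokens from the loop above:

print(len(tokens))        # 32000 for the LLaMA tokenizer
print(cnt / len(tokens))  # ~0.022, about 2% of the vocabulary
print(zh_tokens[:10])     # inspect a few of the matched tokens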