A Python script for splitting text into parts whose length is capped at a given number of tokens. The script uses the tiktoken library to encode and decode text.
Have you ever needed to split a long text into smaller parts with a specific token limit? The Text Splitter script is here to help! This Python script takes a text input, tokenizes it using the specified encoding, and splits it into parts, ensuring that each part does not exceed the given token limit. It then converts the tokenized parts back into human-readable text.
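The tokenize, chunk, and decode steps described above can be sketched roughly as follows. This is an illustration of the idea, not the gist's actual code. To keep it runnable without downloading a vocabulary, it uses ToyEncoding, a hypothetical stand-in that only mimics the encode/decode interface of a tiktoken encoding (one token per word). The name sketch_split is also an assumption.

```python
import re
from typing import Dict, List


class ToyEncoding:
    """Hypothetical stand-in for a tiktoken encoding: one token per word."""

    def __init__(self) -> None:
        self._ids: Dict[str, int] = {}
        self._pieces: List[str] = []

    def encode(self, text: str) -> List[int]:
        tokens = []
        # Each piece is a word with its leading whitespace, so decoding is lossless.
        for piece in re.findall(r"\s*\S+|\s+", text):
            if piece not in self._ids:
                self._ids[piece] = len(self._pieces)
                self._pieces.append(piece)
            tokens.append(self._ids[piece])
        return tokens

    def decode(self, tokens: List[int]) -> str:
        return "".join(self._pieces[t] for t in tokens)


def sketch_split(text: str, limit: int, encoding) -> List[str]:
    """Split text into parts of at most `limit` tokens each (assumed approach)."""
    if limit < 1:
        raise ValueError("limit must be a positive integer")
    tokens = encoding.encode(text)
    # Slice the token list into consecutive chunks of `limit` tokens.
    chunks = [tokens[i:i + limit] for i in range(0, len(tokens), limit)]
    # Turn each chunk of token ids back into text.
    return [encoding.decode(chunk) for chunk in chunks]


parts = sketch_split("one two three four five", 2, ToyEncoding())
```

Because the chunks are consecutive slices of one token list, joining the parts gives back the original text. With a real byte-level encoding, a chunk boundary may fall inside a multi-byte character, so individual parts are not guaranteed to decode cleanly.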
- Copy the text of the gist and save it on your drive as split_string.py (the file name must match the import used below), or clone it:
    git clone https://gist.github.com/17d9c8ab644bd2762acf6b19dd0cea39
- Install the required dependencies. The script relies on the tiktoken library, which can be installed using pip:
    pip install tiktoken
To use the text splitter module, follow these steps:
- Import the split_string_with_limit function from the split_string module:
    from split_string import split_string_with_limit
- Obtain an encoding using the tiktoken library. You can choose from different pre-trained encodings or create your own.
    import tiktoken
    encoding = tiktoken.get_encoding("cl100k_base")
- Pass the text you want to split, the token limit, and the encoding to the split_string_with_limit function. It returns a list of text parts.
    text = "This is a sample sentence for testing the string splitting function."
    limit = 5
    texts = split_string_with_limit(text, limit, encoding)
- Use the texts variable to access the split text parts as a list.
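Once you have the list of parts, a quick sanity check is to re-encode each part and confirm it stays within the limit. The sketch below assumes only that the encoding object has an encode method returning a list of tokens. CharEncoding is a hypothetical stand-in (one token per character) used so the example runs on its own, and parts_within_limit is an illustrative helper, not part of the gist.

```python
from typing import List


class CharEncoding:
    """Hypothetical stand-in for a tiktoken encoding: one token per character."""

    def encode(self, text: str) -> List[int]:
        return [ord(ch) for ch in text]


def parts_within_limit(parts: List[str], limit: int, encoding) -> bool:
    """Return True if every part re-encodes to at most `limit` tokens."""
    return all(len(encoding.encode(part)) <= limit for part in parts)


sample_parts = ["This ", "is a ", "test."]  # pretend output of a splitter
ok = parts_within_limit(sample_parts, 5, CharEncoding())
too_long = parts_within_limit(sample_parts, 4, CharEncoding())
```

With a real tokenizer, treat this as a rough check: re-encoding a decoded part is not guaranteed to reproduce the exact tokens it was built from, so counts near the limit may differ slightly.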
Here's an example usage of the Text Splitter script:
import tiktoken
from split_string import split_string_with_limit
# Obtain encoding
encoding = tiktoken.get_encoding("cl100k_base")
# Input text and token limit
text = "This is a sample sentence for testing the string splitting function."
limit = 5
# Split the text
texts = split_string_with_limit(text, limit, encoding)
# Print the split text parts
for part in texts:
    print(part)
Output:
This is a
sample sentence for
testing the string
splitting function.
This project is licensed under the MIT License - see the LICENSE file for details.