@buanzo
Last active December 28, 2023 10:12
# This is a work in progress. There are still bugs. Once it is production-ready this will become a full repo.
import os


def count_tokens(text, model_name="gpt-3.5-turbo", debug=False):
    """
    Count the number of tokens in a given text string without using the OpenAI API.

    This function tries three methods in the following order:
    1. tiktoken (preferred): Accurate token counting similar to the OpenAI API.
    2. nltk: Token counting using the Natural Language Toolkit library.
    3. split: Simple whitespace-based token counting as a fallback.

    Usage:
    ------
    text = "Your text here"
    result = count_tokens(text, model_name="gpt-3.5-turbo", debug=True)
    print(result)

    Required libraries:
    -------------------
    - tiktoken: Install with 'pip install tiktoken'
    - nltk: Install with 'pip install nltk'

    Parameters:
    -----------
    text : str
        The text string for which you want to count tokens.
    model_name : str, optional
        The OpenAI model for which you want to count tokens (default: "gpt-3.5-turbo").
    debug : bool, optional
        Set to True to print error messages (default: False).

    Returns:
    --------
    result : dict
        A dictionary containing the number of tokens and the method used for counting.
    """
    # Try using tiktoken
    try:
        import tiktoken
        encoding = tiktoken.encoding_for_model(model_name)
        num_tokens = len(encoding.encode(text))
        result = {"n_tokens": num_tokens, "method": "tiktoken"}
        return result
    except Exception as e:
        if debug:
            print(f"Error using tiktoken: {e}")

    # Try using nltk
    try:
        import nltk
        nltk.download("punkt")
        tokens = nltk.word_tokenize(text)
        result = {"n_tokens": len(tokens), "method": "nltk"}
        return result
    except Exception as e:
        if debug:
            print(f"Error using nltk: {e}")

    # If nltk and tiktoken fail, use a simple split-based method
    tokens = text.split()
    result = {"n_tokens": len(tokens), "method": "split"}
    return result


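class TokenBuffer:
    """Rolling text buffer that evicts the oldest chunks once max_tokens is exceeded."""

    def __init__(self, max_tokens=2048):
        self.max_tokens = max_tokens
        self.buffer = ""
        self.chunks = []         # text passed to each update() call, oldest first
        self.token_lengths = []  # token count of each chunk, in the same order
        self.token_count = 0

    def update(self, text, model_name="gpt-3.5-turbo", debug=False):
        new_tokens = count_tokens(text, model_name=model_name, debug=debug)["n_tokens"]
        self.chunks.append(text)
        self.token_lengths.append(new_tokens)
        self.token_count += new_tokens
        # Evict whole chunks from the front until the total fits again.
        # (Splitting the buffer on spaces would drop words, not tokens.)
        while self.token_count > self.max_tokens and self.chunks:
            self.chunks.pop(0)
            self.token_count -= self.token_lengths.pop(0)
        self.buffer = "".join(self.chunks)

    def get_buffer(self):
        return self.buffer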

buanzo commented Apr 11, 2023

The count_tokens function provides a robust way to count tokens in a text string, even when the required libraries are not available. Suppose we want to count the tokens in the following text string: "The quick brown fox jumps over the lazy dog."

We will pass debug=True to see what is going on.

Assuming the gist is saved as token_counter.py, we can import the count_tokens function and call it with our text string as follows:

from token_counter import count_tokens

text = "The quick brown fox jumps over the lazy dog."
result = count_tokens(text, debug=True)
print(result)

If all the required libraries are available, the function should return a dictionary with the number of tokens and the method used to count them. For example:

{'n_tokens': 9, 'method': 'tiktoken'}

If we uninstall the tiktoken library, the function will fall back to using nltk. For example:

[nltk_data] Downloading package punkt to /home/zob/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
{'n_tokens': 9, 'method': 'nltk'}

If we also uninstall the nltk library, the function will use the simple whitespace-based method as a fallback. For example:

Error using nltk: No module named 'nltk'
{'n_tokens': 9, 'method': 'split'}
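
The fallback order described above can be sketched without depending on any particular library. This is only an illustration, not the gist's code: count_with_fallback, broken_counter and whitespace_counter are hypothetical names, and the failing counter merely stands in for a tiktoken or nltk install that is missing.

```python
def whitespace_counter(text):
    """Last-resort counter: number of whitespace-separated words."""
    return len(text.split())


def broken_counter(text):
    """Stand-in for a counter whose library is not installed (hypothetical)."""
    raise ImportError("No module named 'some_tokenizer'")


def count_with_fallback(text, counters, debug=False):
    """Try each (name, counter) pair in order and return the first success."""
    for name, counter in counters:
        try:
            return {"n_tokens": counter(text), "method": name}
        except Exception as e:  # any failure moves on to the next method
            if debug:
                print(f"Error using {name}: {e}")
    raise RuntimeError("no counting method succeeded")


result = count_with_fallback(
    "The quick brown fox jumps over the lazy dog.",
    [("broken", broken_counter), ("split", whitespace_counter)],
)
print(result)  # {'n_tokens': 9, 'method': 'split'}
```

Because the result always reports the method that succeeded, callers can tell an accurate count from a rough word count.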


buanzo commented Apr 11, 2023

To use the TokenBuffer class, you can create an instance with an optional max_tokens argument and then call the update() method to add text. The get_buffer() method returns the current buffer.

tb = TokenBuffer(max_tokens=2048)
tb.update("Some text to add to the buffer.")
print(tb.get_buffer())

The update() method keeps the tracked token count at or below max_tokens by dropping the oldest text from the beginning of the buffer whenever the limit is exceeded.
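
One way to picture the eviction step is to store each added chunk next to its token count and drop whole chunks from the front. The sketch below is an illustration under stated assumptions, not the class above: RollingChunks is a hypothetical name, and it counts whitespace-separated words instead of model tokens so that it runs without tiktoken or nltk.

```python
from collections import deque


class RollingChunks:
    """Keeps the most recent chunks whose combined count fits in max_tokens."""

    def __init__(self, max_tokens):
        self.max_tokens = max_tokens
        self.chunks = deque()  # (text, n_tokens) pairs, oldest first
        self.token_count = 0

    def update(self, text):
        n = len(text.split())  # stand-in counter: words, not model tokens
        self.chunks.append((text, n))
        self.token_count += n
        # Drop the oldest chunks until the total fits again.
        while self.token_count > self.max_tokens and self.chunks:
            _, removed = self.chunks.popleft()
            self.token_count -= removed

    def get_buffer(self):
        return "".join(text for text, _ in self.chunks)


rc = RollingChunks(max_tokens=8)
rc.update("one two three ")     # total 3
rc.update("four five six ")     # total 6
rc.update("seven eight nine ")  # total 9 exceeds 8, so the first chunk is dropped
print(rc.get_buffer().strip())  # four five six seven eight nine
print(rc.token_count)           # 6
```

Evicting whole chunks keeps the stored counts and the text in step with each other, at the cost of sometimes dropping more text than strictly necessary.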

Using a token buffer like the one implemented in the TokenBuffer class is useful for working with OpenAI's API for several reasons:

Token Limit: OpenAI's models have a maximum context length per API call, shared by the prompt and the completion (e.g., 4096 tokens for gpt-3.5-turbo at the time of writing). By using a token buffer, you can keep the text input within that limit and avoid errors when making API calls.

Cost Control: OpenAI's API pricing is based on the number of tokens processed. By maintaining a token buffer, you can keep track of the tokens used, helping you manage costs more effectively and avoid exceeding your budget.

Text Truncation: When dealing with long text inputs or a stream of text, using a token buffer can help you truncate or remove less relevant text while preserving the most recent or relevant information. This is particularly useful when working with conversational AI applications, where the latest information might be more important for generating appropriate responses.

Rate Limiting: OpenAI's API has rate limits based on tokens processed per minute. A token buffer caps the size of each request, which makes per-minute token usage easier to predict, although it does not throttle requests by itself.
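
For the rate-limiting point, a separate sliding-window tracker could sit next to the buffer. The following is a rough sketch rather than anything from the gist or from OpenAI's client: TokenRateWindow is a hypothetical helper, the limit of 1000 tokens per minute is a made-up value, and timestamps are passed in explicitly so the example is deterministic.

```python
from collections import deque


class TokenRateWindow:
    """Tracks tokens sent during the last `window_seconds` (hypothetical helper)."""

    def __init__(self, tokens_per_minute, window_seconds=60.0):
        self.limit = tokens_per_minute
        self.window = window_seconds
        self.events = deque()  # (timestamp, n_tokens) pairs, oldest first
        self.used = 0

    def _expire(self, now):
        # Forget requests that have left the window.
        while self.events and now - self.events[0][0] >= self.window:
            _, n = self.events.popleft()
            self.used -= n

    def try_send(self, n_tokens, now):
        """Record the request and return True if it fits, otherwise return False."""
        self._expire(now)
        if self.used + n_tokens > self.limit:
            return False
        self.events.append((now, n_tokens))
        self.used += n_tokens
        return True


window = TokenRateWindow(tokens_per_minute=1000)  # made-up limit
first = window.try_send(600, now=0.0)    # fits
second = window.try_send(600, now=10.0)  # would reach 1200, rejected
third = window.try_send(600, now=61.0)   # the first request has expired
print(first, second, third)  # True False True
```

In a real application the `now` argument would typically come from time.monotonic(), and a rejected request would be delayed and retried rather than dropped.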

Overall, using a token buffer like the TokenBuffer class is a practical way to manage tokens when working with OpenAI's API, helping you stay within token limits, control costs, and manage text inputs more effectively.
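
The token-limit and cost points come down to simple arithmetic on a token count. Here is a minimal sketch, assuming the 4096-token context mentioned above; remaining_for_completion and estimate_cost are hypothetical helpers, and the price used is a placeholder rather than an actual OpenAI rate.

```python
def remaining_for_completion(prompt_tokens, context_limit=4096):
    """Tokens left for the model's reply once the prompt is accounted for."""
    return max(context_limit - prompt_tokens, 0)


def estimate_cost(n_tokens, price_per_1k_tokens):
    """Estimated cost of n_tokens at a given price per 1,000 tokens."""
    return n_tokens / 1000 * price_per_1k_tokens


prompt_tokens = 3000
print(remaining_for_completion(prompt_tokens))                # 1096
print(estimate_cost(prompt_tokens, price_per_1k_tokens=0.5))  # 1.5 (placeholder price)
```

The prompt_tokens value could come from count_tokens(...)["n_tokens"] or from a TokenBuffer's token_count attribute.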


buanzo commented Apr 11, 2023

Example usage for TokenBuffer:

from token_counter import TokenBuffer

# Initialize a TokenBuffer with a maximum token count of 30
buffer = TokenBuffer(max_tokens=30)

# Add a sentence to the buffer
buffer.update("Hello, how are you doing?")
print(buffer.get_buffer())
print("Token count:", buffer.token_count)

# Add another sentence to the buffer
buffer.update("I'm doing well, thank you!")
print(buffer.get_buffer())
print("Token count:", buffer.token_count)

# Add a longer sentence to the buffer
buffer.update("I've been working on a project and making great progress.")
print(buffer.get_buffer())
print("Token count:", buffer.token_count)

# Add one more sentence to the buffer
buffer.update("That's great to hear, keep up the good work!")
print(buffer.get_buffer())
print("Token count:", buffer.token_count)

Output (YMMV):

Hello, how are you doing?
Token count: 6
Hello, how are you doing?I'm doing well, thank you!
Token count: 11
Hello, how are you doing?I'm doing well, thank you!I've been working on a project and making great progress.
Token count: 24
I'm doing well, thank you!I've been working on a project and making great progress.That's great to hear, keep up the good work!
Token count: 30


buanzo commented Apr 11, 2023

Tiktoken's github repo: https://github.com/openai/tiktoken


buanzo commented Apr 11, 2023

NLTK's github repo: https://github.com/nltk/nltk
