@johnidm
Created October 5, 2024 18:29
OpenAI - Calculating Token Counts and Estimating Costs

Use this code to quickly estimate the input-token cost of processing a dataset with the OpenAI API. Output tokens are billed separately and are not counted here.

dataset.csv

text
Music is a universal language that connects people across cultures.
Listening to music can improve your mood and reduce stress.
Classical music has a rich history and deeply influences modern genres.
Jazz is known for its improvisational style and complex harmonies.
Rock music emerged in the 1950s and became a cultural phenomenon.
Hip-hop combines rhythmic speech with beats and is a voice for social commentary.
Country music often tells stories of everyday life and love.
Electronic music has revolutionized the way we think about sound production.
Pop music dominates the charts with catchy melodies and broad appeal.
"Music therapy is used to aid mental, emotional, and physical healing."

main.py

import pandas as pd
import tiktoken

MODEL_NAME = "gpt-4o-mini"
COST_PER_1M_TOKENS = 0.150  # USD per 1M input tokens; check https://openai.com/api/pricing/ for current rates
DATASET_FILE = "dataset.csv"


def main():
    encoding = tiktoken.encoding_for_model(MODEL_NAME)

    df = pd.read_csv(DATASET_FILE)

    # Select the text column; drop empty rows so encode() only receives strings
    texts = df["text"].dropna().astype(str).tolist()

    token_counts = [len(encoding.encode(text)) for text in texts]
    total_tokens = sum(token_counts)

    estimated_cost = total_tokens * COST_PER_1M_TOKENS / 1_000_000

    print(f"Total tokens: {total_tokens}")
    print(f"Estimated cost: {estimated_cost:.8f} USD")


if __name__ == "__main__":
    main()
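
The cost formula in main.py can also be used to project the cost of a larger dataset from a small sample. The following is a minimal sketch under stated assumptions, not part of the original script: the helper names `estimate_cost_usd` and `extrapolate_tokens` are hypothetical, the sample token counts are made-up placeholders standing in for the `token_counts` list computed above, and the projection assumes the sample is representative of the full dataset.

```python
from statistics import mean


def estimate_cost_usd(total_tokens: int, usd_per_1m_tokens: float) -> float:
    """Convert a token count into USD at a per-million-token price."""
    return total_tokens * usd_per_1m_tokens / 1_000_000


def extrapolate_tokens(sample_token_counts: list[int], total_rows: int) -> int:
    """Project total tokens for total_rows rows from the sample's mean count."""
    if not sample_token_counts:
        return 0
    return round(mean(sample_token_counts) * total_rows)


# Hypothetical per-row token counts (placeholders, not measured values).
sample_counts = [12, 11, 13, 12, 12]

# Assumption: the full dataset has 1,000,000 rows similar to the sample.
projected_tokens = extrapolate_tokens(sample_counts, total_rows=1_000_000)

# 0.150 USD per 1M input tokens is the rate used in main.py.
projected_cost = estimate_cost_usd(projected_tokens, usd_per_1m_tokens=0.150)

print(f"Projected tokens: {projected_tokens}")
print(f"Projected cost: {projected_cost:.2f} USD")
```

Keeping the arithmetic in small functions makes it easy to swap in a different price, for example an output-token rate, without touching the tokenization code.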