OpenAI - Calculating Token Counts and Estimating Costs

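Use this script to quickly estimate what it will cost to process a dataset with the OpenAI API. It counts the tokens in each row with tiktoken and multiplies the total by the model's price per token. The estimate covers input tokens only.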

dataset.csv

text
Music is a universal language that connects people across cultures.
Listening to music can improve your mood and reduce stress.
Classical music has a rich history and deeply influences modern genres.
Jazz is known for its improvisational style and complex harmonies.
Rock music emerged in the 1950s and became a cultural phenomenon.
Hip-hop combines rhythmic speech with beats and is a voice for social commentary.
Country music often tells stories of everyday life and love.
Electronic music has revolutionized the way we think about sound production.
Pop music dominates the charts with catchy melodies and broad appeal.
"Music therapy is used to aid mental, emotional, and physical healing."

main.py

import pandas as pd
import tiktoken

MODEL_NAME = "gpt-4o-mini"
COST_PER_1M_TOKENS = 0.150  # USD per 1M input tokens; prices change, so check https://openai.com/api/pricing/
DATASET_FILE = "dataset.csv"


def main():
    encoding = tiktoken.encoding_for_model(MODEL_NAME)

    df = pd.read_csv(DATASET_FILE)

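    # Select the column to tokenize; drop empty cells, which pandas reads as
    # NaN (a float) and which encoding.encode() cannot handle.
    texts = df["text"].dropna().astype(str).tolist()

    # Count the tokens in each row with the model's tokenizer.
    token_counts = [len(encoding.encode(text)) for text in texts]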
    total_tokens = sum(token_counts)

    estimated_cost = total_tokens * COST_PER_1M_TOKENS / 1_000_000

    print(f"Total tokens: {total_tokens}")
    print(f"Estimated cost: {estimated_cost:.8f} USD")


if __name__ == "__main__":
    main()
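
The script above prices input tokens only. Responses are billed as well, typically at a different rate, so a fuller estimate adds an output term. The following is a minimal sketch of that arithmetic, not part of the original script: the function name `estimate_cost_usd`, the token counts, and the output price are hypothetical placeholders chosen for illustration. Take real prices from the pricing page.

```python
def estimate_cost_usd(total_tokens: int, price_per_1m_tokens: float) -> float:
    """Convert a token count into USD, given a price per 1M tokens."""
    return total_tokens * price_per_1m_tokens / 1_000_000


# Illustrative numbers only (assumptions, not measurements).
input_tokens = 2_000_000          # e.g. total_tokens as computed in main.py
expected_output_tokens = 500_000  # assumption: a guess at total response size

INPUT_PRICE_PER_1M = 0.150   # USD, the value used in main.py
OUTPUT_PRICE_PER_1M = 0.600  # USD, assumed placeholder; check the pricing page

input_cost = estimate_cost_usd(input_tokens, INPUT_PRICE_PER_1M)
output_cost = estimate_cost_usd(expected_output_tokens, OUTPUT_PRICE_PER_1M)
total_cost = input_cost + output_cost

print(f"Input cost:  {input_cost:.8f} USD")
print(f"Output cost: {output_cost:.8f} USD")
print(f"Total cost:  {total_cost:.8f} USD")
```

The output token count cannot be known before the requests are made, so treat that term as a rough upper or lower bound depending on the guess you plug in.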

johnidm/readme.md