Skip to content

Instantly share code, notes, and snippets.

@KobaKhit
Last active September 16, 2024 16:06
Show Gist options
  • Save KobaKhit/1ad2cd01cd54cfce1e606f0243e134b1 to your computer and use it in GitHub Desktop.
Save KobaKhit/1ad2cd01cd54cfce1e606f0243e134b1 to your computer and use it in GitHub Desktop.
Text Classification in Snowflake SQL

Classifying Text in Snowflake SQL

With LLMs becoming available in Snowflake as part of their Cortex suite of products in this piece we will explore what the experience is like when classifying text. First of all, Snowflake has native CLASSIFY_TEXT function that does exactly what it says when given a piece of text and an array of possible categories. Second, one could classify text using emebeddings (EMBED_TEXT_768) and similarity to possible categories calculated by one of the distance function like cosine similarity (VECTOR_COSINE_SIMILARITY). Finally, when going the embeddings + similarity route we could use a cross join with a categories table or create a column for each category's similarity score and then assign the greatest one. So we have three approaches.

    1. CLASSIFY_TEXT
    1. Embeddings + Similarity cross join
    1. Embeddings + Similarity columnar

In terms of runtime we have a clear winner which you can see from below benchmarks.


Table: Runtime in Seconds

Method 100 comments 1,000 comments 10,000 comments 50,000 comments
CLASSIFY_TEXT 4 45 203 1931
Embeddings + Similarity crossjoin 3 28 276 1377
Embeddings + Similarity columnar 2 3 8 33

Using embeddings + similarity columnar approach is obviously superior when it comes to runtime and consequently cost.

Now, when it comes to "accuracy" the story is not so straightforward. Inspecting 1,000 comments classified by CLASSIFY_TEXT and the embedding + similarity it is clear that both approaches are of comparable accuracy, but not in complete agreement. Thankfully CLASSIFY_TEXT seems to produce slightly more relevant classifications.

References

Below are the queries used which you can plug into your workflow right away and/or verify above results with your own dataset.

CLASSIFY_TEXT

SELECT comment,
    replace(SNOWFLAKE.CORTEX.classify_text(comment, [
    'OTHER ISSUES',
    'CANCEL SUBSCRIPTION',
    'CHANGE SUBSCRIPTION',
    'TECHNICAL ISSUES',
    'AUTHENTICATE THIRD PARTY PURCHASE',
    'BILLING INQUIRY',
    'FEEDBACK AND SUGGESTIONS',
    'PURCHASE SUBSCRIPTION',
    'LOG IN ASSISTANCE',
    'ID MEMBERSHIP',
    'BLACKOUTS'
])['label'],'"','') as snowflake_category,
from comments

Embeddings + Similarity columnar

with comments as (...)
/*
Here we start using embeddings to categorize the comments into most similar category
*/
comment_embeddings AS (
    SELECT 
        comment,
        SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m', comment) as comment_embedding
    FROM comments
),
category_similarities AS (
    SELECT 
        comment,
        VECTOR_COSINE_SIMILARITY(comment_embedding, SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m', 'OTHER ISSUES')) as OTHER_ISSUES,
        VECTOR_COSINE_SIMILARITY(comment_embedding, SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m', 'CANCEL SUBSCRIPTION')) as CANCEL_SUBSCRIPTION,
        VECTOR_COSINE_SIMILARITY(comment_embedding, SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m', 'CHANGE SUBSCRIPTION')) as CHANGE_SUBSCRIPTION,
        VECTOR_COSINE_SIMILARITY(comment_embedding, SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m', 'TECHNICAL ISSUES')) as TECHNICAL_ISSUES,
        VECTOR_COSINE_SIMILARITY(comment_embedding, SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m', 'AUTHENTICATE THIRD PARTY PURCHASE')) as AUTHENTICATE_THIRD_PARTY_PURCHASE,
        VECTOR_COSINE_SIMILARITY(comment_embedding, SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m', 'BILLING INQUIRY')) as BILLING_INQUIRY,
        VECTOR_COSINE_SIMILARITY(comment_embedding, SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m', 'FEEDBACK AND SUGGESTIONS')) as FEEDBACK_AND_SUGGESTIONS,
        VECTOR_COSINE_SIMILARITY(comment_embedding, SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m', 'PURCHASE SUBSCRIPTION')) as PURCHASE_SUBSCRIPTION,
        VECTOR_COSINE_SIMILARITY(comment_embedding, SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m', 'LOG IN ASSISTANCE')) as LOG_IN_ASSISTANCE,
        VECTOR_COSINE_SIMILARITY(comment_embedding, SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m', 'NBA ID MEMBERSHIP')) as NBA_ID_MEMBERSHIP,
        VECTOR_COSINE_SIMILARITY(comment_embedding, SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m', 'BLACKOUTS')) as BLACKOUTS
    FROM comment_embeddings
)
SELECT 
    comment,
    CASE GREATEST(OTHER_ISSUES, CANCEL_SUBSCRIPTION, CHANGE_SUBSCRIPTION, TECHNICAL_ISSUES, 
                  AUTHENTICATE_THIRD_PARTY_PURCHASE, BILLING_INQUIRY, FEEDBACK_AND_SUGGESTIONS, 
                  PURCHASE_SUBSCRIPTION, LOG_IN_ASSISTANCE, NBA_ID_MEMBERSHIP, BLACKOUTS)
        WHEN OTHER_ISSUES THEN 'OTHER ISSUES'
        WHEN CANCEL_SUBSCRIPTION THEN 'CANCEL SUBSCRIPTION'
        WHEN CHANGE_SUBSCRIPTION THEN 'CHANGE SUBSCRIPTION'
        WHEN TECHNICAL_ISSUES THEN 'TECHNICAL ISSUES'
        WHEN AUTHENTICATE_THIRD_PARTY_PURCHASE THEN 'AUTHENTICATE THIRD PARTY PURCHASE'
        WHEN BILLING_INQUIRY THEN 'BILLING INQUIRY'
        WHEN FEEDBACK_AND_SUGGESTIONS THEN 'FEEDBACK AND SUGGESTIONS'
        WHEN PURCHASE_SUBSCRIPTION THEN 'PURCHASE SUBSCRIPTION'
        WHEN LOG_IN_ASSISTANCE THEN 'LOG IN ASSISTANCE'
        WHEN NBA_ID_MEMBERSHIP THEN 'ID MEMBERSHIP'
        WHEN BLACKOUTS THEN 'BLACKOUTS'
    END as categorized_as,
    GREATEST(OTHER_ISSUES, CANCEL_SUBSCRIPTION, CHANGE_SUBSCRIPTION, TECHNICAL_ISSUES, 
             AUTHENTICATE_THIRD_PARTY_PURCHASE, BILLING_INQUIRY, FEEDBACK_AND_SUGGESTIONS, 
             PURCHASE_SUBSCRIPTION, LOG_IN_ASSISTANCE, NBA_ID_MEMBERSHIP, BLACKOUTS) as similarity
FROM category_similarities
ORDER BY comment DESC;

Embeddings + Similarity crossjoin

with comments as (...)
/*
Here we start using embeddings to categorize the comments into most similar category
*/
categories AS (
  -- categories to categorize comments into
    SELECT category
    FROM (VALUES
        ('OTHER ISSUES'),
        ('CANCEL SUBSCRIPTION'),
        ('CHANGE SUBSCRIPTION'),
        ('TECHNICAL ISSUES'),
        ('AUTHENTICATE THIRD PARTY PURCHASE'),
        ('BILLING INQUIRY'),
        ('FEEDBACK AND SUGGESTIONS'),
        ('PURCHASE SUBSCRIPTION'),
        ('LOG IN ASSISTANCE'),
        ('ID MEMBERSHIP'),
        ('BLACKOUTS')
    ) categories(category)
),
comment_embeddings AS (
  -- get embedings for comments
    SELECT 
        comment,
        SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m', comment) as comment_embedding
    FROM comments
),
category_embeddings AS (
  -- get embeddings for categories
    SELECT 
        category,
        SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m', category) as category_embedding
    FROM categories
),
similarities AS (
  -- calculate similarities between a comment and each category
    SELECT 
        c.comment,
        cat.category,
        VECTOR_COSINE_SIMILARITY(c.comment_embedding, cat.category_embedding) as similarity
    FROM comment_embeddings c
    CROSS JOIN category_embeddings cat
),
ranked_categories AS (
  -- rank categories per comment
    SELECT 
        comment,
        category,
        similarity,
        ROW_NUMBER() OVER (PARTITION BY comment ORDER BY similarity DESC) as rank
    FROM similarities
)
-- select categories most similar to comment
SELECT 
    comment,
    similarity
FROM ranked_categories
WHERE rank = 1
ORDER BY comment DESC;
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment