With LLMs becoming available in Snowflake as part of their Cortex suite of products in this piece we will explore what the experience is like when classifying text. First of all, Snowflake has native CLASSIFY_TEXT
function that does exactly what it says when given a piece of text and an array of possible categories. Second, one could classify text using emebeddings (EMBED_TEXT_768
) and similarity to possible categories calculated by one of the distance function like cosine similarity (VECTOR_COSINE_SIMILARITY
). Finally, when going the embeddings + similarity route we could use a cross join with a categories table or create a column for each category's similarity score and then assign the greatest one. So we have three approaches.
-
CLASSIFY_TEXT
-
- Embeddings + Similarity cross join
-
- Embeddings + Similarity columnar
In terms of runtime we have a clear winner which you can see from below benchmarks.
Table: Runtime in Seconds
Method | 100 comments | 1,000 comments | 10,000 comments | 50,000 comments |
---|---|---|---|---|
CLASSIFY_TEXT | 4 | 45 | 203 | 1931 |
Embeddings + Similarity crossjoin | 3 | 28 | 276 | 1377 |
Embeddings + Similarity columnar | 2 | 3 | 8 | 33 |
Using embeddings + similarity columnar approach is obviously superior when it comes to runtime and consequently cost.
Now, when it comes to "accuracy" the story is not so straightforward. Inspecting 1,000 comments classified by CLASSIFY_TEXT
and the embedding + similarity it is clear that both approaches are of comparable accuracy, but not in complete agreement. Thankfully CLASSIFY_TEXT
seems to produce slightly more relevant classifications.
References
Below are the queries used which you can plug into your workflow right away and/or verify above results with your own dataset.
CLASSIFY_TEXT
SELECT comment,
replace(SNOWFLAKE.CORTEX.classify_text(comment, [
'OTHER ISSUES',
'CANCEL SUBSCRIPTION',
'CHANGE SUBSCRIPTION',
'TECHNICAL ISSUES',
'AUTHENTICATE THIRD PARTY PURCHASE',
'BILLING INQUIRY',
'FEEDBACK AND SUGGESTIONS',
'PURCHASE SUBSCRIPTION',
'LOG IN ASSISTANCE',
'ID MEMBERSHIP',
'BLACKOUTS'
])['label'],'"','') as snowflake_category,
from comments
Embeddings + Similarity columnar
with comments as (...)
/*
Here we start using embeddings to categorize the comments into most similar category
*/
comment_embeddings AS (
SELECT
comment,
SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m', comment) as comment_embedding
FROM comments
),
category_similarities AS (
SELECT
comment,
VECTOR_COSINE_SIMILARITY(comment_embedding, SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m', 'OTHER ISSUES')) as OTHER_ISSUES,
VECTOR_COSINE_SIMILARITY(comment_embedding, SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m', 'CANCEL SUBSCRIPTION')) as CANCEL_SUBSCRIPTION,
VECTOR_COSINE_SIMILARITY(comment_embedding, SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m', 'CHANGE SUBSCRIPTION')) as CHANGE_SUBSCRIPTION,
VECTOR_COSINE_SIMILARITY(comment_embedding, SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m', 'TECHNICAL ISSUES')) as TECHNICAL_ISSUES,
VECTOR_COSINE_SIMILARITY(comment_embedding, SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m', 'AUTHENTICATE THIRD PARTY PURCHASE')) as AUTHENTICATE_THIRD_PARTY_PURCHASE,
VECTOR_COSINE_SIMILARITY(comment_embedding, SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m', 'BILLING INQUIRY')) as BILLING_INQUIRY,
VECTOR_COSINE_SIMILARITY(comment_embedding, SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m', 'FEEDBACK AND SUGGESTIONS')) as FEEDBACK_AND_SUGGESTIONS,
VECTOR_COSINE_SIMILARITY(comment_embedding, SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m', 'PURCHASE SUBSCRIPTION')) as PURCHASE_SUBSCRIPTION,
VECTOR_COSINE_SIMILARITY(comment_embedding, SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m', 'LOG IN ASSISTANCE')) as LOG_IN_ASSISTANCE,
VECTOR_COSINE_SIMILARITY(comment_embedding, SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m', 'NBA ID MEMBERSHIP')) as NBA_ID_MEMBERSHIP,
VECTOR_COSINE_SIMILARITY(comment_embedding, SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m', 'BLACKOUTS')) as BLACKOUTS
FROM comment_embeddings
)
SELECT
comment,
CASE GREATEST(OTHER_ISSUES, CANCEL_SUBSCRIPTION, CHANGE_SUBSCRIPTION, TECHNICAL_ISSUES,
AUTHENTICATE_THIRD_PARTY_PURCHASE, BILLING_INQUIRY, FEEDBACK_AND_SUGGESTIONS,
PURCHASE_SUBSCRIPTION, LOG_IN_ASSISTANCE, NBA_ID_MEMBERSHIP, BLACKOUTS)
WHEN OTHER_ISSUES THEN 'OTHER ISSUES'
WHEN CANCEL_SUBSCRIPTION THEN 'CANCEL SUBSCRIPTION'
WHEN CHANGE_SUBSCRIPTION THEN 'CHANGE SUBSCRIPTION'
WHEN TECHNICAL_ISSUES THEN 'TECHNICAL ISSUES'
WHEN AUTHENTICATE_THIRD_PARTY_PURCHASE THEN 'AUTHENTICATE THIRD PARTY PURCHASE'
WHEN BILLING_INQUIRY THEN 'BILLING INQUIRY'
WHEN FEEDBACK_AND_SUGGESTIONS THEN 'FEEDBACK AND SUGGESTIONS'
WHEN PURCHASE_SUBSCRIPTION THEN 'PURCHASE SUBSCRIPTION'
WHEN LOG_IN_ASSISTANCE THEN 'LOG IN ASSISTANCE'
WHEN NBA_ID_MEMBERSHIP THEN 'ID MEMBERSHIP'
WHEN BLACKOUTS THEN 'BLACKOUTS'
END as categorized_as,
GREATEST(OTHER_ISSUES, CANCEL_SUBSCRIPTION, CHANGE_SUBSCRIPTION, TECHNICAL_ISSUES,
AUTHENTICATE_THIRD_PARTY_PURCHASE, BILLING_INQUIRY, FEEDBACK_AND_SUGGESTIONS,
PURCHASE_SUBSCRIPTION, LOG_IN_ASSISTANCE, NBA_ID_MEMBERSHIP, BLACKOUTS) as similarity
FROM category_similarities
ORDER BY comment DESC;
Embeddings + Similarity crossjoin
with comments as (...)
/*
Here we start using embeddings to categorize the comments into most similar category
*/
categories AS (
-- categories to categorize comments into
SELECT category
FROM (VALUES
('OTHER ISSUES'),
('CANCEL SUBSCRIPTION'),
('CHANGE SUBSCRIPTION'),
('TECHNICAL ISSUES'),
('AUTHENTICATE THIRD PARTY PURCHASE'),
('BILLING INQUIRY'),
('FEEDBACK AND SUGGESTIONS'),
('PURCHASE SUBSCRIPTION'),
('LOG IN ASSISTANCE'),
('ID MEMBERSHIP'),
('BLACKOUTS')
) categories(category)
),
comment_embeddings AS (
-- get embedings for comments
SELECT
comment,
SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m', comment) as comment_embedding
FROM comments
),
category_embeddings AS (
-- get embeddings for categories
SELECT
category,
SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m', category) as category_embedding
FROM categories
),
similarities AS (
-- calculate similarities between a comment and each category
SELECT
c.comment,
cat.category,
VECTOR_COSINE_SIMILARITY(c.comment_embedding, cat.category_embedding) as similarity
FROM comment_embeddings c
CROSS JOIN category_embeddings cat
),
ranked_categories AS (
-- rank categories per comment
SELECT
comment,
category,
similarity,
ROW_NUMBER() OVER (PARTITION BY comment ORDER BY similarity DESC) as rank
FROM similarities
)
-- select categories most similar to comment
SELECT
comment,
similarity
FROM ranked_categories
WHERE rank = 1
ORDER BY comment DESC;