On a scale from 1 to 10, how good are your data warehousing skills? Want to go above 7/10? Then this article is for you.
How good is your SQL? Want to get ready for a job interview asap? This blog post explains the most intricate data warehouse SQL techniques in high detail. I will use the BigQuery standard SQL dialect to scribble down a few thoughts on this.
Updating tables is important. It is important indeed. The ideal situation is when your transactions have a PRIMARY key that is a unique, auto-incrementing integer. The table update in this case is simple:
insert target_table (transaction_id)
select transaction_id from source_table where transaction_id > (select max(transaction_id) from target_table);
That is not always the case when working with denormalized star-schema datasets in modern data warehouses. You might be tasked to create sessions with SQL and/or incrementally update datasets with just a portion of the data. A transaction_id might not exist; instead you will have to deal with a data model where the unique key depends on the latest transaction_id (or timestamp) known. For example, user_id in a last_online dataset depends on the latest known connection timestamp. In this case you would want to update existing users and insert the new ones.
You can use MERGE, or you can split the operation into two actions: one to update existing records with new data and one to insert completely new records that don't exist yet (a LEFT JOIN situation, sketched after the MERGE example below).
MERGE is a statement that is generally used in relational databases. The Google BigQuery MERGE command is one of the Data Manipulation Language (DML) statements. It is often used to perform three main operations atomically in one single statement: UPDATE, INSERT, and DELETE.
- The UPDATE or DELETE clause can be used when a source row matches a target row.
- The INSERT clause can be used when a source row has no matching target row.
- The UPDATE or DELETE clause can also be used when a target row has no match in the source.
This means that the Google BigQuery MERGE command enables you to merge Google BigQuery data by updating, inserting, and deleting rows in your Google BigQuery tables.
Consider this SQL:
create temp table last_online as (
select 1 as user_id
, timestamp('2000-10-01 00:00:01') as last_online
)
;
create temp table connection_data (
user_id int64
,timestamp timestamp
)
PARTITION BY DATE(_PARTITIONTIME)
;
insert connection_data (user_id, timestamp)
select 2 as user_id
, timestamp_sub(current_timestamp(),interval 28 hour) as timestamp
union all
select 1 as user_id
, timestamp_sub(current_timestamp(),interval 28 hour) as timestamp
union all
select 1 as user_id
, timestamp_sub(current_timestamp(),interval 20 hour) as timestamp
union all
select 1 as user_id
, timestamp_sub(current_timestamp(),interval 1 hour) as timestamp
;
-- update users that already exist in the target and insert the new ones
merge last_online t
using (
select
user_id
, last_online
from
(
select
user_id
, max(timestamp) as last_online
from
connection_data
where
date(_partitiontime) >= date_sub(current_date(), interval 1 day)
group by
user_id
) y
) s
on t.user_id = s.user_id
when matched then
update set last_online = s.last_online, user_id = s.user_id
when not matched then
insert (last_online, user_id) values (last_online, user_id)
;
select * from last_online
;
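For reference, here is a rough sketch of the two-step alternative mentioned earlier, assuming the same last_online and connection_data temp tables: an UPDATE for users that already exist, followed by an INSERT with a LEFT JOIN for the completely new ones.
-- Step 1: update users that already exist in the target.
update last_online t
set last_online = s.last_online
from (
    select user_id, max(timestamp) as last_online
    from connection_data
    group by user_id
) s
where t.user_id = s.user_id
;
-- Step 2: insert users that don't exist in the target yet (the LEFT JOIN part).
insert last_online (user_id, last_online)
select s.user_id, s.last_online
from (
    select user_id, max(timestamp) as last_online
    from connection_data
    group by user_id
) s
left join last_online t on t.user_id = s.user_id
where t.user_id is null
;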
Doing UNNEST() and checking whether the word you need is in a list might be useful in many situations, e.g. data warehouse sentiment analysis:
with titles as (
select 'Title with word foo' as title union all
select 'Title with word bar'
)
, data as (
select
title,
split(title, ' ') as words
from
titles
)
select * from data, unnest(words) words
where
words in ('bar')
This gives us an opportunity to save some lines of code and be more eloquent code-wise. Normally you would want to put this into a sub-query and add a filter in the outer where clause, but you can do this instead:
with daily_revenue as (
select
current_date() as dt
, 100 as revenue
union all
select
date_sub(current_date(), interval 1 day) as dt
, 100 as revenue
)
select
*
from daily_revenue
where
if(revenue >101,1,0) = 1
;
Another example shows how NOT to use it with sharded (wildcard) tables. Don't do this. It is a bad example because, since the matching table suffixes are probably determined dynamically (based on something in your table), you will be charged for a full table scan.
SELECT *
FROM `firebase.events_*`
WHERE IF(condition,
_TABLE_SUFFIX BETWEEN '20170101' AND '20170117',
_TABLE_SUFFIX BETWEEN '20160101' AND '20160117');
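A sketch of the safer pattern (assuming the same firebase.events_* shards): keep the _TABLE_SUFFIX filter as a simple static range so BigQuery can prune shards before scanning, and apply the rest of your conditions separately.
SELECT *
FROM `firebase.events_*`
WHERE _TABLE_SUFFIX BETWEEN '20170101' AND '20170117'
  AND condition; -- the same placeholder condition as above, applied after shard pruning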
You can also use it in the HAVING clause and in AGGREGATE functions.
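For example, here is a minimal sketch (the is_gift flag is added to the mocked data purely for illustration) with IF() inside an aggregate function and inside the HAVING clause:
with daily_revenue as (
    select current_date() as dt, 100 as revenue, true as is_gift
    union all
    select date_sub(current_date(), interval 1 day) as dt, 50 as revenue, false as is_gift
)
select
    dt
    , sum(if(is_gift, revenue, 0)) as gift_revenue -- IF() inside an aggregate function
from daily_revenue
group by dt
having if(sum(revenue) >= 100, true, false) -- IF() in the HAVING clause
;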
The ROLLUP function is used to perform aggregation at multiple levels.
This is useful when you have to work with dimension graphs.
The following query returns the total credit spend per day by context type, and it also shows the subtotal for each day as well as the grand total across all the dates available.
with data as (
select
current_timestamp() as ts
,'stage' as context_type
,1 as user_id
,100 as credit_value
, true as is_gift
union all
select
timestamp_sub(current_timestamp(), interval 24 hour) as ts
,'user' as context_type
,1 as user_id
,200 as credit_value
,false as is_gift
union all
select
timestamp_sub(current_timestamp(), interval 24*2 hour) as ts
,'user' as context_type
,3 as user_id
,300 as credit_value
,true as is_gift
)
, results as (
select
date(ts) as date
,context_type
,sum(credit_value)/100 as daily_credits_spend
from data
group by rollup(1, context_type) -- 1 refers to the first column in the select, date(ts)
order by 1
)
select
date
,if(context_type is null, 'total', context_type) as context_type
,daily_credits_spend
from results
order by date
;
Imagine you are required to convert your table into a JSON object where each record is an element of a nested array. This is where the to_json_string() function becomes useful:
with mytable as (
select 1 as x, 'foo' as y, true as z union all
select 2, 'bar', false
)
select
concat("{", "\"MyTable\":", "[", string_agg(to_json_string(t), ","), "]", "}")
from mytable as t
;
Then you can use it anywhere: dates, marketing funnels, indices, histogram graphs, etc.
Given user_id, date and total_cost columns: for EACH date, how do you show the total revenue value for EACH customer while keeping all the rows?
You can achieve this like so:
select
date
,user_id
,total_cost
,sum(total_cost) over (partition by date,user_id) as revenue_per_day
from production.payment_transaction
;
Very often BI developers are tasked to add a moving average to their reports and fantastic dashboards. This might be a 7-, 14-, 30-day/month or even year MA line graph. So how do we do it?
with dates as (
select
dt
from
unnest(generate_date_array(date_sub(current_date(), interval 90 day), current_date(), interval 1 day)) as dt
)
, data as (
select dt
, CEIL(RAND()*1000) as revenue -- just some random data.
from
dates
)
select
dt
, revenue
, AVG(revenue) OVER(ORDER BY unix_date(dt) RANGE BETWEEN 6 PRECEDING AND CURRENT ROW) as seven_day_moving_average
from data
;
This becomes really handy when you work with user retention or want to check a dataset for missing values, e.g. dates. BigQuery has a function called GENERATE_DATE_ARRAY:
select
dt
from
unnest(generate_date_array('2019-12-04', '2020-09-17', interval 1 day)) as dt
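As a quick sketch of the missing-values use case (the daily_revenue rows below are mocked for illustration), LEFT JOIN your data onto the generated calendar and keep the dates with no rows:
with calendar as (
    select dt
    from unnest(generate_date_array('2019-12-04', '2019-12-10', interval 1 day)) as dt
)
, daily_revenue as (
    select date '2019-12-04' as dt, 100 as revenue union all
    select date '2019-12-06' as dt, 150 as revenue
)
select c.dt as missing_date
from calendar c
left join daily_revenue d on d.dt = c.dt
where d.dt is null -- dates with no data at all
order by c.dt
;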
QUALIFY lets you filter on the result of a window function without an extra sub-query, for example to keep only the latest matching row from table_b for each id:
SELECT *
FROM table_a a
LEFT JOIN table_b b
ON a.id = b.id
WHERE b.date < a.date
QUALIFY ROW_NUMBER() OVER (PARTITION BY b.id ORDER BY b.date desc) = 1
Another numbering function. It is really useful to monitor things like login duration in seconds if you have a mobile app. For example, I have my app connected to Firebase, and when users login I can see how long it took them.
This function divides the rows into constant_integer_expression buckets based on row ordering and returns the 1-based bucket number that is assigned to each row. The number of rows in the buckets can differ by at most 1. The remainder values (the remainder of the number of rows divided by buckets) are distributed one for each bucket, starting with bucket 1. If constant_integer_expression evaluates to NULL, 0 or negative, an error is provided.
select (case when tile = 50 then 'median' when tile = 95 then '95%' else '5%' end) as tile
, dt
, max(cast( round(duration/1000) as numeric)/1000 ) max_duration_s
, min(cast( round(duration/1000) as numeric)/1000 ) min_duration_s
from (
select
trace_info.duration_us duration
, ntile(100) over (partition by (date(event_timestamp)) order by trace_info.duration_us) tile
, date(event_timestamp) dt
from firebase_performance.my_mobile_app
where
date(_partitiontime) >= parse_date('%Y%m%d', @ds_start_date) and date(_partitiontime) <= parse_date('%Y%m%d', @ds_end_date)
and
date(event_timestamp) >= parse_date('%Y%m%d', @ds_start_date)
and
date(event_timestamp) <= parse_date('%Y%m%d', @ds_end_date)
and lower(event_type) = "duration_trace"
and lower(event_name) = 'logon'
-- GROUP BY DATE(_PARTITIONTIME)
) x
WHERE tile in (5, 50, 95)
group by dt, tile
order by dt
;
They are also called numbering functions. I tend to use DENSE_RANK as the default ranking function as it doesn't skip the next available ranking, whereas RANK would. It returns consecutive rank values. You can use it with a partition, which divides the results into distinct buckets. Rows in each partition receive the same rank if they have the same values.
Example:
with top_spenders as (
select 1 as user_id, 100 as total_spend, 11 as reputation_level union all
select 2 as user_id, 250 as total_spend, 11 as reputation_level union all
select 3 as user_id, 250 as total_spend, 11 as reputation_level union all
select 4 as user_id, 300 as total_spend, 11 as reputation_level union all
select 11 as user_id, 1000 as total_spend, 22 as reputation_level union all
select 22 as user_id, 1500 as total_spend, 22 as reputation_level union all
select 33 as user_id, 1500 as total_spend, 22 as reputation_level union all
select 44 as user_id, 2500 as total_spend, 22 as reputation_level
)
select
user_id
, rank() over(partition by reputation_level order by total_spend desc) as rank
, dense_rank() over(partition by reputation_level order by total_spend desc) as dense_rank
from
top_spenders
;
Another example with product prices:
with products as (
select
2 as product_id
, 'premium_account' as product_type
, 100 as total_cost
union all
select
1 as product_id
, 'premium_group' as product_type
, 200 as total_cost
union all
select
111 as product_id
, 'bots' as product_type
, 300 as total_cost
union all
select
112 as product_id
, 'bots' as product_type
, 400 as total_cost
union all
select
113 as product_id
, 'bots' as product_type
, 500 as total_cost
union all
select
213 as product_id
, 'bots' as product_type
, 300 as total_cost
)
select * from (
select
product_id
, product_type
, total_cost as product_price
, dense_rank () over (
partition by product_type
order by total_cost desc
) price_rank
from
products
) t
where price_rank < 3
;
Pivot changes rows to columns. That's all it does. Unpivot does the opposite (there is a sketch of it after the PIVOT example below).
select * from
(
-- #1 from_item
select
extract(month from dt) as mo
,product_type
,revenue
from (
select
date(current_date()) as dt
, 'premium_account' as product_type
, 100 as revenue
union all
select
date_sub(current_date(), interval 1 month) as dt
, 'premium_group' as product_type
, 200 as revenue
union all
select
date_sub(current_date(), interval 2 month) as dt
, 'bots' as product_type
, 300 as revenue
)
)
pivot
(
-- #2 aggregate
avg(revenue) as avg_revenue_
-- #3 pivot_column
for product_type in ('premium_account', 'premium_group')
)
;
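And a minimal UNPIVOT sketch (the wide one-row table below is mocked for illustration) that turns per-product revenue columns back into rows:
select * from
(
    select 1 as mo, 100 as premium_account, 200 as premium_group
)
unpivot
(
    revenue for product_type in (premium_account, premium_group)
)
;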
FIRST_VALUE is another useful navigation function. Here it returns the top spend within each reputation level so every user can be compared against it:
with top_spenders as (
select 1 as user_id, 100 as total_spend, 11 as reputation_level union all
select 2 as user_id, 150 as total_spend, 11 as reputation_level union all
select 3 as user_id, 250 as total_spend, 11 as reputation_level union all
select 11 as user_id, 1000 as total_spend, 22 as reputation_level union all
select 22 as user_id, 1500 as total_spend, 22 as reputation_level union all
select 33 as user_id, 2500 as total_spend, 22 as reputation_level
)
, data as (
select
user_id
,total_spend
,reputation_level
,first_value(total_spend)
over (partition by reputation_level order by total_spend desc
rows between unbounded preceding and unbounded following) as top_spend
from top_spenders
)
select
user_id
,reputation_level
,total_spend
,top_spend as top_spend_by_rep_level
,total_spend - top_spend as delta_in_usd
from data
;
This is useful when you need to apply a user defined function (UDF) with some complex logic to each row of a table. You can always consider your table as an array of STRUCT objects and then pass each one of them to the UDF. It depends on your logic. For example, I use it to calculate purchase expire times:
select
target_id
,product_id
,product_type_id
,production.purchase_summary_udf(
ARRAY_AGG(
STRUCT(
target_id
, user_id
, product_type_id
, product_id
, item_count
, days
, expire_time_after_purchase
, transaction_id
, purchase_created_at
, updated_at
)
order by purchase_created_at
)
) AS processed
from new_batch
group by target_id, product_id, product_type_id
;
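The UDF itself is not shown in this article. Purely as a hypothetical sketch (the signature, field list and expiry logic below are assumptions, not the real production.purchase_summary_udf), such a function could look something like this:
-- Hypothetical sketch only: field list and expiry logic are assumptions.
create temp function purchase_summary_udf(
    purchases array<struct<product_id int64, days int64, purchase_created_at timestamp>>
)
returns array<struct<product_id int64, expire_time timestamp>>
as ((
    select array_agg(struct(
        p.product_id
        , timestamp_add(p.purchase_created_at, interval p.days day) as expire_time
    ))
    from unnest(purchases) as p
));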
In a similar way you can create tables without needing to use UNION ALL. For example, I use it to mock some test data for unit tests. This way you can do it very fast just by using Alt+Shift+Down in your editor.
select * from unnest([
struct
(
1 as user_id
, 111 as reputation
, timestamp('2021-12-16 13:00:01') as update_time
),
(
2 --as user_id
, 111 --as reputation
, timestamp('2011-12-16 13:00:01') --as update_time
),
(
3 --as user_id
, 111 --as reputation
, timestamp(format_timestamp("%Y-%m-%d 12:59:01 UTC" ,timestamp(date_sub(current_date(), interval 0 day)))) --as update_time
)
]
) as t
A good example might be marketing funnels. Your dataset might contain continuously repeating events of the same type, but ideally you would want to chain each event with the next one of a different type. This might be useful when you need to get a list of something, i.e. events, purchases, etc., in order to build a funnels dataset. Working with PARTITION BY gives you the opportunity to group all the following events no matter how many of them exist in each partition.
with d as (
select * from unnest([
struct('0003f' as user_pseudo_id, 12322175 as user_id, timestamp '2020-10-10 16:46:59.878 UTC' as event_timestamp, 'join_group' as event_name),
('0003',12,timestamp '2022-10-10 16:50:03.394 UTC','set_avatar'),
('0003',12,timestamp '2022-10-10 17:02:38.632 UTC','set_avatar'),
('0003',12,timestamp '2022-10-10 17:09:38.645 UTC','set_avatar'),
('0003',12,timestamp '2022-10-10 17:10:38.645 UTC','join_group'),
('0003',12,timestamp '2022-10-10 17:15:38.645 UTC','create_group'),
('0003',12,timestamp '2022-10-10 17:17:38.645 UTC','create_group'),
('0003',12,timestamp '2022-10-10 17:18:38.645 UTC','in_app_purchase'),
('0003',12,timestamp '2022-10-10 17:19:38.645 UTC','spend_virtual_currency'),
('0003',12,timestamp '2022-10-10 17:19:45.645 UTC','create_group'),
('0003',12,timestamp '2022-10-10 17:20:38.645 UTC','set_avatar')
]
) as t)
, event_data as (
SELECT
user_pseudo_id
, user_id
, event_timestamp
, event_name
, ARRAY_AGG(
STRUCT(
event_name AS event_name
, event_timestamp AS event_timestamp
)
)
OVER(PARTITION BY user_pseudo_id ORDER BY event_timestamp ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING ) as next_events
FROM d
WHERE
DATE(event_timestamp) = "2022-10-10"
)
select
user_pseudo_id
, user_id
, event_timestamp
, event_name
, (SELECT
event_name FROM UNNEST(next_events) next_event
WHERE t.event_name != event_name
ORDER BY event_timestamp LIMIT 1
-- change to ORDER BY event_timestamp desc if prev event needed
) next_event
, (SELECT
event_timestamp FROM UNNEST(next_events) next_event
WHERE t.event_name != event_name
ORDER BY event_timestamp LIMIT 1
-- change to ORDER BY event_timestamp desc if prev event needed
) next_event_ts
from event_data t
;
You would want to use it if you need to extract something from unstructured data, e.g. fx rates, custom groupings, etc.
Consider this example with exchange rates data:
-- One or more digits (\d+), optional period (\.?), zero or more digits (\d*).
with object as
(select '{"aed":3.6732,"afn":78.45934,"all":110.586428}' as rates)
, data as (
select "usd" as base_currency,
regexp_extract_all(rates, r'"[^"]+":\d+\.?\d*') as pair
from object
)
, splits as (
select base_currency, pair, split(pair, ':') positions
from data cross join unnest (pair) as pair
)
select base_currency, pair, positions[offset(0)] as rate_currency, positions[offset(1)] as rate
from splits
Sometimes you might want to use regexp to get the major, release or mod version for your app and create a custom report:
with events as (
select 'open_chat' as event_name, '10.1.0' as app_display_version union all
select 'open_chat' as event_name, '10.1.9' as app_display_version union all
select 'open_chat' as event_name, '9.1.4' as app_display_version union all
select 'open_chat' as event_name, '9.0.0' as app_display_version
)
select
app_display_version
,REGEXP_EXTRACT(app_display_version, '^[^.^]*') main_version
,safe_cast(REGEXP_EXTRACT(app_display_version, '[0-9]+.[0-9]+') as float64) release_version
,safe_cast(REGEXP_EXTRACT(app_display_version, r"^[a-zA-Z0-9_.+-]+.[a-zA-Z0-9-]+\.([a-zA-Z0-9-.]+$)") as int64) as mod_version
from events
;
SQL is a powerful tool that helps to manipulate data. Hopefully these SQL use cases from digital marketing will be useful for you. It's a handy skill indeed and can help you with many projects. These SQL snippets made my life a lot easier and I use them at work almost every day. What's more, SQL and modern data warehouses are essential tools for data science. Its robust dialect features allow you to model and visualize data with ease. Because SQL is the language that data warehouses and business intelligence professionals use, it's an excellent choice if you want to share data with them. It is the most common way to communicate with almost every data warehouse / lake solution on the market.