Skip to content

Instantly share code, notes, and snippets.

@brshallo
Last active October 11, 2024 18:16
Show Gist options
  • Save brshallo/c66ab2a0dad36fdfcaa91d13f4c887fe to your computer and use it in GitHub Desktop.
Save brshallo/c66ab2a0dad36fdfcaa91d13f4c887fe to your computer and use it in GitHub Desktop.
Example of how doing UNION ALL can return fewer rows

SQL Example: Summarizing Data with UNION ALL vs. GROUP BY

Data Setup

We create a table with products and segments:

-- Create the sample table
CREATE TABLE sales_data (
    product VARCHAR(10),
    segment VARCHAR(10),
    value INT
);

-- Insert data with all product-segment combinations (each having 2 values)
INSERT INTO sales_data (product, segment, value) VALUES
('A', 'Group1', 150), ('A', 'Group1', 200),
('A', 'Group2', 180), ('A', 'Group2', 220),
('A', 'Group3', 160), ('A', 'Group3', 190),
('A', 'Group4', 170), ('A', 'Group4', 210),
('A', 'Group5', 190), ('A', 'Group5', 220),
('B', 'Group1', 140), ('B', 'Group1', 180),
('B', 'Group2', 150), ('B', 'Group2', 200),
('B', 'Group3', 170), ('B', 'Group3', 230),
('B', 'Group4', 190), ('B', 'Group4', 240),
('B', 'Group5', 210), ('B', 'Group5', 250),
('C', 'Group1', 130), ('C', 'Group1', 170),
('C', 'Group2', 140), ('C', 'Group2', 180),
('C', 'Group3', 150), ('C', 'Group3', 200),
('C', 'Group4', 160), ('C', 'Group4', 210),
('C', 'Group5', 170), ('C', 'Group5', 220),
('D', 'Group1', 200), ('D', 'Group1', 250),
('D', 'Group2', 210), ('D', 'Group2', 270),
('D', 'Group3', 220), ('D', 'Group3', 280),
('D', 'Group4', 230), ('D', 'Group4', 290),
('D', 'Group5', 240), ('D', 'Group5', 300);

-- Approach 1: Using Separate Queries with UNION ALL
-- Summarize by product
SELECT 
    product AS products_segments,
    SUM(value) AS total_value
FROM 
    sales_data
GROUP BY 
    product

UNION ALL

-- Summarize by segment
SELECT 
    segment AS products_segments,
    SUM(value) AS total_value
FROM 
    sales_data
GROUP BY 
    segment;

-- Sample Output for Approach 1
-- | products_segments | total_value |
-- |-------------------|-------------|
-- | A                 | 1700        |
-- | B                 | 2140        |
-- | C                 | 1850        |
-- | D                 | 2480        |
-- | Group1            | 1500        |
-- | Group2            | 1780        |
-- | Group3            | 1930        |
-- | Group4            | 2240        |
-- | Group5            | 2440        |

-- Approach 2: Single Group By Query on Both Product and Segment
SELECT 
    product,
    segment,
    SUM(value) AS total_value
FROM 
    sales_data
GROUP BY 
    product, segment;

-- Sample Output for Approach 2
-- | product | segment | total_value |
-- |---------|---------|-------------|
-- | A       | Group1  | 350         |
-- | A       | Group2  | 400         |
-- | A       | Group3  | 350         |
-- | A       | Group4  | 380         |
-- | A       | Group5  | 410         |
-- | B       | Group1  | 320         |
-- | B       | Group2  | 350         |
-- | B       | Group3  | 400         |
-- | B       | Group4  | 430         |
-- | B       | Group5  | 460         |
-- | C       | Group1  | 300         |
-- | C       | Group2  | 320         |
-- | C       | Group3  | 350         |
-- | C       | Group4  | 370         |
-- | C       | Group5  | 390         |
-- | D       | Group1  | 450         |
-- | D       | Group2  | 480         |
-- | D       | Group3  | 500         |
-- | D       | Group4  | 520         |
-- | D       | Group5  | 540         |

Summary

  • UNION ALL Approach: Fewer rows, higher-level summary that separately aggregates totals for products and segments.
  • Group By Product and Segment: More rows with a detailed breakdown with a more detailed summary by each product-segment combination.
  • Of course you can get the first approach from the data outputted by the second approach with another simple aggregation. However if you only need either Product aggregations or Segment aggregations but not aggregates of product-segment combinations and your goal is to transfer as small amount of data as possible than the UNION ALL approach returns fewer rows -- in this case 9 rows (5 + 4) rather than 20 (5 * 4). These differences become more dramatic if you increase the number of unique values in each variable.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment