Foundational understanding of Amazon Redshift, a distributed columnar database within the AWS cloud. Integration into the AWS ecosystem and its SQL constructs, including common table expressions and analytics queries. Create tables, manage data distribution, handle raw data files, and extract information from JSON. Optimize queries, tune performance, and enhance data security.
By Jason Myers, Co-Author of Essential SQLAlchemy and Software Engineer
- Distributed Columnar database
- PostgreSQL 9 syntax with some enhancements
- Serverless and provisioned clusters
- Supports external schemas
- Data types
- Basic data types - numeric, character, boolean, datetime
- Special data types -
super
- arrays, tuples, sjon,varbyte
- Casting data types and
TO
functions
- Viewing databases and schemas
SVV_REDSHIFT_SCHEMAS
(internal) orSVV_ALL_SCHEMAS
(internal and external)SVV_REDSHIFT_TABLES
(internal) orSVV_ALL_TABLES
(internal and external)
-- View the table name
SELECT table_name
-- Using a view with both internal and external tables
FROM SVV_ALL_TABLES
-- In the external spectrumdb schema
WHERE schema_name = 'spectrumdb';
-- View the column details
SELECT column_name,
data_type,
character_maximum_length AS character_max_len,
numeric_precision,
numeric_scale
-- Using a view with both internal and external schema's columns
FROM SVV_ALL_COLUMNS
-- Only in the spectrumdb schema
WHERE schema_name = 'spectrumdb'
-- For the ecommerce_sales table
AND table_name = 'ecommerce_sales';
SELECT CAST(2.00 AS INTEGER) AS our_int;
-- Convert the string to a date
SELECT CAST('14-01-2024 02:36:48' AS DATE) AS out_date;
-- Parse the string to a date
SELECT TO_DATE('14-01-2024 02:36:48', 'DD-MM-YYYY') AS our_date;
-- Get the name of the month in a date
SELECT TO_CHAR(date '2024-01-14', 'MONTH') AS month_name;
- Redshift cluster architecture
- Leader node - provides connections, builds and distributes query execution plans, can execute entire queries, has exclusive functions
- Compute node - provides data storage, executes code from the leader node on locally stored data
- Internal database components
- MetaData Catalog
- Query Engine
- Storage
- Common table expressions (CTEs)
- Temporary result set
- Simplify queries
- Alternative to subqueries
- Performance - same as subqueries, better than subqueries if reused
- Date and time functions - truncating, getting parts of, comparing, difference, incrementing
-- CTE for top 10 divions by revenue
WITH top_ten_divisions_by_rev AS (
SELECT division_id,
SUM(revenue) AS revenue_total
FROM orders
GROUP BY division_id
ORDER BY revenue_total DESC
LIMIT 10)
-- Main query
SELECT division_id, revenue_total
-- FROM our CTE
FROM top_ten_divisions_by_rev
WHERE revenue_total > 100000;
-- Truncate to hour based on SYSDATE of 2024-01-27 20:05:55.976353
SELECT DATE_TRUNC('minute', SYSDATE);
-- Get current month based on SYSDATE of 2024-01-27 20:05:55.976353
SELECT DATE_PART(month, SYSDATE);
-- Days till end of first quarter based on SYSDATE of 2024-01-27 20:05:55.976353
SELECT DATEDIFF(day,TRUNC(SYSDATE),'2024-03-31') AS days_diff;
- Windowing in Redshift
- Windows functions
- Operate on a window (partition) of data with a value for every row in that window
- Group functions aggregate result rows, but window functions do not
- Defined via an OVER clause
- Three main concepts
Partitioning
- forming groups of rows ( PARTITION BY )Ordering
- order within each partition ( ORDER BY )Framing
- optional with additional restrictions on the rowsLAG
andLEAD
help us get data for a row from one above(before) or below(after) it in the window according to theORDER BY
clauseRANK
allows us to rank a value over the window according to the ORDER BY clause starting with 1
SELECT division_id,
sale_date,
revenue,
-- Calculate the average revenue
AVG(revenue) OVER (
-- By division for each year and month
PARTITION BY division_id,
DATE_PART('year', sale_date)
DATE_PART('month', sale_date),
) AS month_avg_revenue
FROM orders
ORDER BY division_id,
sale_date DESC;
- Transactions
- Default Execution Behavior - each SQL statement is a transaction!
- Transactions Affect some Functions -
SYSDATE
,TIMESTAMP
,CURRENT_DATE
- Some functions skirt around Transactions -
GETDATE
,TIMEOFDAY
-- Start a transaction
BEGIN;
SELECT name, priority,
FROM data_log
-- SYSDATE = 2024-02-07 00:17:24.259227
WHERE intake_ts < SYSDATE;
SELECT name, data_size,
FROM data_details
-- SYSDATE = 2024-02-07 00:17:24.259227
WHERE current_intake_date < SYSDATE;
-- End a transaction
END;
- Distributing data
- Distributed across compute nodes
- Uses Redshift internal row id,
DISTKEY
, orPRIMARY KEY
- Several distribution styles -
ALL
,KEY
,EVEN
,AUTO
CREATE TABLE IDAHO_SITE_ID
(
'pk_siteid' INTEGER PRIMARY KEY,
'fk_monitoringlocation' INTEGER,
'organization_id' VARCHAR(31),
'organizationformalname' VARCHAR(68),
'organization' VARCHAR(16)
)
DISTKEY(organization_id)
-- Sets the distribution style to key
DISTSTYLE KEY
-- Sets the data sort key to fk_monitoringlocation
SORTKEY(fk_monitoringlocation)
-- Sets fk_monitoringlocation, organization_id as compound sort keys
COMPOUND SORTKEY(fk_monitoringlocation, organization_id);
- External Schemas, File, and Table Formats
- When metadata catalog and storage are not part of the cluster it is considered external
- Redshift is the engine, uses AWS Glue Data Catalog and AWS S3 storage by default
- CSV external table
- Querying spectrum tables
- Works just like querying internal tables
EXPLAIN
will look different- No concern for DISTKEY or SORTKEY
- Pseudocolumns
$path
- shows the file storage path for the row$size
- shows the file size for the row
- Table formats -
Hive
,Iceberg
,Hudi
,Deltalake
,Read only
- Some like
Hive
need an external catalog other than AWS Glue
- Some like
CREATE TABLE spectrumdb.IDAHO_SITE_ID
(
'pk_siteid' INTEGER PRIMARY KEY,
-- Cutting the rest of columns for space
)
-- CSV rows are comma delimited
ROW FORMAT DELIMITED
-- CSV fields are terminated by a comma
FIELDS TERMINATED BY ','
-- CSVs are a type of text file
STORED AS TEXTFILE
-- This is where the data is in AWS S3
LOCATION 's3://spectrum-id/idaho_sites/'
-- This file has headers that we want to skip
TABLE PROPERTIES ('skip.header.line.count'='1');
- Semistructured data
- Stored in Redshift's SUPER type
- Has dedicated SQL functions for different formats
JSON
is a type of semistructured dataIS_VALID_JSON
- makes sure an entire JSON object is validJSON_EXTRACT_PATH_TEXT
returns the value at a json path- Extracting from nested JSON paths
- Extracting from a JSON array
- Extracting from a nested JSON array
- Casting in CTEs
SELECT IS_VALID_JSON('{"one":1, "two":2}');
SELECT JSON_EXTRACT_PATH_TEXT('{"one":1, "two":2}', 'one');
SELECT JSON_EXTRACT_PATH_TEXT('
{
"one_object":{
"nested_three": 3,
"nested_four":4
},
"two":2
}',
'one_object', 'nested_three');
SELECT JSON_EXTRACT_ARRAY_ELEMENT_TEXT('[1.1,400,13]', 2);
WITH location_details AS (
SELECT '{
"location": "Salmon Challis National Forest",
}'::SUPER::VARCHAR AS data
)
- Writing optimal queries
- Avoid
SELECT *
- Don't select columns you don't need in the result
- Remember that Redshift is columnar and pulls data by column
- Use
DISTKEY
andSORTKEY
- Use subqueries wisely
- Avoid
- Understanding query performance
- Query optimization process
- Check the
STL_ALERT_EVENT_LOG
table - contains any Redshift alerts triggered by a query - Run
EXPLAIN
on your query - Check the
SVL_QUERY_SUMMARY
andSVL_QUERY_REPORT
tables
- Check the
- Query optimization process
- Redshift security
- Column level access control
- Row level security via policies
- Data masking via policies
- Policy that obscures values returned by a query
- Only a super user or someone granted can see them
- Uses National ID numbers (e.g. Social Security Number), Credit cards
SELECT *
FROM SVV_COLUMN_PRIVILEGES
WHERE relation_name = 'products';
CREATE RLS POLICY policy_books
WITH (category VARCHAR(255))
USING (category = 'Dark Academia');
SELECT polname AS policy_name,
polatts AS column_details,
polqual AS condition
FROM SVV_RLS_POLICY;