An approach for updating partitions in an existing table using CTAS queries.
With the release of CTAS (CREATE TABLE AS SELECT) functionality in Athena, you can now create derivative tables with different data formats or S3 locations.
Sometimes, though, you want to be able to add data to a partition of an existing table. As long as that partition doesn't already exist, you can do this with Athena by using CTAS with a temporary table.
For this example, we'll use the Amazon Reviews open dataset. The dataset is provided in both TSV and Parquet formats, but we'll use TSV as our source and show how to insert into a partitioned Parquet dataset.
CREATE EXTERNAL TABLE amazon_reviews_tsv(
marketplace string,
customer_id string,
review_id string,
product_id string,
product_parent string,
product_title string,
product_category string,
star_rating int,
helpful_votes int,
total_votes int,
vine string,
verified_purchase string,
review_headline string,
review_body string,
review_date bigint,
year int)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
ESCAPED BY '\\'
LINES TERMINATED BY '\n'
LOCATION 's3://amazon-reviews-pds/tsv/'
TBLPROPERTIES ("skip.header.line.count"="1");
Run a simple query to ensure the CREATE statement worked:
SELECT * FROM amazon_reviews_tsv WHERE "$path" LIKE '%.tsv' LIMIT 10
CTAS queries can create the table for you, but since we want a partitioned output table, we're going to create it ahead of time.
For this example, we're going to partition by product_category.
Note: Replace the S3 bucket with your own.
CREATE EXTERNAL TABLE amazon_reviews_parquet(
marketplace string,
customer_id string,
review_id string,
product_id string,
product_parent string,
product_title string,
star_rating int,
helpful_votes int,
total_votes int,
vine string,
verified_purchase string,
review_headline string,
review_body string,
review_date bigint,
year int)
PARTITIONED BY ( `product_category` string )
STORED AS PARQUET
LOCATION 's3://<bucket>/amazon_reviews_partitioned';
OK! Now we have a source TSV table and a Parquet destination table with no data in it. Let's fix that.
Here's the process we'll follow for inserting partitioned data into our destination table:

- Use a CTAS query to convert the TSV data to Parquet with a temporary table
  - Note that the S3 location will be a partition inside the location above
- Add the new folder as a partition of the destination table
- Delete the temporary table

A few things to note about the CTAS query:

- We can specify any S3 location, so we'll use a partition inside the amazon_reviews_parquet table's location.
- We have to manually specify the column list because we are extracting the product_category column as the partition, and it has to be last.
- The S3 location here is hard-coded with the product category we're querying for.
CREATE TABLE tmp_ctas_001 WITH (
format='PARQUET',
external_location = 's3://<bucket>/amazon_reviews_partitioned/product_category=Toys/'
)
AS SELECT marketplace, customer_id, review_id, product_id, product_parent, product_title, star_rating, helpful_votes, total_votes, vine, verified_purchase, review_headline, review_body, review_date, year, product_category
FROM amazon_reviews_tsv
WHERE product_category = 'Toys'
This query will take about a minute to run and create ~750 MiB of data in your S3 bucket.
Now that we've converted the data to Parquet and written it to an S3 location that is a partition inside the amazon_reviews_parquet table, we can manually add the partition.
ALTER TABLE amazon_reviews_parquet ADD PARTITION (product_category='Toys')
Finally, drop the temporary table. This only drops the table definition, not the underlying data.
DROP TABLE tmp_ctas_001
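If you want to repeat this process for other categories, the three statements above can be generated programmatically and run with whatever Athena client you prefer. Here's a minimal Python sketch; the partition_statements helper and its bucket parameter are illustrative, not part of any Athena API, and category values are assumed to be simple strings like 'Toys' (no quoting or escaping is handled).

```python
# Illustrative helper: builds the CTAS, ADD PARTITION, and DROP statements
# used in this walkthrough for a given product_category. Table names match
# the examples above; the bucket is a placeholder you supply.
COLUMNS = (
    "marketplace, customer_id, review_id, product_id, product_parent, "
    "product_title, star_rating, helpful_votes, total_votes, vine, "
    "verified_purchase, review_headline, review_body, review_date, year, "
    "product_category"
)

def partition_statements(category: str, bucket: str, tmp_table: str = "tmp_ctas_001"):
    """Return the (ctas, add_partition, drop) SQL strings for one category."""
    # The CTAS output location doubles as the partition folder.
    location = f"s3://{bucket}/amazon_reviews_partitioned/product_category={category}/"
    ctas = (
        f"CREATE TABLE {tmp_table} WITH (\n"
        f"  format='PARQUET',\n"
        f"  external_location = '{location}'\n"
        f")\n"
        f"AS SELECT {COLUMNS}\n"
        f"FROM amazon_reviews_tsv\n"
        f"WHERE product_category = '{category}'"
    )
    add_partition = (
        f"ALTER TABLE amazon_reviews_parquet "
        f"ADD PARTITION (product_category='{category}')"
    )
    drop = f"DROP TABLE {tmp_table}"
    return ctas, add_partition, drop
```

Run the three statements in order for each category you want to load, waiting for each to finish before starting the next.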
You should now be able to query the destination table.
SELECT * FROM amazon_reviews_parquet LIMIT 10
SELECT COUNT(*) FROM amazon_reviews_parquet
This is a neat trick: effectively a more flexible version of INSERT INTO.