These snippets are based on various various AWS documents, blogs, etc, but have been revised for clarity and consistency, and to incorprate modern best practices.
- Amazon Athena SQL SELECT
- Amazon Athena SQL CREATE TABLE
- Amazon Athena SQL Reserved words
Partition projection is a feature of Athena that allows for querying of large databases (i.e., data in S3) more efficiently without having to create partitions ahead of time. You can think of them as just-in-time ephemeral partitions that are created based on SELECT
criteria at the time of the query.
There's an S3 bucket called
planet-data
that contains data files for different planets. The files are all prefixed with the name of the planet (e.g,s3://planet-data/Earth/…
ands3://planet-data/Mars/…
). If no partitions are created for an Athena table that queries this bucket, all files for all planets would be scanned on every query, even when the query is only looking at a single planet. So when running a query likeSELECT * WHERE planet_id = 'Earth'
, all the files, including those with theMars/
,Saturn/
, etc prefixes, will be loaded and scanned, even though the result would be only records that pertain to Earth which all came out of theEarth/
prefix. This leads to a lot of unnecessary data being scanned, which increases operational costs when using Athena.
To reduce the amount of data being scanned, Athena can use partitions. A partition allows Athena to scan a subset of objects in S3. In this example, creating partitions for each planet would allow you to query for reports for a specific planet, and only the objects with the matching prefix would be scanned during the query.
Partitions can be defined explicitly, by adding metadata to the table which, in this case, would say: create partitions for each planet name, and the possible names are
Earth
,Mars
, etc. Creating partitions this way can be done manually, or it can be automated using Glue Crawlers or other methods, but the partitions must exist before queries can find the related data in S3. So if a new planet calledHD110082b
was discovered and reports started getting added to the S3 bucket for it with a new prefix, but no partition was created for that planet, Athena wouldn't be able to scan those reports using a partition.
Alternatively, Athena supports partition projection, which allows for scanning partitioned S3 data without needing to define those partitions ahead of time. Instead, a projection is defined, which maps a field in a
SELECT
query to a prefix (or partial prefix) in the data being scanned in S3. If a projection is defined calledPlanetName
, it can be used in a template that describes objects in S3 (likes3://planet-data/${PlanetName}/
). Now, when a query includes aWHERE
clause that specifies one or morePlanetName
values (e.g.,SELECT * WHERE PlanetName = 'Earth'
), Athena will scan for objects objects in S3 using the template (e.g.,s3://planet-data/Earth/
). This matches the structure of the data in the bucket from the beginning of the example, without having explicitly defined any partitions.
Projection partitions are commonly used when querying time-based files, such as log files. If log files are being automatically delivered to a bucket with a naming structure like myBucket/2021/05/03/log1.csv
, a partition projection can be defined to automatically partition the data by year, month, and day, so that when running a SELECT
for a specific set of days, only the log files for those days are scanned.
You can see what's happening under the covers in a case like this if you turn on access logging for the bucket underlying the Athena table.
Given a date-based daily partition projection called
log_date
, and a projection template ofs3://myBucket/myLogs/${log_date}
, imagine a query likeSELECT * from myLogsTable WHERE log_date >= date('2021-05-05') AND log_date < date('2021-05-07')
. This query is selecting all logs for May 5th and 6th of 2021. The table's partition projection has a daily resolution, so only the partitions for those two days will be needed by the query, meaning only the log files for those two days will be scanned from S3.
Examining the S3 access logs for the bucket after running the query would reveal that Athena is making a
ListObjectsV2
API call on the bucket for each partition, and then aGetObject
call for each object that was returned in the list.
For example:
"GET /?list-type=2&prefix=myLogs%2F2021%205%2F05%2F&fetch-owner=false HTTP/1.1"
"GET /myLogs/2021/05/05/log1.csv HTTP/1.1"
"GET /myLogs/2021/05/05/log2.csv HTTP/1.1"
"GET /myLogs/2021/05/05/log3.csv HTTP/1.1"
"GET /?list-type=2&prefix=myLogs%2F2021%205%2F06%2F&fetch-owner=false HTTP/1.1"
"GET /myLogs/2021/05/06/log1.csv HTTP/1.1"
"GET /myLogs/2021/05/06/log2.csv HTTP/1.1"
Hopefully that example demonstrates how a partition projection defined on a table can be used in a SELECT
query to limit which objects in S3 are actually accessed by Athena.
Creating partition projections often leads to having several similar fields in the query results: the fields from the data itself, and the fields created as part of the projection. In the previous example, this would be the log_date
field and perhaps a timestamp
value included in each log line. It's important to remember that Athena will only utilize partitions when the partition projection field is used. If log_date
were omitted from a SELECT
that was filtering using timestamp
, the results would only include the rows from the filter, but Athena would have scanned the entire, unpartitioned set of data from S3.
To filter a query using partitioning and with a resolution that is greater than the pertions, both the data fields and partition projection fields would be used in the same query.
Note: There is currently a limitation with partion projections, where a /
is always included at the end of the prefix
parameter of the ListObjectsV2
request, even if one was not included in the projection template. This makes it impossible to use projection partitions for any S3 objects that don't actually have a /
in the key at that particular spot (which is the case for a number of AWS service log files).