Skip to content

Instantly share code, notes, and snippets.

@farski
Last active November 21, 2024 02:08
Show Gist options
  • Save farski/2035d193d8e30ab70cb2212cff162b66 to your computer and use it in GitHub Desktop.
Save farski/2035d193d8e30ab70cb2212cff162b66 to your computer and use it in GitHub Desktop.

More snippets

Standard Access Logs

The included SQL and CloudFormation YAML snippets create AWS Glue tables for standard access logs from Amazon CloudFront, which can be queried using Amazon Athena. The snippets are intended to be equivalent.

The tables that are created utilize partition projections, which allows for efficient, cost-effective querying of the log data in Amazon S3, without the need to manual create new partitions. Instead, the partitions are created based on the filesnames of the log files. Using the year, month, and day portions of the filenames, daily partitions are available when querying the log data.

Filenames that are suitable for partition projections are only available when using v2 Standard Logging in CloudFront, which became available in November 2024. Log files created using legacy standard logging are always created with filenames that are not compatible with partition projections.

Since you can define the filenames however you want when configuring v2 logging, the LOCATION and storage.location.template values below may need to be changed to match your configuration. It's important to make sure that {yyyy}/{MM}/{dd} appears somewhere in the v2 logging partitioning file path, and that storage.location.template includes ${date} in the same place.

Date-based partitions have a specific range in the configuration, which should be set to something appropriate for your use case.

When querying a table that includes the partitions, the SELECT should include a WHERE clause that filters data covered by the range of the partition. For example:

SELECT *
FROM __DATABASE_NAME__.__TABLE_NAME__ 
WHERE "date" >= date_format(current_date - interval '7' day, '%Y/%m/%d')

The date field in the WHERE clause matches the date projection configured on the table. The name of this projection is arbitrary.

CREATE EXTERNAL TABLE IF NOT EXISTS __DATABASE_NAME__.__TABLE_NAME__ (
`date` date COMMENT "UTC date as YYYY-MM-DD",
`time` string COMMENT "UTC time when the response finished as HH:MM:SS",
location string COMMENT "Code for the edge location that served the request as XYZ1",
bytes bigint COMMENT "Total number of bytes that the server sent to the viewer",
request_ip string COMMENT "The IP address of the viewer that made the request",
method string COMMENT "The HTTP method of the request",
host string COMMENT "Domain name of the CloudFront distribution (e.g., d111111abcdef8.cloudfront.net)",
uri string COMMENT "The path portion of the request URL, without query parameters",
status int COMMENT "The HTTP status code of the response, or 000 is the viewer closed the connection",
referrer string COMMENT "The value of the Referer header in the request",
user_agent string COMMENT "The value of the User-Agent header in the request",
query_string string COMMENT "The query string portion of the request path, or a hyphen",
cookie string COMMENT "The value of the Cookie header in the request",
result_type string COMMENT "A classification of the response, such as Hit, Miss, and Error",
request_id string COMMENT "An opaque string that uniquely identifies a request",
host_header string COMMENT "The value of the Host header in the request",
request_protocol string COMMENT "The protocol of the viewer request, http, https, ws, wss ",
request_bytes bigint COMMENT "The total number of bytes in the request",
time_taken float COMMENT "Number of seconds the request took (e.g., 0.123)",
xforwarded_for string COMMENT "The value of the X-Forwarded-For header in the request, or a hyphen",
ssl_protocol string COMMENT "The SSL/TLS protocol that the viewer and server negotiated, or a hyphen",
ssl_cipher string COMMENT "The SSL/TLS cipher that the viewer and server negotiated, or a hyphen",
response_result_type string COMMENT "A classification of the response, such as Hit, Miss, and Error",
http_version string COMMENT "The HTTP version of the request",
fle_status string COMMENT "A code that indicates whether the request body was successfully processed, or a hyphen",
fle_encrypted_fields int COMMENT "The number of encrypted fields forwarded to the origin, or a hyphen",
c_port int COMMENT "The port number of the request",
time_to_first_byte float COMMENT "The number of seconds between receiving the request and writing the first byte of the response",
x_edge_detailed_result_type string COMMENT "When the result type is an error, this is the type of error, othewise it's the result type",
sc_content_type string COMMENT "The value of the Content-Type header in the response",
sc_content_len bigint COMMENT "The value of the Content-Length header in the response",
sc_range_start bigint COMMENT "The start value of the Content-Range header in the response, or a hyphen",
sc_range_end bigint COMMENT "The end value of the Content-Range header in the response, or a hyphen"
)
COMMENT "TKTKTK A description of the table"
PARTITIONED BY (
`date` string
)
ROW FORMAT SERDE "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"
WITH SERDEPROPERTIES (
"field.delim" = "\t",
"serialization.format" = "\t"
)
STORED AS
INPUTFORMAT "org.apache.hadoop.mapred.TextInputFormat"
OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"
LOCATION "s3://__BUCKET_AND_PREFIX__/__DistributionId__/"
TBLPROPERTIES (
"skip.header.line.count" = "1"
"projection.enabled" = "true",
"projection.date.type" = "date",
"projection.date.range" = "2024/11/01,NOW",
"projection.date.format" = "yyyy/MM/dd",
"projection.date.interval" = "1",
"projection.date.interval.unit" = "DAYS",
"storage.location.template" = "s3://__BUCKET_AND_PREFIX__/__DistributionId__/${date}"
);
AccessLogsGlueTable:
Type: AWS::Glue::Table
Properties:
CatalogId: !Ref AWS::AccountId
DatabaseName: __DATABASE_NAME__
TableInput:
Description: TKTKTK A description of the table
Name: __TABLE_NAME__
Parameters:
skip.header.line.count: "2"
# The name of the projection (e.g., date in projection.date.format) is arbitrary, but
# does need to match in the projection definition, the storage location, the PartitionKeys,
# and any queries made against the table that should honor the partitions.
projection.enabled: "true"
projection.date.type: date
projection.date.range: 2024/11/01,NOW
projection.date.format: yyyy/MM/dd
projection.date.interval: "1"
projection.date.interval.unit: DAYS
storage.location.template: s3://__BUCKET_AND_PREFIX__/__DistributionId__/${date}
PartitionKeys:
- Name: date
Type: string
StorageDescriptor:
Columns:
- Name: date
Type: date
Comment: UTC date as YYYY-MM-DD
- Name: time
Type: string
Comment: UTC time when the response finished as HH:MM:SS
- Name: location
Type: string
Comment: Code for the edge location that served the request as XYZ1
- Name: bytes
Type: bigint
Comment: Total number of bytes that the server sent to the viewer
- Name: request_ip
Type: string
Comment: The IP address of the viewer that made the request
- Name: method
Type: string
Comment: The HTTP method of the request
- Name: host
Type: string
Comment: Domain name of the CloudFront distribution (e.g., d111111abcdef8.cloudfront.net)
- Name: uri
Type: string
Comment: The path portion of the request URL, without query parameters
- Name: status
Type: int
Comment: The HTTP status code of the response, or 000 is the viewer closed the connection
- Name: referrer
Type: string
Comment: The value of the Referer header in the request
- Name: user_agent
Type: string
Comment: The value of the User-Agent header in the request
- Name: query_string
Type: string
Comment: The query string portion of the request path, or a hyphen
- Name: cookie
Type: string
Comment: The value of the Cookie header in the request
- Name: result_type
Type: string
Comment: A classification of the response, such as Hit, Miss, and Error
- Name: request_id
Type: string
Comment: An opaque string that uniquely identifies a request
- Name: host_header
Type: string
Comment: The value of the Host header in the request
- Name: request_protocol
Type: string
Comment: The protocol of the viewer request, http, https, ws, wss
- Name: request_bytes
Type: bigint
Comment: The total number of bytes in the request
- Name: time_taken
Type: float
Comment: Number of seconds the request took (e.g., 0.123)
- Name: xforwarded_for
Type: string
Comment: The value of the X-Forwarded-For header in the request, or a hyphen
- Name: ssl_protocol
Type: string
Comment: The SSL/TLS protocol that the viewer and server negotiated, or a hyphen
- Name: ssl_cipher
Type: string
Comment: The SSL/TLS cipher that the viewer and server negotiated, or a hyphen
- Name: response_result_type
Type: string
Comment: A classification of the response, such as Hit, Miss, and Error
- Name: http_version
Type: string
Comment: The HTTP version of the request
- Name: fle_status
Type: string
Commen: A code that indicates whether the request body was successfully processed, or a hyphen
- Name: fle_encrypted_fields
Type: int
Comment: The number of encrypted fields forwarded to the origin, or a hyphen
- Name: c_port
Type: int
Comment: The port number of the request
- Name: time_to_first_byte
Type: float
Comment: The number of seconds between receiving the request and writing the first byte of the response
- Name: x_edge_detailed_result_type
Type: string
Comment: When the result type is an error, this is the type of error, othewise it's the result type
- Name: sc_content_type
Type: string
Comment: The value of the Content-Type header in the response
- Name: sc_content_len
Type: bigint
Comment: The value of the Content-Length header in the response
- Name: sc_range_start
Type: bigint
Comment: The start value of the Content-Range header in the response, or a hyphen
- Name: sc_range_end
Type: bigint
Comment: The end value of the Content-Range header in the response, or a hyphen
InputFormat: org.apache.hadoop.mapred.TextInputFormat
Location: !Sub s3://__BUCKET_AND_PREFIX__/__DistributionId__/
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
SerdeInfo:
Parameters:
field.delim: "\t"
serialization.format: "\t"
SerializationLibrary: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
TableType: EXTERNAL_TABLE
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment