@maiconbaum
Created December 28, 2022 22:03
AWS Glue Catalog Table for AWS VPC Flow Logs using Apache Hive Compatible Format and Partition Projection.
CREATE EXTERNAL TABLE IF NOT EXISTS aws_vpc_flow_logs (
  `version` int,
  `account_id` string,
  `interface_id` string,
  `srcaddr` string,
  `dstaddr` string,
  `srcport` int,
  `dstport` int,
  `protocol` bigint,
  `packets` bigint,
  `bytes` bigint,
  `start` bigint,
  `end` bigint,
  `action` string,
  `log_status` string,
  `vpc_id` string,
  `subnet_id` string,
  `instance_id` string,
  `tcp_flags` int,
  `type` string,
  `pkt_srcaddr` string,
  `pkt_dstaddr` string,
  `az_id` string,
  `sublocation_type` string,
  `sublocation_id` string,
  `pkt_src_aws_service` string,
  `pkt_dst_aws_service` string,
  `flow_direction` string,
  `traffic_path` int
)
PARTITIONED BY (
  `region` string,
  `year` string,
  `month` int,
  `day` int,
  `hour` int
)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://<BUCKET-NAME>/AWSLogs/aws-account-id=012345678912/aws-service=vpcflowlogs/'
TBLPROPERTIES (
  "skip.header.line.count" = "1",
  "projection.enabled" = "true",
  "projection.region.type" = "enum",
  "projection.region.values" = "sa-east-1,us-east-1",
  "projection.year.type" = "date",
  "projection.year.format" = "yyyy",
  "projection.year.range" = "2022,NOW",
  "projection.year.interval" = "1",
  "projection.year.unit" = "YEARS",
  "projection.month.type" = "integer",
  "projection.month.range" = "01,12",
  "projection.month.digits" = "2",
  "projection.day.type" = "integer",
  "projection.day.range" = "01,31",
  "projection.day.digits" = "2",
  "projection.hour.type" = "integer",
  "projection.hour.range" = "00,23",
  "projection.hour.digits" = "2",
  "storage.location.template" = "s3://<BUCKET-NAME>/AWSLogs/aws-account-id=012345678912/aws-service=vpcflowlogs/aws-region=${region}/year=${year}/month=${month}/day=${day}/hour=${hour}/"
)
@maiconbaum
Author

For context on how I ended up here, it's easiest to look at this question in the AWS re:Post forum.

The Athena query above creates a table in the AWS Glue Data Catalog for AWS VPC Flow Logs stored in the Apache Hive compatible format (with per-hour partitions), and uses the Athena partition projection feature to avoid running scheduled MSCK REPAIR TABLE or ALTER TABLE ADD PARTITION statements.
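
For reference, a query along the following lines (the filter values are just hypothetical placeholders) restricts the scan to the projected partitions it names, with no ALTER TABLE ADD PARTITION needed beforehand:

SELECT srcaddr, dstaddr, srcport, dstport, action, bytes
FROM aws_vpc_flow_logs
WHERE region = 'us-east-1'
  AND year = '2022'
  AND month = 12
  AND day = 28
  AND hour = 22
  AND action = 'REJECT'
LIMIT 100;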

The following is an example of creating the VPC Flow Log with Terraform:

resource "aws_flow_log" "aws_vpc" {
  log_destination      = "arn:aws:s3:::<BUCKET-NAME>"
  log_destination_type = "s3"
  traffic_type         = "ALL"
  vpc_id               = var.vpc_id
  log_format           = "$${version} $${account-id} $${action} $${interface-id} $${srcaddr} $${dstaddr} $${srcport} $${dstport} $${protocol} $${packets} $${bytes} $${start} $${end} $${log-status} $${vpc-id} $${subnet-id} $${instance-id} $${tcp-flags} $${type} $${pkt-srcaddr} $${pkt-dstaddr} $${region} $${az-id} $${sublocation-type} $${sublocation-id} $${pkt-src-aws-service} $${pkt-dst-aws-service} $${flow-direction} $${traffic-path}"
  tags                 = var.tags
  destination_options {
    file_format                = "parquet"
    hive_compatible_partitions = true
    per_hour_partition         = true
  }
}

@ryangraham

This gist was a lifesaver both as an example and as proof that this could be done with partition projection.

I wrote up my experience here if anyone needs further references: https://medium.com/@ryandeangraham/terraforming-vpc-flow-logs-0b9defb03d67

@lars-fillmore

This is exactly what I was after!!! Thanks for the work @maiconbaum
Your TF write up will also come in handy @ryangraham

I was wondering why go to all the effort of using the Hive partitioning style in the first place, and whether I'm making things more complicated than they need to be. What does Hive-compatible partitioning provide that the standard partitioning style AWS seems to use in many of its other services doesn't?
