sahlone/ALT Click And Conversion S3 ETL.md

Last active June 19, 2018 13:29

Click/Conversions ETL to S3 and Athena

This gist describes how click and conversion data is loaded from the Tracker into S3 and Athena for business processes.

The process is as follows:

The Tracker receives click and conversion data from outside sources and pushes it to a Kafka topic.
The Matcher reads the Kafka topic produced by the Tracker and matches clicks with conversions to produce the matched conversion data.
The Job reads the Kafka topic produced by the Matcher and uploads the data to S3. Once the data is in S3, we can define the schema in Athena and use Athena to run SQL queries on top of it.
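
The upload and schema steps can be sketched roughly as follows. This is only an illustration under assumptions, not the actual Job: the bucket name, key prefix, record fields (`click_id`, `conversion_id`, `timestamp_ms`), the hourly partition layout, and the JSON file format are all hypothetical, and the Kafka consumer and the S3 upload call are left out so the sketch stays self-contained.

```python
import json
from collections import defaultdict
from datetime import datetime, timezone

# Hypothetical names: neither the bucket nor the prefix comes from the original notes.
BUCKET = "example-tracking-bucket"
PREFIX = "matched_conversions"


def partition_path(ts_ms: int) -> str:
    """Return a Hive-style partition path (dt=YYYY-MM-DD/hour=HH) for an epoch-millisecond timestamp (UTC)."""
    ts = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)
    return f"dt={ts:%Y-%m-%d}/hour={ts:%H}"


def build_batches(records, batch_id):
    """Group one batch of records by partition and return {s3_key: newline-delimited JSON body}."""
    grouped = defaultdict(list)
    for record in records:
        grouped[partition_path(record["timestamp_ms"])].append(record)
    return {
        f"{PREFIX}/{path}/batch-{batch_id}.json": "\n".join(
            json.dumps(row, sort_keys=True) for row in rows
        )
        for path, rows in grouped.items()
    }


def create_table_ddl(database: str, table: str) -> str:
    """Build an Athena DDL statement for an external table over the uploaded files (assumed columns)."""
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {database}.{table} (\n"
        "  click_id string,\n"
        "  conversion_id string,\n"
        "  timestamp_ms bigint\n"
        ")\n"
        "PARTITIONED BY (dt string, hour string)\n"
        "ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'\n"
        f"LOCATION 's3://{BUCKET}/{PREFIX}/'"
    )
```

In a real job each returned key and body would be written to S3 (for example with boto3's `put_object`), and new partitions would have to be registered in Athena, for example with `MSCK REPAIR TABLE` or `ALTER TABLE ... ADD PARTITION`, before queries can see them.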

Author

What is a data scan: When you fire a query, Athena scans the data and produces the results from it, so the amount of data scanned by Athena is what you are billed for. That is where partitioning can help, but after talking to Gair I learned that there are no fixed criteria for the queries, so there is no obvious partition key to choose. Gair was OK with that, since we will eventually move the data from S3 to BigQuery. For now, partitioning will still help with uploading data to S3, as we will do that in batches.
Note: Athena doesn't actually store data. Exactly like Hive, it stores only metadata and fetches the data on demand.
Is Athena supported in Terraform: Yes, Terraform supports it as well.
What about Glue: Glue is meant for ETL jobs, which are Spark jobs running on a schedule, so I don't think we need it, as we will manage the jobs ourselves. The only thing it gives us is automatic schema definition; we are not too lazy to take care of that ourselves, and you pay for every crawler run.
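
To make the data-scan point concrete, here is a rough back-of-the-envelope sketch of how partitioning changes the billed amount. The price per terabyte, the definition of a terabyte as 2**40 bytes, and the partition sizes are assumptions for illustration only; check the current Athena pricing for the region in use.

```python
TB = 1024 ** 4           # assumption: one terabyte counted as 2**40 bytes
GIB = 1024 ** 3
PRICE_PER_TB_USD = 5.0   # assumption: illustrative price per TB scanned


def bytes_scanned(partition_sizes, wanted=None):
    """Sum the sizes of the partitions a query has to read.

    wanted=None models a query with no partition filter (full scan).
    """
    if wanted is None:
        return sum(partition_sizes.values())
    return sum(size for part, size in partition_sizes.items() if part in wanted)


def scan_cost_usd(n_bytes, price_per_tb_usd=PRICE_PER_TB_USD):
    """Cost of scanning n_bytes at the assumed per-terabyte price."""
    return n_bytes / TB * price_per_tb_usd


# Hypothetical data set: 30 daily partitions of 1 GiB each.
sizes = {f"dt=2018-06-{day:02d}": GIB for day in range(1, 31)}

full_scan = scan_cost_usd(bytes_scanned(sizes))
one_day = scan_cost_usd(bytes_scanned(sizes, wanted={"dt=2018-06-19"}))
print(f"full scan: ${full_scan:.4f}, single day: ${one_day:.4f}")
```

Under these assumptions a query that filters on a single `dt` value reads one thirtieth of the data, and costs one thirtieth as much as the unfiltered query. The saving only materialises when queries actually filter on the partition column, which is why the lack of fixed query criteria matters.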

BigQuery import is currently not required.