The scope of this gist is to define the process of loading click/conversion data from the Tracker to S3/Athena for business processes.
The process is as follows:

- The Tracker receives click/conversion data from outside sources and pushes it to a Kafka topic.
- The Matcher reads the Kafka topic produced by the Tracker and matches clicks against conversions to produce matched conversion data.
- The Job consumes the Kafka topic produced by the Matcher and uploads the data to S3. From S3, we can define the schema in Athena and use Athena to run SQL queries on top of the data.
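The Job step described above could be sketched roughly as follows. This is a minimal, hypothetical illustration, not the actual implementation: the topic name, bucket, key layout, and batch size are all assumptions, and the caller is expected to supply a Kafka consumer (e.g. kafka-python's `KafkaConsumer`) and a boto3 S3 client.

```python
import json
from datetime import datetime, timezone

# Hypothetical sketch of the Job: consume matched conversions from the
# Matcher's Kafka topic and upload newline-delimited JSON batches to S3.
# All names (key prefix, batch size) are illustrative assumptions.

def records_to_jsonl(records):
    """Serialize dicts to newline-delimited JSON, one object per line,
    a layout Athena can query directly with its JSON SerDe."""
    return "\n".join(json.dumps(r, sort_keys=True) for r in records).encode("utf-8")

def s3_key_for(ts, batch_id):
    """Date-partitioned key (dt=YYYY-MM-DD) so Athena can prune partitions."""
    return f"matched-conversions/dt={ts:%Y-%m-%d}/batch-{batch_id}.json"

def run(consumer, s3, bucket, batch_size=500):
    """Drain the Kafka consumer and upload one S3 object per batch."""
    batch, batch_id = [], 0
    for msg in consumer:
        batch.append(json.loads(msg.value))
        if len(batch) >= batch_size:
            s3.put_object(Bucket=bucket,
                          Key=s3_key_for(datetime.now(timezone.utc), batch_id),
                          Body=records_to_jsonl(batch))
            batch, batch_id = [], batch_id + 1
```

Date-based key partitioning is one common choice because it lets Athena skip whole partitions when queries filter on date.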
The ETL job to push data to BigQuery will not be part of the system for now, as discussed with Gaiar.
We need some insight here on which data format we should use, and on how to define the process.
I have also added a screenshot of the formats we can use and the flexibility each format offers.
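As a rough, stdlib-only illustration of the trade-off the format choice involves (the records and field names below are synthetic, not from the real pipeline): row formats that repeat field names in every record, such as JSON Lines, are larger than tabular ones like CSV, and columnar formats like Parquet or ORC go further by letting Athena read only the queried columns.

```python
import csv, gzip, io, json

# Synthetic click records; field names are made up for illustration only.
records = [{"click_id": i, "campaign": "c%03d" % (i % 5), "revenue": 0.1 * i}
           for i in range(1000)]

def jsonl_bytes(rows):
    """JSON Lines: every record repeats every field name."""
    return "\n".join(json.dumps(r, sort_keys=True) for r in rows).encode()

def csv_bytes(rows):
    """CSV: field names appear once, in the header row."""
    buf = io.StringIO()
    w = csv.DictWriter(buf, fieldnames=sorted(rows[0]))
    w.writeheader()
    w.writerows(rows)
    return buf.getvalue().encode()

def gz_size(data):
    """Compressed size, since we would likely compress objects on S3 anyway."""
    return len(gzip.compress(data))
```

Comparing `len(jsonl_bytes(records))` against `len(csv_bytes(records))` (and their `gz_size`) on a realistic sample of our own data would give a concrete basis for the format decision.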