- `date_of_month` — the data collection day, which falls at the end of each month.
- The column `weeks['times']` contains 36 monthly dates spanning the three years prior to `date_of_month`.
- The column `weeks['volumes']` contains the volume amounts for each of those 36 months.
- The column `period` represents a number of months.
- Each (search, `date_of_month`) row has one of three period values: 1, 3, or 6.
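To make the schema concrete, here is a hypothetical example row. The field names follow the description above; the values are invented purely for illustration:

```python
# Hypothetical example row; all values are illustrative, not real data.
example_row = {
    "date_of_month": "2021-01-31",  # collection day at the end of a month
    "period": 3,                    # one of 1, 3, or 6 (months)
    "weeks": {
        "times": ["2018-01-31", "2018-02-28", "..."],  # 36 monthly dates (truncated)
        "volumes": [1200, 950, "..."],                 # one amount per month (truncated)
    },
    "search_volume": 0,             # computed later by the worker
}
```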
Based on my understanding of the information above, here is my analysis of the formula used to solve the take-home challenge.

First, I gather all the distinct periods (the number represents months); there are three: 1, 3, and 6. Then I convert these months into weeks (since we are interested in a weekly average) and associate each result with its period. Based on a Google search, a month equals about 4.345 weeks, so I went with that: to calculate the weeks for each period, I multiply the period value by 4.345. The actual numbers are below.
period (months) | weeks |
---|---|
1 | 4.345 |
3 | 13.035 |
6 | 26.07 |
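As a minimal sketch of this conversion (the constant and the three period values come from the description above; the function name is mine):

```python
WEEKS_PER_MONTH = 4.345  # approximate number of weeks in one month

def period_to_weeks(period_months: int) -> float:
    """Convert a period expressed in months (1, 3, or 6) to weeks."""
    return period_months * WEEKS_PER_MONTH

# {1: 4.345, 3: 13.035, 6: 26.07}
period_weeks = {p: period_to_weeks(p) for p in (1, 3, 6)}
```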
Now that I have the periods sorted out, the next challenge was to get a base weekly average from `weeks['volumes']`. Each `weeks` column holds 36 months of data prior to the `date_of_month` date. To get a weekly average from those 36 monthly volumes, I first had to figure out roughly how many weeks that 36-month timeframe spans. So I subtracted the first date from the last date and used the time delta to calculate the approximate number of weeks; that number was generally around 150 to 152. I then divided the sum of the row's volumes by this number of weeks, and the result is my base weekly average.
weekly_avg = sum_of_volumes / total_weeks_from_dates
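A minimal sketch of that step, assuming the dates parse to `datetime.date` objects and the volumes are plain numbers (names are illustrative):

```python
from datetime import date

def weekly_average(times: list[date], volumes: list[float]) -> float:
    """Base weekly average: total volume divided by the approximate
    number of weeks spanned by the 36 monthly dates (~150-152)."""
    total_weeks = (max(times) - min(times)).days / 7
    return sum(volumes) / total_weeks
```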
Then, for each applicable row, I multiplied the weekly average by the number of weeks corresponding to the row's period, and stored the result in that row's `search_volume` column.
search_volume = weekly_avg * period_in_weeks
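Putting the two steps together for one row, reusing the helpers from the sketches above:

```python
def compute_search_volume(times: list[date], volumes: list[float], period_months: int) -> float:
    """search_volume = weekly_avg * period_in_weeks."""
    return weekly_average(times, volumes) * period_to_weeks(period_months)
```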
There is also a helper `build.sh` script. I have set it up as a switch that invokes the underlying docker-compose setup. One optional flag, `--limit <int>`, is available, along with two choices of argument: `start` or `kill`. The flag and its value need to go before the argument; otherwise, the script will ignore everything. Running the script without `start` will directly invoke the worker container and its job.
Start the service:

```sh
./build.sh start
```
Similar to the above, but the limit flag tells the worker to update the `search_volume` column for only that number of rows. Important to note: regardless of the limit set, the CSV extraction, transformation, and storage of the data are still done at full scale.

```sh
./build.sh --limit 10000 start
```
Kill the docker-compose services. This performs a `docker-compose down`, destroying the containers and networks:

```sh
./build.sh kill
```
If docker-compose has already started and the containers are up and available, you should avoid the `start` argument. Important to note: running without any argument will fail if docker-compose is not already running.

```sh
./build.sh
```
You can also pass the `--limit <int>` flag:

```sh
./build.sh --limit 10000
```
The first time the docker-compose services run, two service containers are built: one for the Postgres DB and the other for a Python worker. There are two scripts, `db_loader.py` and `fix_tp_tsa_search_volume.py`. After an initial check of the database, if the given table is empty, the `db_loader` script is called to populate it. `db_loader` extracts the data from the given `trendpulse_summary_tsa_v2.csv` file and then atomically stores the values in batches of 100,000 rows.
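A minimal sketch of that batched, atomic load using Peewee (the `model` and `rows` parameters are stand-ins for the actual script's names, which I have not reproduced here):

```python
from peewee import chunked

BATCH_SIZE = 100_000

def load_rows(db, model, rows):
    """Insert CSV rows into the model's table in atomic batches of 100,000."""
    with db.atomic():
        for batch in chunked(rows, BATCH_SIZE):
            model.insert_many(batch).execute()
```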
```
# ./build.sh start
Creating network "hr-take-home-data_db" with the default driver
Creating hr-take-home-data_db_1 ... done
Creating hr-take-home-data_worker_1 ... done
Waiting for db (ctrl + c to exit) ................ success
Load start...
Loading: 510000it [02:58, 2851.68it/s, batch 51/51]
Load completed
Upsert start...
Upserting: 500000it [30:14, 275.49it/s, batch 50/50]
Upsert completed
Goodbye!
```
Below are some queries made against the Postgres DB after the worker service finished its task, along with a breakdown of the results.
query | result | description |
---|---|---|
`SELECT COUNT(*) FROM trendpulse_summary_tsa_v2` | 502204 | Total available rows. |
`SELECT COUNT(*) FROM trendpulse_summary_tsa_v2 WHERE search_volume <> 0` | 497491 | Total rows where `search_volume` has been updated. |
`SELECT COUNT(*) FROM trendpulse_summary_tsa_v2 WHERE search_volume = 0` | 4713 | Total rows where `search_volume` is still 0. Some of these rows have `weeks` data; some are empty. |
`SELECT COUNT(*) FROM trendpulse_summary_tsa_v2 WHERE search_volume = 0 AND (weeks != '') IS TRUE` | 1791 | Total rows where `search_volume` is still 0 despite having `weeks` data; the sum of volumes for these rows is 0, so the average is also 0. |
Finally, a summary of the tooling choices and the reasoning behind them.

type | choices | reason |
---|---|---|
scripting | Python, Bash, docker-compose | Part of the challenge. |
database | Postgres | Part of the challenge. |
SQL | Peewee ORM | Ease of use; it abstracts away much of the SQL complexity. |