Bahadir Cambel bcambel

Recently an opportunity at work arose to run some ad hoc data analysis/discovery on our event data. The required analysis was a bit more complicated than what was provided by our current reporting capabilities. This complication was primarily due to both the quality and the layout/structure of the existing data. We wanted to understand how much of the revenue was coming from an internal (machine learning supported) recommendation service, for given products over various time boundaries. From these sales, we wanted to further understand which recommendation algorithm was used to drive the purchase. Measuring the performance of these algorithms would allow us to fine tune them in the future and have the ability to ascertain whether or not those changes made a positive or negative impact on the overall revenue.

It was determined early on that trying to do this in plain SQL wasn't a great fit, and having this data stored in Amazon Redshift constrained the approach to just SQL. Luckily, I had been pulling the

	explain analyse
	select sum(pv), country FROM (
	select sum(pv) as pv, country
	from events_2015_09_01
	group by country
	UNION ALL
	select sum(pv) as pv, country
	from events_2015_09_02
	group by country
	UNION ALL

	INSERT INTO events_2015_09_17
	SELECT count(id) as pv, evt_name, hour, device_type, browser, os, domain, path, utm_source,
	utm_medium, utm_campaign, utm_content, utm_term, country
	FROM events where day='2015-09-17' and evt_name='pv'
	group by 2,3,4,5,6,7,8,9,10,11,12,13,14
	order by count(id) desc;

	CREATE FUNCTION day_view_generator (day character varying) RETURNS void
	VOLATILE
	AS $$
	BEGIN
	EXECUTE format('CREATE TABLE IF NOT EXISTS events_' \|\| day \|\| ' AS (SELECT 1 as pv, evt_name, hour, device_type, browser, os, domain, path, utm_source, utm_medium, utm_campaign, utm_content, utm_term, country FROM events limit 1)');
	END;
	$$ LANGUAGE plpgsql

	explain analyse
	select count(id), country
	FROM events where (ts between '2015-09-01' and '2015-10-01') and evt_name='pv'
	GROUP BY country
	ORDER BY count(id) desc;

	Sort (cost=2667410.55..2667410.69 rows=55 width=66) (actual time=148570.492..148570.504 rows=216 loops=1)
	Sort Key: (count(id)) DESC
	Sort Method: quicksort Memory: 38kB
	-> HashAggregate (cost=2667408.41..2667408.96 rows=55 width=66) (actual time=148570.353..148570.416 rows=216 loops=1)

	CREATE TABLE
	events
	(
	account_id CHARACTER VARYING(50),
	id CHARACTER VARYING(56) NOT NULL,
	evt_id TEXT,
	evt_name CHARACTER VARYING(100),
	evt_type CHARACTER VARYING(50),
	ref_id TEXT,
	ts TIMESTAMP(6) WITHOUT TIME ZONE,

	-- Single index on ts field

	explain analyse
	SELECT count(id), domain, hour
	FROM events
	WHERE ts between '2015-09-01' and '2015-09-04' and domain in ('hackersome.com')
	GROUP BY domain,hour

	--HashAggregate (cost=1977121.02..1977126.98 rows=596 width=75) (actual time=6861.694..6861.722 rows=72 loops=1)
	-- Group Key: domain, hour

	def get_val(inst, fields):
	if isinstance(fields, basestring):
	return getattr(inst,fields, "")

	next_field = fields.pop(0)

	next_value = getattr(inst, next_field, "")

	if len(fields)>0:
	return get_val(next_value, fields)

	[program:zookeeper]
	killgroup=true
	directory=/opt/zookeeper
	command=java -Dzookeeper.log.dir=. -Dzookeeper.root.logger=INFO,CONSOLE -cp /opt/zookeeper/bin/../build/classes:/opt/zookeeper/bin/../build/lib/.jar:/opt/zookeeper/bin/../lib/slf4j-log4j12-1.6.1.jar:/opt/zookeeper/bin/../lib/slf4j-api-1.6.1.jar:/opt/zookeeper/bin/../lib/netty-3.7.0.Final.jar:/opt/zookeeper/bin/../lib/log4j-1.2.16.jar:/opt/zookeeper/bin/../lib/jline-0.9.94.jar:/opt/zookeeper/bin/../zookeeper-3.4.6.jar:/opt/zookeeper/bin/../src/java/lib/.jar:/opt/zookeeper/bin/../conf: -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.local.only=false org.apache.zookeeper.server.quorum.QuorumPeerMain /opt/zookeeper/bin/../conf/zoo.cfg
	autostart=true
	autorestart=true
	redirect_stderr=true
	stopsignal=INT
	stdout_logfile=/var/log/syslog

	import boto
	from boto import regioninfo
	from boto import ec2

	ACCESS_KEY_ID = ''
	SECRET_ACCESS_KEY = ''

	elb_region = boto.regioninfo.RegionInfo(
	name='eu-west-1',
	endpoint='elasticloadbalancing.eu-west-1.amazonaws.com')