Skip to content

Instantly share code, notes, and snippets.

View randyzwitch's full-sized avatar

Randy Zwitch randyzwitch

View GitHub Profile
@randyzwitch
randyzwitch / udf.py
Last active August 29, 2015 14:01
Building Hive UDF
#Want to try this as Hive UDF
#1. Compile using Jython (never done before)
#2. Re-write using pure Java (don't know Java)
#3. Re-write using Scala (barely done first week of Coursera FP class so far
#Hoping to return this as a single column first
#Eventually want to build a Hive UDTF, to make one row containing dates into multiple rows, one per date
from datetime import *
@randyzwitch
randyzwitch / hive-predicate-pushdown.sql
Last active August 29, 2015 14:02
Hive Full Table Scan vs. Predicate Pushdown on Outer Join
--#### Assume sales Hive table partitioned by day_id ####--
--Full Table Scan
select
employees.id,
b.sales
from employees
left join sales on (employees.id = sales.employee_id)
where day_id between '2014-03-01' and '2014-05-31';
@randyzwitch
randyzwitch / percentile-hive-wrong-way.sql
Created June 12, 2014 15:59
Percentiles in Hive - Wrong Way
--Hive expects that you want to calculate your percentiles by account_number and sales
--This code will generate an error about a missing GROUP BY statement
select
account_number,
sales,
CASE WHEN sales > percentile_approx(sales, .9) THEN 1 ELSE 0 END as top10pct_sales
from sales;
@randyzwitch
randyzwitch / percentile-hive-right-way.sql
Created June 12, 2014 16:03
Hive Percentile using Cross Join
--Hive expects that you want to calculate your percentiles by account_number and sales
--This code will generate an error about a missing GROUP BY statement
select
account_number,
sales,
CASE WHEN sales > a.sales_90th_percentile from sales THEN 1 ELSE 0 END as top10pct_sales
from sales
cross join (select percentile_approx(sales, .9) as sales_90th_percentile from sales) a;
@randyzwitch
randyzwitch / hive-histogram.sql
Last active August 29, 2015 14:02
Hive Histogram
select
histogram_numeric(salary, 20) as salary_hist
from
sample_08;
--Results
[{"x":23507.68627450983,"y":255.0},{"x":31881.7647058824,"y":340.0},{"x":39824.11498257844,"y":287.0},{"x":47615.58011049725,"y":181.0},{"x":55667.01219512195,"y":164.0},{"x":59952.499999999985,"y":8.0},{"x":66034.67153284674,"y":137.0},{"x":75642.31707317074,"y":82.0},{"x":82496.13636363638,"y":44.0},{"x":91431.66666666667,"y":60.0},{"x":100665.71428571428,"y":21.0},{"x":107326.66666666667,"y":15.0},{"x":121248.74999999999,"y":16.0},{"x":142070.0,"y":2.0},{"x":153896.6666666667,"y":6.0},{"x":162310.0,"y":6.0},{"x":169810.0,"y":2.0},{"x":176740.0,"y":2.0},{"x":193925.0,"y":8.0},{"x":206770.0,"y":2.0}]
SELECT
CAST(hist.x as int) as bin_center,
CAST(hist.y as bigint) as bin_height
FROM (select
histogram_numeric(salary, 20) as salary_hist
from
sample_08) a
LATERAL VIEW explode(salary_hist) exploded_table as hist;
--Results
@randyzwitch
randyzwitch / hive-odbc.jl
Created June 24, 2014 00:35
Hive ODBC.jl
using ODBC
#Connect to Hadoop cluster via Hive (pre-defined Windows DSN in ODBC Manager)
hiveconn = ODBC.connect("Production hiveserver2"; usr="your-user-name", pwd="your-password-here")
#Clean data, return results directly to file
#Data returned with have origin of flight, flight takeoff, flight landing and elapsed time
hive_query_string =
"select
origin,
#Connect to Aster (pre-defined Windows DSN in ODBC Manager)
asterconn = ODBC.connect("aster01"; usr="your-user-name", pwd="your-password")
#Create table to hold airline results
create_airline_table_statement =
"create table ebi_temp.airline
(origin varchar,
flight_takeoff_datetime_origin timestamp,
flight_landing_datetime_origin timestamp,
actualelapsedtime int,
@randyzwitch
randyzwitch / aster-julia-2.jl
Last active August 29, 2015 14:02
Aster Julia 2
#Normalize timestamps from local time to UTC time
aster_view_string = "
create view temp.vw_airline_times_utc as
select
row_number() over(order by flight_takeoff_datetime_origin) as unique_flight_number,
origin,
flight_takeoff_datetime_origin,
flight_landing_datetime_origin,
flight_takeoff_datetime_origin - (INTERVAL '1 hour' * timezone) as flight_takeoff_datetime_utc,
flight_landing_datetime_origin - (INTERVAL '1 hour' * timezone) as flight_landing_datetime_utc,
@randyzwitch
randyzwitch / aster-julia-gadfly.jl
Created June 24, 2014 01:10
Aster Julia Gadfly
#Calculate the number of flights per hour per day
flights_query = "
select
calendar_date,
hour_utc,
sum(1) as num_flights
from temp.airline_burst_hour
group by 1,2
order by 1,2;"