This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
#Want to try this as Hive UDF | |
#1. Compile using Jython (never done before) | |
#2. Re-write using pure Java (don't know Java) | |
#3. Re-write using Scala (barely done first week of Coursera FP class so far | |
#Hoping to return this as a single column first | |
#Eventually want to build a Hive UDTF, to make one row containing dates into multiple rows, one per date | |
from datetime import * |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
--#### Assume sales Hive table partitioned by day_id ####-- | |
--Full Table Scan | |
select | |
employees.id, | |
b.sales | |
from employees | |
left join sales on (employees.id = sales.employee_id) | |
where day_id between '2014-03-01' and '2014-05-31'; |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
--Hive expects that you want to calculate your percentiles by account_number and sales | |
--This code will generate an error about a missing GROUP BY statement | |
select | |
account_number, | |
sales, | |
CASE WHEN sales > percentile_approx(sales, .9) THEN 1 ELSE 0 END as top10pct_sales | |
from sales; |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
--Hive expects that you want to calculate your percentiles by account_number and sales | |
--This code will generate an error about a missing GROUP BY statement | |
select | |
account_number, | |
sales, | |
CASE WHEN sales > a.sales_90th_percentile from sales THEN 1 ELSE 0 END as top10pct_sales | |
from sales | |
cross join (select percentile_approx(sales, .9) as sales_90th_percentile from sales) a; |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
select | |
histogram_numeric(salary, 20) as salary_hist | |
from | |
sample_08; | |
--Results | |
[{"x":23507.68627450983,"y":255.0},{"x":31881.7647058824,"y":340.0},{"x":39824.11498257844,"y":287.0},{"x":47615.58011049725,"y":181.0},{"x":55667.01219512195,"y":164.0},{"x":59952.499999999985,"y":8.0},{"x":66034.67153284674,"y":137.0},{"x":75642.31707317074,"y":82.0},{"x":82496.13636363638,"y":44.0},{"x":91431.66666666667,"y":60.0},{"x":100665.71428571428,"y":21.0},{"x":107326.66666666667,"y":15.0},{"x":121248.74999999999,"y":16.0},{"x":142070.0,"y":2.0},{"x":153896.6666666667,"y":6.0},{"x":162310.0,"y":6.0},{"x":169810.0,"y":2.0},{"x":176740.0,"y":2.0},{"x":193925.0,"y":8.0},{"x":206770.0,"y":2.0}] |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
SELECT | |
CAST(hist.x as int) as bin_center, | |
CAST(hist.y as bigint) as bin_height | |
FROM (select | |
histogram_numeric(salary, 20) as salary_hist | |
from | |
sample_08) a | |
LATERAL VIEW explode(salary_hist) exploded_table as hist; | |
--Results |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
using ODBC | |
#Connect to Hadoop cluster via Hive (pre-defined Windows DSN in ODBC Manager) | |
hiveconn = ODBC.connect("Production hiveserver2"; usr="your-user-name", pwd="your-password-here") | |
#Clean data, return results directly to file | |
#Data returned with have origin of flight, flight takeoff, flight landing and elapsed time | |
hive_query_string = | |
"select | |
origin, |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
#Connect to Aster (pre-defined Windows DSN in ODBC Manager) | |
asterconn = ODBC.connect("aster01"; usr="your-user-name", pwd="your-password") | |
#Create table to hold airline results | |
create_airline_table_statement = | |
"create table ebi_temp.airline | |
(origin varchar, | |
flight_takeoff_datetime_origin timestamp, | |
flight_landing_datetime_origin timestamp, | |
actualelapsedtime int, |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
#Normalize timestamps from local time to UTC time | |
aster_view_string = " | |
create view temp.vw_airline_times_utc as | |
select | |
row_number() over(order by flight_takeoff_datetime_origin) as unique_flight_number, | |
origin, | |
flight_takeoff_datetime_origin, | |
flight_landing_datetime_origin, | |
flight_takeoff_datetime_origin - (INTERVAL '1 hour' * timezone) as flight_takeoff_datetime_utc, | |
flight_landing_datetime_origin - (INTERVAL '1 hour' * timezone) as flight_landing_datetime_utc, |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
#Calculate the number of flights per hour per day | |
flights_query = " | |
select | |
calendar_date, | |
hour_utc, | |
sum(1) as num_flights | |
from temp.airline_burst_hour | |
group by 1,2 | |
order by 1,2;" |