Skip to content

Instantly share code, notes, and snippets.

@SiddheshKukade
Created May 25, 2023 08:56
Show Gist options
  • Save SiddheshKukade/8f3f1da70ede978974dcd46b23c7a243 to your computer and use it in GitHub Desktop.
Save SiddheshKukade/8f3f1da70ede978974dcd46b23c7a243 to your computer and use it in GitHub Desktop.
For DSBDA Hive Assignment

Requirements :

  • Local hive start
cd /usr/local/hive/bin
./hive
to close : exit ./hive
  • flights_info.txt 200K flight records
  • info.text contains table description
year INT,
  month INT,
  day INT,
  ..
  ...
  late_aircraft_delay STRING

Steps:

  1. create dabase sid
  2. use sid
  3. show tables NO OUTPUT
  4. Insert the cols from info.txt here.
 create table first (year INT,
       month INT,
       day INT,
       day_of_week INT,
       dep_time INT,
       crs_dep_time INT,
       arr_time INT,
       crs_arr_time INT,
       unique_carrier STRING,
       flight_num INT,
       tail_num STRING,
       actual_elapsed_time INT,
       crs_elapsed_time INT,
       air_time INT,
       arr_delay INT,
       dep_delay INT,
       origin STRING,
       dest STRING,
       distance INT,
       taxi_in INT,
       taxi_out INT,
       cancelled INT,
       cancellation_code STRING,
       diverted INT,
       carrier_delay STRING,
       weather_delay STRING,
       nas_delay STRING,
       security_delay STRING,
       late_aircraft_delay STRING)
    row format delimited
    fields terminated by ",";

and then load data

hive> load data local inpath "/home/cloudera/Downloads/flight_info.txt" into table first;
  1. check with desc first;
  2. create second table
create table second as select flight_num , origin_dest, arr_time, dep_time,
arr_delay, dep_delay from first;
  1. Check select * from second limit 50;
  2. table creation one

Now answers for questions for second table

1. Find Average departure delay.

select avg(dep_delay) from second;

2. Find Average arrival deplay

select avg(arr_delay) from second;

3. Display all the rows where destination is "TPA"

select * from flights where dest="TPA";

4. Display all the rows where origin is "IND"

select * from flights where origin="IND"   

5. Count total no of flights coming from differnt origins(rows)

select origin, count(flight_num) from second group by origin;

6. Create a index on the second table

create index fl_index on table flights(flight_num)
as 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' 
WITH DEFERRED REBUILD;

show index on second;

7. Count the total number of flights

select count(flight_num) from flights;

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment