Create a single text file, delimited by control characters, that packs many JSON files together. Aim for a size of about 500 MB to 1 GB.
CREATE TABLE IF NOT EXISTS json_files
(
`filename` STRING,
`modified_dt` DOUBLE,
`content_size` INT,
`content` STRING
)
PARTITIONED BY (`pr_day` STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0001'
LINES TERMINATED BY '\u0002'
STORED AS TEXTFILE;
Example file layout (500 MB - 1 GB); \u0001 separates fields and \u0002 ends each record, matching the table definition:
<filename>\u0001<modified_dt>\u0001<content_size>\u0001<content json>\u0002
<filename>\u0001<modified_dt>\u0001<content_size>\u0001<content json>\u0002
<filename>\u0001<modified_dt>\u0001<content_size>\u0001<content json>\u0002
<filename>\u0001<modified_dt>\u0001<content_size>\u0001<content json>\u0002
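To make the layout concrete, here is a minimal Python sketch of how one record could be built and how a concatenated blob could be split back into records. The function names encode_record and decode_records are hypothetical, and storing content_size as the UTF-8 byte length is an assumption; the source only says the column is an INT.

```python
FIELD_SEP = "\u0001"   # matches FIELDS TERMINATED BY in the table definition
RECORD_SEP = "\u0002"  # matches LINES TERMINATED BY in the table definition


def encode_record(filename, modified_dt, content):
    """Build one record: filename, mtime, content size, JSON text."""
    # Assumption: content_size is the UTF-8 byte length of the JSON text.
    content_size = len(content.encode("utf-8"))
    fields = [filename, repr(float(modified_dt)), str(content_size), content]
    return FIELD_SEP.join(fields) + RECORD_SEP


def decode_records(blob):
    """Split a concatenated blob into (filename, mtime, size, content) tuples."""
    records = []
    for raw in blob.split(RECORD_SEP):
        if not raw:
            continue  # skip the empty piece after the final terminator
        filename, modified_dt, content_size, content = raw.split(FIELD_SEP, 3)
        records.append((filename, float(modified_dt), int(content_size), content))
    return records
```

Because the record terminator is \u0002 rather than a newline, JSON content that spans several lines survives the round trip. Valid JSON cannot contain raw \u0001 or \u0002 characters (control characters must be escaped inside strings), so the delimiters should not collide with the content.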
Tip 1: concat.py can stream the concatenated file to stdout, so you can pipe it into hdfs dfs -put - <destination> to write it to HDFS directly (and gzip it on the way).
Copy the example and adapt it to your own files.
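The streaming idea from Tip 1 could look roughly like the sketch below. This is not the actual concat.py, only an illustration under assumptions: the helper write_records is hypothetical, input files are passed as command-line arguments, filename is taken as the base name, and the output is gzipped inside the script rather than by a separate gzip process.

```python
import gzip
import os
import sys

FIELD_SEP = b"\x01"   # \u0001, field delimiter
RECORD_SEP = b"\x02"  # \u0002, record terminator


def write_records(paths, out):
    """Write one delimited record per input file to the binary stream `out`."""
    count = 0
    for path in paths:
        with open(path, "rb") as f:
            content = f.read()
        fields = [
            os.path.basename(path).encode("utf-8"),
            repr(os.path.getmtime(path)).encode("ascii"),
            str(len(content)).encode("ascii"),  # size in bytes (assumption)
            content,
        ]
        out.write(FIELD_SEP.join(fields) + RECORD_SEP)
        count += 1
    return count


def main(argv):
    if len(argv) < 2:
        print("usage: concat_sketch.py FILE [FILE ...]", file=sys.stderr)
        return
    # Gzip on the fly so whatever reads stdout receives a .gz stream.
    with gzip.GzipFile(fileobj=sys.stdout.buffer, mode="wb") as gz:
        write_records(argv[1:], gz)


if __name__ == "__main__":
    main(sys.argv)
```

A pipeline such as python concat_sketch.py *.json | hdfs dfs -put - /some/hdfs/dir/part-0001.gz would then upload the stream without a local temporary file; the script name and the HDFS path here are placeholders. Reading each file fully into memory keeps the sketch short and is reasonable as long as individual JSON files are small.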