I used Hive 4.0.0 and created a table backed by a single large Parquet file.
set tez.grouping.split-count=1;
CREATE TABLE web_sales_parquet STORED AS PARQUET AS SELECT * FROM web_sales;
$ hdfs dfs -ls -h /user/hive/warehouse/web_sales_parquet
Found 1 items
-rw-r--r-- 3 zookage hive 1.1 G 2024-09-02 12:29 /user/hive/warehouse/web_sales_parquet/000000_0
As expected, the file was split into multiple InputSplits.
0: jdbc:hive2://hive-hiveserver2:10000/defaul> SELECT * FROM web_sales_parquet WHERE RAND() = 0.0;
...
----------------------------------------------------------------------------------------------
VERTICES MODE STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
----------------------------------------------------------------------------------------------
Map 1 .......... container SUCCEEDED 9 9 0 0 0 0
----------------------------------------------------------------------------------------------
VERTICES: 01/01 [==========================>>] 100% ELAPSED TIME: 27.69 s
----------------------------------------------------------------------------------------------
Next, I created Iceberg manifest files while keeping the big Parquet file in place.
0: jdbc:hive2://hive-hiveserver2:10000/defaul> ALTER TABLE web_sales_parquet SET TBLPROPERTIES ('storage_handler'='org.apache.iceberg.mr.hive.HiveIcebergStorageHandler', 'format-version' = '2');
...
$ hdfs dfs -ls -h /user/hive/warehouse/web_sales_parquet
Found 2 items
-rw-r--r-- 3 zookage hive 1.1 G 2024-09-02 12:29 /user/hive/warehouse/web_sales_parquet/000000_0
drwxr-xr-x - zookage hive 0 2024-09-02 13:50 /user/hive/warehouse/web_sales_parquet/metadata
The same number of tasks (9) was created.
0: jdbc:hive2://hive-hiveserver2:10000/defaul> SELECT * FROM web_sales_parquet WHERE RAND() = 0.0;
...
----------------------------------------------------------------------------------------------
VERTICES MODE STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED
----------------------------------------------------------------------------------------------
Map 1 .......... container SUCCEEDED 9 9 0 0 0 0
----------------------------------------------------------------------------------------------
VERTICES: 01/01 [==========================>>] 100% ELAPSED TIME: 25.58 s
----------------------------------------------------------------------------------------------
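Both runs reported 9 map tasks for the 1.1 G file. As a rough sanity check only, the sketch below computes how many fixed-size splits would cover a file of that size. The 128 MiB split size is an assumption (it is not stated above), the file size is rounded from the `hdfs dfs -ls -h` output, and Tez split grouping may arrive at its count differently.

```shell
# Back-of-the-envelope estimate, not the actual Tez split computation.
file_size_mib=1126     # about 1.1 GiB, rounded from the listing above
split_size_mib=128     # hypothetical split size (assumed, not from the original)

# Ceiling division: number of fixed-size splits needed to cover the file
expected_splits=$(( (file_size_mib + split_size_mib - 1) / split_size_mib ))
echo "$expected_splits"
```

Under these assumptions the estimate is consistent with the 9 tasks seen in both runs, though the rounded file size leaves some slack.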
I can provide a 600GB dataset and the SQL to reproduce the problem.
@okumin