
Shohei Okumiya okumin

@okumin
okumin / main.md
Last active October 19, 2023 15:11

Overview

In HIVE-12679, we have been trying to introduce a feature to make IMetaStoreClient pluggable. This document is a summary of the past discussions.

Problem statement

Apache Hive hardcodes the implementation of IMetaStoreClient, assuming it always talks to Hive Metastore (HMS). The vast majority of Hive users have no problem with this because they use HMS as their data catalog. However, some data platforms and their users use alternative services as data catalogs.

  • Amazon EMR provides an option to use AWS Glue Data Catalog
  • Treasure Data deploys Apache Hive integrated with their own in-house data catalog

How PlanMapper works

Purpose

PlanMapper helps Hive regenerate better query plans using runtime stats. It groups entities that are semantically the same. For example, a Calcite RelNode that expresses WHERE id = 1 could be equivalent to a Hive FilterOperator. A CommonMergeJoinOperator could be linked to the MapJoinOperator converted from that CommonMergeJoinOperator.

Groups generated by PlanMapper express such relationships so that it can propagate the final runtime stats to the RelNodes or Operators in each step. See https://cwiki.apache.org/confluence/display/Hive/Query+ReExecution for details.
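
The grouping idea above can be sketched in a few lines of Java. This is a minimal illustration, not Hive's actual PlanMapper: the class `EquivalenceGroups` and its methods `link`, `recordRowCount`, and `rowCountFor` are hypothetical names, and the runtime stats are reduced to a single row count.

```java
import java.util.ArrayList;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of grouping equivalent plan entities; not Hive's PlanMapper API.
class EquivalenceGroups {
    // Maps each plan entity (compared by identity) to the group that contains it.
    private final Map<Object, List<Object>> groupOf = new IdentityHashMap<>();
    // Runtime stats per group; a single row count stands in for real stats.
    private final Map<List<Object>, Long> rowCounts = new IdentityHashMap<>();

    private List<Object> groupFor(Object entity) {
        List<Object> group = groupOf.get(entity);
        if (group == null) {
            group = new ArrayList<>();
            group.add(entity);
            groupOf.put(entity, group);
        }
        return group;
    }

    // Declares that a and b express the same step of the plan and merges their groups.
    void link(Object a, Object b) {
        List<Object> ga = groupFor(a);
        List<Object> gb = groupFor(b);
        if (ga == gb) {
            return;
        }
        for (Object member : gb) {
            ga.add(member);
            groupOf.put(member, ga);
        }
        // Carry over stats from the absorbed group if the surviving group has none.
        Long stats = rowCounts.remove(gb);
        if (stats != null) {
            rowCounts.putIfAbsent(ga, stats);
        }
    }

    // Records stats observed at runtime for one entity; the whole group shares them.
    void recordRowCount(Object entity, long rows) {
        rowCounts.put(groupFor(entity), rows);
    }

    // Returns the stats visible to an entity through its group, or null if none.
    Long rowCountFor(Object entity) {
        return rowCounts.get(groupFor(entity));
    }
}
```

In this sketch, linking a RelNode to the operator derived from it means that stats recorded on the operator at runtime become visible from the RelNode when the query is planned again.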

Flow

@okumin
okumin / 1.2.0-2.0.0.md
Last active August 22, 2024 10:09
Hive keyword changes

Summary.

  • Non-reserved
    • Removed: HOLD_DDLTIME, IGNORE, NO_DROP, OFFLINE, PROTECTION, READONLY, REGEXP, RLIKE
    • Added: AUTOCOMMIT, ISOLATION, LEVEL, OFFSET, SNAPSHOT, TRANSACTION, WORK, WRITE
  • Reserved
    • Removed:
    • Added: COMMIT, ONLY, REGEXP, RLIKE, ROLLBACK, START

The list of changed keywords.

@okumin
okumin / main.md
Last active September 3, 2024 03:21
Hive + Iceberg split

Reproduction

I used Hive 4.0.0.

Create a table with a big Parquet file

set tez.grouping.split-count=1;
CREATE TABLE web_sales_parquet STORED AS PARQUET AS SELECT * FROM web_sales;

Hive's TIMESTAMP

Types

INT96

Impala started using INT96 for timestamps, and Hive and Spark followed Impala for compatibility. This was discussed on the mailing list, and PARQUET-323 is a related ticket. Timezone handling has a somewhat more complicated history.

TIMESTAMP logical type over INT64

% diff --ignore-space-change ~/Downloads/hive_metastore.thrift-313 ~/Downloads/hive_metastore.thrift-400
31a32,35
> const byte ACCESSTYPE_NONE = 1;
> const byte ACCESSTYPE_READONLY = 2;
> const byte ACCESSTYPE_WRITEONLY = 4;
> const byte ACCESSTYPE_READWRITE = 8;
43a48,55
> // Key-value store to be used with selected
> // Metastore APIs (create, alter methods).
> // The client can pass environment properties / configs that can be

Original Method

  1. HiveMetaStoreClient#createTable: Set the default catalog, processor capabilities, and so on, then call hook.preCreateTable
  2. SessionHiveMetaStoreClient#create_table: Bail out and create a temporary table if the given table is temporary
  3. HiveMetaStoreClient#create_table: Issue a Thrift request
  4. HiveMetaStoreClient#createTable: Call hook.commitCreateTable
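
The call order in the steps above can be sketched as follows. The classes `BaseClient` and `SessionClient` are hypothetical stand-ins for HiveMetaStoreClient and SessionHiveMetaStoreClient. They mirror only the order of the four steps, not the real signatures, and the trace strings are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins that mirror the call order only, not Hive's real classes.
class CreateTableFlow {
    static class BaseClient {
        final List<String> trace = new ArrayList<>();

        // Steps 1 and 4: the hooks wrap the overridable create_table step.
        void createTable(String name, boolean temporary) {
            trace.add("preCreateTable");
            create_table(name, temporary);
            trace.add("commitCreateTable");
        }

        // Step 3: stands in for issuing the Thrift request.
        void create_table(String name, boolean temporary) {
            trace.add("thrift:create_table");
        }
    }

    static class SessionClient extends BaseClient {
        // Step 2: temporary tables are handled locally and never reach Thrift.
        @Override
        void create_table(String name, boolean temporary) {
            if (temporary) {
                trace.add("createTempTable");
                return;
            }
            super.create_table(name, temporary);
        }
    }

    // Runs the flow once and returns the recorded order of steps.
    static List<String> run(boolean temporary) {
        SessionClient client = new SessionClient();
        client.createTable("t", temporary);
        return client.trace;
    }
}
```

The sketch shows why the layering is awkward: the subclass intercepts a step in the middle of the base class's method, so the control flow bounces between the two classes.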

New Method

From the outermost wrapper,