Shohei Okumiya okumin

Original Method

HiveMetaStoreClient#createTable: Set the default catalog, set processor capabilities and so on, and call hook.preCreateTable
SessionHiveMetaStoreClient#create_table: Bail out and create a temporary table if the given table is temporary
HiveMetaStoreClient#create_table: Issue a Thrift request
HiveMetaStoreClient#createTable: Call hook.commitCreateTable
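
The call order above can be modeled with a small sketch. This is a toy Python model, not Hive's Java code: the class names ending in `Sketch`, the `trace` list, and the dict-based table are hypothetical, and calling the commit hook after the temporary-table branch is an assumption of this sketch.

```python
class HiveMetaStoreClientSketch:
    """Toy stand-in for HiveMetaStoreClient that records the order of steps."""

    def __init__(self):
        self.trace = []

    def createTable(self, table):
        # Step 1: preparation and the pre-create hook.
        self.trace.append("set default catalog and processor capabilities")
        self.trace.append("hook.preCreateTable")
        # Steps 2 and 3: dispatched to create_table, which a subclass may override.
        self.create_table(table)
        # Step 4: the commit hook (assumed here to run on both branches).
        self.trace.append("hook.commitCreateTable")

    def create_table(self, table):
        self.trace.append("issue Thrift request")


class SessionHiveMetaStoreClientSketch(HiveMetaStoreClientSketch):
    """Toy stand-in for SessionHiveMetaStoreClient."""

    def create_table(self, table):
        if table.get("temporary"):
            # Bail out: the temporary table is handled without a Thrift request.
            self.trace.append("create temporary table")
            return
        super().create_table(table)


permanent = SessionHiveMetaStoreClientSketch()
permanent.createTable({"name": "t1", "temporary": False})

temporary = SessionHiveMetaStoreClientSketch()
temporary.createTable({"name": "t2", "temporary": True})
```

In this model, `permanent.trace` contains the Thrift step and `temporary.trace` does not.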

New Method

From the outmost wrapper,

Hive's TIMESTAMP

Types

INT96

Impala started using INT96 for timestamps, and Hive and Spark followed Impala for compatibility. This is the discussion on the mailing list, and PARQUET-323 is a related ticket. The time zone handling has a somewhat more complicated background.
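
As background that the note itself does not spell out: an INT96 timestamp is 12 bytes, where the first 8 bytes hold the nanoseconds within the day and the last 4 bytes hold the Julian day number, both little-endian. A minimal decoding sketch follows. The function name is made up for illustration, and the result is a naive datetime because the time zone interpretation is deliberately left out.

```python
import struct
from datetime import datetime, timedelta

# Julian day number of 1970-01-01.
JULIAN_DAY_OF_UNIX_EPOCH = 2440588


def decode_int96_timestamp(raw: bytes) -> datetime:
    """Decode a 12-byte INT96 value into a naive datetime (no time zone applied)."""
    if len(raw) != 12:
        raise ValueError("an INT96 value must be exactly 12 bytes")
    # "<qi": little-endian, 8-byte nanoseconds of day, then 4-byte Julian day.
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    days_since_epoch = julian_day - JULIAN_DAY_OF_UNIX_EPOCH
    # datetime has microsecond precision, so sub-microsecond digits are dropped.
    return datetime(1970, 1, 1) + timedelta(
        days=days_since_epoch, microseconds=nanos_of_day // 1000
    )


# One day and one hour after the Unix epoch, built by hand for illustration.
sample = struct.pack("<qi", 3600 * 10**9, JULIAN_DAY_OF_UNIX_EPOCH + 1)
decoded = decode_int96_timestamp(sample)
```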

TIMESTAMP logical type over INT64

Reproduction

I used Hive 4.0.0.

Create a table with a big Parquet file

set tez.grouping.split-count=1;
CREATE TABLE web_sales_parquet STORED AS PARQUET AS SELECT * FROM web_sales;

Summary.

Non-reserved
- Removed: HOLD_DDLTIME, IGNORE, NO_DROP, OFFLINE, PROTECTION, READONLY, REGEXP, RLIKE
- Added: AUTOCOMMIT, ISOLATION, LEVEL, OFFSET, SNAPSHOT, TRANSACTION, WORK, WRITE
Reserved
- Removed:
- Added: COMMIT, ONLY, REGEXP, RLIKE, ROLLBACK, START

The list of changed keywords.

How PlanMapper works

Purpose

PlanMapper helps Hive regenerate better query plans using runtime stats. It groups entities that are semantically the same. For example, a Calcite RelNode expressing WHERE id = 1 could be equivalent to a Hive FilterOperator. A CommonMergeJoinOperator could be linked to the MapJoinOperator converted from it.

Groups generated by PlanMapper express such relationships so that the final runtime stats can be propagated to the RelNodes or Operators in each step. See https://cwiki.apache.org/confluence/display/Hive/Query+ReExecution
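
The grouping idea can be illustrated with a toy model. This is a simplified Python sketch and not Hive's actual PlanMapper API: the class and method names, the string labels standing in for RelNodes and Operators, and the `{"rows": 42}` stats value are all hypothetical.

```python
class Group:
    """A set of plan entities considered semantically the same."""

    def __init__(self):
        self.members = set()
        self.runtime_stats = None


class PlanMapperSketch:
    def __init__(self):
        self._groups = {}  # entity -> Group

    def _group_of(self, entity):
        if entity not in self._groups:
            group = Group()
            group.members.add(entity)
            self._groups[entity] = group
        return self._groups[entity]

    def link(self, a, b):
        """Declare that a and b are equivalent by merging their groups."""
        ga, gb = self._group_of(a), self._group_of(b)
        if ga is gb:
            return
        for member in gb.members:
            self._groups[member] = ga
        ga.members |= gb.members
        if ga.runtime_stats is None:
            ga.runtime_stats = gb.runtime_stats

    def record_runtime_stats(self, entity, stats):
        self._group_of(entity).runtime_stats = stats

    def runtime_stats(self, entity):
        """Stats recorded for any member are visible from every member."""
        return self._group_of(entity).runtime_stats


mapper = PlanMapperSketch()
mapper.link("RelNode: Filter(id = 1)", "FilterOperator")
mapper.link("CommonMergeJoinOperator", "MapJoinOperator")
# Stats observed on the executed operator become visible on the linked RelNode.
mapper.record_runtime_stats("FilterOperator", {"rows": 42})
filter_stats = mapper.runtime_stats("RelNode: Filter(id = 1)")
```

Because stats live on the group rather than on a single entity, anything recorded at execution time can be looked up from an earlier planning stage through any linked member.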

Flow

Overview

In HIVE-12679, we have been trying to introduce a feature to make IMetaStoreClient pluggable. This document is a summary of the past discussions.

Problem statement

Apache Hive hardcodes the implementation of IMetaStoreClient, assuming it always talks to Hive Metastore. 99% of Hive users don't have any problems because they use HMS as a data catalog. However, some data platforms and their users use alternative services as data catalogs.

Amazon EMR provides an option to use AWS Glue Data Catalog
Treasure Data deploys Apache Hive integrated with their own in-house data catalog

Keybase proof

I hereby claim:

I am okumin on github.
I am okumin (https://keybase.io/okumin) on keybase.
I have a public key ASBk4J-iYG52Dmu-OCTE37-7M9-YuYo6hxordcz7zr1QfQo

To claim this, I am signing this object:

Test environment

This revision: https://github.com/zookage/zookage/commit/80732dc3c25c623911cab3760f49687702fbf0bb

zookage@client-node-0:~$ beeline
Connecting to jdbc:hive2://hive-hiveserver2:10000/default;password=dummy;user=zookage
Connected to: Apache Hive (version 3.1.2)
Driver: Hive JDBC (version 3.1.2)

Summary

MySQL

/	persistAsync	persist	recover
akka-2.3	1547	7964	450
akka-2.4-rc1	1504	9051	390
akka-2.4-batched	702	8806	410
akka-2.4-seq	615	11711	402

	% diff --ignore-space-change ~/Downloads/hive_metastore.thrift-313 ~/Downloads/hive_metastore.thrift-400
	31a32,35
	> const byte ACCESSTYPE_NONE = 1;
	> const byte ACCESSTYPE_READONLY = 2;
	> const byte ACCESSTYPE_WRITEONLY = 4;
	> const byte ACCESSTYPE_READWRITE = 8;
	43a48,55
	> // Key-value store to be used with selected
	> // Metastore APIs (create, alter methods).
	> // The client can pass environment properties / configs that can be