Shohei Okumiya okumin
% diff --ignore-space-change ~/Downloads/hive_metastore.thrift-313 ~/Downloads/hive_metastore.thrift-400
31a32,35
> const byte ACCESSTYPE_NONE = 1;
> const byte ACCESSTYPE_READONLY = 2;
> const byte ACCESSTYPE_WRITEONLY = 4;
> const byte ACCESSTYPE_READWRITE = 8;
43a48,55
> // Key-value store to be used with selected
> // Metastore APIs (create, alter methods).
> // The client can pass environment properties / configs that can be

Hive's TIMESTAMP

Types

INT96

Impala started using INT96 for timestamps, and Hive and Spark followed Impala for compatibility. This is discussed on the mailing list, and PARQUET-323 is a related ticket. As for time zones, the context is a bit more complicated.
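To make the INT96 layout concrete, here is a minimal sketch of decoding one value. It assumes the commonly used layout: 8 bytes of nanoseconds within the day followed by 4 bytes of Julian day number, both little-endian. The function name `decode_int96_timestamp` is illustrative and not part of any library. The result is a naive datetime, because which time zone the value was normalized to depends on the writer.

```python
import struct
from datetime import datetime, timedelta

# Julian day number of 1970-01-01, used to shift Julian days to Unix days.
JULIAN_DAY_OF_UNIX_EPOCH = 2_440_588


def decode_int96_timestamp(raw: bytes) -> datetime:
    """Decode a 12-byte INT96 timestamp into a naive datetime.

    Assumed layout: int64 nanoseconds-of-day, then int32 Julian day,
    both little-endian. Sub-microsecond precision is truncated because
    Python's datetime only keeps microseconds.
    """
    if len(raw) != 12:
        raise ValueError("an INT96 value is exactly 12 bytes")
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    days_since_epoch = julian_day - JULIAN_DAY_OF_UNIX_EPOCH
    return datetime(1970, 1, 1) + timedelta(
        days=days_since_epoch, microseconds=nanos_of_day // 1000
    )


# One day after the epoch, plus one hour expressed in nanoseconds.
sample = struct.pack("<qi", 3_600_000_000_000, 2_440_589)
decoded = decode_int96_timestamp(sample)
```

Whether the decoded wall-clock value should be read as UTC or as local time is exactly the time zone question mentioned above, so the sketch deliberately does not attach a zone.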

TIMESTAMP logical type over INT64

okumin / main.md
Last active September 3, 2024 03:21
Hive + Iceberg split

Reproduction

I used Hive 4.0.0.

Create a table with a big Parquet file

set tez.grouping.split-count=1;
CREATE TABLE web_sales_parquet STORED AS PARQUET AS SELECT * FROM web_sales;
okumin / 1.2.0-2.0.0.md
Last active August 22, 2024 10:09
Hive keyword changes

Summary.

  • Non-reserved
    • Removed: HOLD_DDLTIME, IGNORE, NO_DROP, OFFLINE, PROTECTION, READONLY, REGEXP, RLIKE
    • Added: AUTOCOMMIT, ISOLATION, LEVEL, OFFSET, SNAPSHOT, TRANSACTION, WORK, WRITE
  • Reserved
    • Removed:
    • Added: COMMIT, ONLY, REGEXP, RLIKE, ROLLBACK, START
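As a small illustration of how to read the summary above, the following sketch uses plain set operations on the lists exactly as given. It separates keywords that moved from non-reserved to reserved from those that were removed from the non-reserved list without reappearing as reserved. The variable names are illustrative.

```python
# Keyword lists copied from the summary above.
NON_RESERVED_REMOVED = {
    "HOLD_DDLTIME", "IGNORE", "NO_DROP", "OFFLINE",
    "PROTECTION", "READONLY", "REGEXP", "RLIKE",
}
RESERVED_ADDED = {"COMMIT", "ONLY", "REGEXP", "RLIKE", "ROLLBACK", "START"}

# Removed from non-reserved and added to reserved: these changed category.
became_reserved = sorted(NON_RESERVED_REMOVED & RESERVED_ADDED)

# Removed from non-reserved and not re-added as reserved.
no_longer_listed = sorted(NON_RESERVED_REMOVED - RESERVED_ADDED)

# Newly reserved keywords that were not previously non-reserved.
newly_reserved = sorted(RESERVED_ADDED - NON_RESERVED_REMOVED)
```

According to these lists, REGEXP and RLIKE are the only keywords that switched from non-reserved to reserved.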

The list of changed keywords.

How PlanMapper works

Purpose

PlanMapper helps Hive regenerate better query plans using runtime stats. It groups entities that are semantically the same. For example, a Calcite RelNode expressing WHERE id = 1 could be equivalent to a Hive FilterOperator. A CommonMergeJoinOperator could be linked to the MapJoinOperator converted from it.

Groups generated by PlanMapper express such relationships so that it can propagate the final runtime stats to the RelNodes or Operators in each step. See https://cwiki.apache.org/confluence/display/Hive/Query+ReExecution
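The grouping idea can be sketched with a small union-find structure. This is not Hive's PlanMapper implementation; the class `PlanGroups` and its methods are hypothetical and only illustrate how linking equivalent nodes lets runtime stats recorded on one node be read from any other member of its group.

```python
class PlanGroups:
    """Hypothetical sketch: equivalence groups of plan nodes sharing stats."""

    def __init__(self):
        self._parent = {}  # node -> parent node in the union-find forest
        self._rows = {}    # group root -> observed runtime row count

    def _find(self, node):
        # Return the representative of the node's group (with path halving).
        self._parent.setdefault(node, node)
        while self._parent[node] != node:
            self._parent[node] = self._parent[self._parent[node]]
            node = self._parent[node]
        return node

    def link(self, a, b):
        # Declare that a and b are semantically the same plan entity.
        root_a, root_b = self._find(a), self._find(b)
        if root_a == root_b:
            return
        self._parent[root_b] = root_a
        if root_b in self._rows:
            # Keep stats already recorded for the absorbed group.
            self._rows.setdefault(root_a, self._rows.pop(root_b))

    def record_runtime_rows(self, node, rows):
        self._rows[self._find(node)] = rows

    def runtime_rows(self, node):
        return self._rows.get(self._find(node))


groups = PlanGroups()
groups.link("RelNode: WHERE id = 1", "FilterOperator")
groups.link("CommonMergeJoinOperator", "MapJoinOperator")

# Stats observed on the executed operator become visible on the RelNode.
groups.record_runtime_rows("FilterOperator", 42)
filter_rows = groups.runtime_rows("RelNode: WHERE id = 1")
join_rows = groups.runtime_rows("MapJoinOperator")
```

Here the row count recorded on the operator is readable through the linked RelNode, while the join group has no stats yet.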

Flow

okumin / main.md
Last active October 19, 2023 15:11

Overview

In HIVE-12679, we have been trying to introduce a feature to make IMetaStoreClient pluggable. This document is a summary of the past discussions.

Problem statement

Apache Hive hardcodes the implementation of IMetaStoreClient, assuming it always talks to Hive Metastore. 99% of Hive users don't have any problems because they use HMS as a data catalog. However, some data platforms and their users use alternative services as data catalogs.

  • Amazon EMR provides an option to use AWS Glue Data Catalog
  • Treasure Data deploys Apache Hive integrated with their own in-house data catalog
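What "pluggable" could mean in practice is sketched below: the client implementation is chosen from configuration instead of being hardcoded. This is only an illustration, not the design discussed in HIVE-12679. The configuration key `metastore.client.impl`, the class names, and the registry are all hypothetical.

```python
from typing import Callable, Dict, List, Protocol


class MetaStoreClient(Protocol):
    """Hypothetical, minimal stand-in for the IMetaStoreClient interface."""

    def get_all_databases(self) -> List[str]: ...


class HmsClient:
    """Placeholder for a client that would talk to Hive Metastore."""

    def get_all_databases(self) -> List[str]:
        return ["default"]


class InHouseCatalogClient:
    """Placeholder for a client backed by an alternative data catalog."""

    def get_all_databases(self) -> List[str]:
        return ["catalog_db"]


# Hypothetical registry mapping a configured name to a client factory.
CLIENT_FACTORIES: Dict[str, Callable[[], MetaStoreClient]] = {
    "hms": HmsClient,
    "in_house": InHouseCatalogClient,
}


def create_client(conf: Dict[str, str]) -> MetaStoreClient:
    # "metastore.client.impl" is an assumed key, not a real Hive setting.
    name = conf.get("metastore.client.impl", "hms")
    try:
        return CLIENT_FACTORIES[name]()
    except KeyError:
        raise ValueError(f"unknown metastore client implementation: {name}")


default_client = create_client({})
custom_client = create_client({"metastore.client.impl": "in_house"})
```

With no configuration the sketch falls back to the HMS-backed client, which matches the default behavior most users rely on.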
okumin / keybase.md
Last active November 18, 2022 13:37

Keybase proof

I hereby claim:

  • I am okumin on github.
  • I am okumin (https://keybase.io/okumin) on keybase.
  • I have a public key ASBk4J-iYG52Dmu-OCTE37-7M9-YuYo6hxordcz7zr1QfQo

To claim this, I am signing this object:

Summary

MySQL

| / | persistAsync | persist | recover |
|---|---|---|---|
| akka-2.3 | 1547 | 7964 | 450 |
| akka-2.4-rc1 | 1504 | 9051 | 390 |
| akka-2.4-batched | 702 | 8806 | 410 |
| akka-2.4-seq | 615 | 11711 | 402 |
def findMofu(id: Int): Future[CacheError | IOError | NotFound, Mofu] = ???
def createMofu(mofu: Mofu): Future[CacheError | IOError | DuplicateError, Mofu] = ???

val result = findMofu(5).recoverWith {
  case NotFound =>
    createMofu(Mofu(5)).recoverWith {
      case DuplicateError => UnknownError
    }
}