The following questions are intended to provide information about a warehouse, for use in building a new dbt adapter. Different data warehouses have pretty different semantics, so some of these questions may be more applicable than others. Where possible, please provide as much detail as you can!
Please copy this questionnaire to a GitHub gist, then fill out answers to the questions to the best of your ability. Where you're done, please post a link to the public gist in the relevant issue. If the issue does not already exist, please create it first.
-
What options are there for connecting to the warehouse? Eg: ODBC, a python module, etc
- PyHive from Dropbox on Github or on PyPI
- PyHive is a collection of Python DB-API and SQLAlchemy interfaces for Presto and Hive.
- ... ^ this looks to be the most common and best recommendation across StackOverflow
- PyHive from Dropbox on Github or on PyPI
-
Is there a consensus around which of the options in #1 is the most featureful/best supported/most mature/etc option?
- Yes, PyHive
-
Does the warehouse support namespaces/schemas/datasets (or similar)?
- Yes, Hive supports
create database | schema
- Yes, Hive supports
-
Can schemas/namespaces be created using SQL? Eg: create schema my_schema
- Yes, Hive supports
create database | schema
- Yes, Hive supports
-
Does the warehouse support logical databases?
- Yes? Hive supports
create database | schema
CREATE SCHEMA
Partial support asdatabase
is used as the equivalent- Hive is a collect of files on distributed disk (HDFS), so a database is a logical contruct for organizing those files
- Yes? Hive supports
-
Does the warehouse support standard-ish SQL? Are there any noteworthy caveats?
- Yes, mostly standard with a few cavests
- These are valid queries
SELECT from_columns FROM table WHERE conditions;
SELECT DISTINCT column_name FROM table;
SELECT owner, COUNT(*) FROM table GROUP BY owner;
SELECT pet.name, comment FROM pet JOIN event ON pet.name = event.name;
- Caveats (from Apache Hive 2.3 supported features)
- Empty grouping sets not supported
- PRIMARY KEY constraints are not enforced
- LIKE enhancements only Partial support Escape characters not supported
- ARRAY_AGG Partial support, collect_list provides similar functionality
- Enhanced UNNEST Partial support LATERAL JOIN provides similar functionality
- Advanced OLAP operations Partial support, PERCENT_RANK, CUME_DIST and ROW_NUMBER supported
-
Are transactions supported?
- Partial Transaction support, Autocommit transaction for INSERT/UPDATE/DELETE/MERGE
-
Can tables be created with create table schema.table as (...)?
- Yes,
CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
- Yes,
-
Can views be created with create view schema.view as (...)?
- Yes,
CREATE VIEW [IF NOT EXISTS] view_name
- Yes,
-
Can tables/views be renamed with alter table {table_name} rename to {new_name}?
- Yes, reference
-
Does the warehouse support insert statements? Any caveats?
- Yes, reference
-
Does the warehouse support delete statements? Any caveats?
- Yes,
- "To simulate the effects of an UPDATE or DELETE statement in other database systems, typically you use INSERT or CREATE TABLE AS SELECT to copy data from one table to another, filtering out or changing the appropriate rows during the copy operation."
- https://www.cloudera.com/documentation/enterprise/5-8-x/topics/impala_dml.html
- Yes,
-
Does the warehouse support update statements? Any caveats?
- Yes,
- "To simulate the effects of an UPDATE or DELETE statement in other database systems, typically you use INSERT or CREATE TABLE AS SELECT to copy data from one table to another, filtering out or changing the appropriate rows during the copy operation."
- https://www.cloudera.com/documentation/enterprise/5-8-x/topics/impala_dml.html
- Yes,
-
Does the warehouse support merge statements? Any caveats?
-
Does the warehouse support drop table/drop view statements? Any caveats?
-
Does the warehouse support truncate statements? Any caveats?
-
Does the warehouse support temporary tables? Any caveats?
-
Can queries be cancelled?
-
Are views bound to the relations they select from? Ie. do drop table statements require a ...cascade argument?
-
Does the warehouse support querying for existing relations (ie. their existence and type)? Is this via an API call or a SELECT statement? Eg: select * from information_schema.tables.
-
Does the warehouse support querying for the the columns in a relation? Is this via an API call or a SELECT statement? Eg: select * from information_schema.columns
-
Can columns be added and removed using DDL? Eg: alter table add column ?
-
Does the warehouse support non-standard performance configurations? Ie. clustering, partitioning, sort/dist keys, etc. What are they, and how are they used? Can they be supplied in create table as statements?
-
Which column types does the warehouse support? Are text types varchars (with sizes) or unsized string columns? Are numeric types (with fixed precision) supported? Is there a different type for timestamps with timezones? Any caveats?
-
Does the warehouse support column-level constraints (eg. unique, not null, foreign key, primary key)? Are they enforced? Can they be defined in create table as statements?
-
Are there other noteworthy quirks/features of the warehouse that have not been specified above? If so: please enumerate them here!