Ian Cook ianmcook

@ianmcook
ianmcook / ibis_create_duckdb_table.py
Created February 14, 2024 16:21
Different ways to create a DuckDB table from Ibis
import pandas as pd
import ibis
# Different ways to create a DuckDB table from Ibis
# ibis.memtable(...): ephemeral, in-memory, stored as a view inside DuckDB, removed when the session ends
# ibis.memtable(...).cache(): ephemeral, stored as a temporary table in the DuckDB database, removed when the session ends; the expression is cached for the lifetime of the session
# con.create_table(..., temp=True): ephemeral, stored as a temporary table in the DuckDB database, removed when the session ends; the expression is NOT cached for the lifetime of the session
# con.create_table(...): persistent across sessions (assuming you're not using an in-memory connection)
ianmcook / ibis_spark_pgsql.py
Last active January 30, 2024 21:01
Use Ibis to insert from Spark table into PostgreSQL table
import pandas as pd
import pyarrow as pa
import ibis
from pyspark.sql import SparkSession
# create example data in a pandas DataFrame
df = pd.DataFrame(data={'fruit': ['apple', 'apple', 'apple', 'orange', 'orange', 'orange'],
                        'variety': ['gala', 'honeycrisp', 'fuji', 'navel', 'valencia', 'cara cara'],
                        'weight': [134.2, 158.6, None, 142.1, 96.7, None]})
ianmcook / acero_tpch_06_decl_seq.cpp
Created January 22, 2024 23:24
Acero Sequence of Declarations for TPC-H Query 06
#include <iostream>
#include <arrow/api.h>
#include <arrow/type.h>
#include <arrow/result.h>
#include <arrow/io/api.h>
#include <arrow/compute/api.h>
#include <arrow/acero/exec_plan.h>
#include <arrow/acero/options.h>
#include <parquet/arrow/reader.h>
ianmcook / acero_tpch_06_decl.cpp
Created January 22, 2024 23:22
Acero Declarations for TPC-H Query 06
#include <iostream>
#include <arrow/api.h>
#include <arrow/type.h>
#include <arrow/result.h>
#include <arrow/io/api.h>
#include <arrow/compute/api.h>
#include <arrow/acero/exec_plan.h>
#include <arrow/acero/options.h>
#include <parquet/arrow/reader.h>
ianmcook / acero_tpch_06.cpp
Last active January 22, 2024 23:31
Acero ExecPlan for TPC-H Query 06
#include <iostream>
#include <arrow/api.h>
#include <arrow/type.h>
#include <arrow/result.h>
#include <arrow/io/api.h>
#include <arrow/compute/api.h>
#include <arrow/acero/exec_plan.h>
#include <arrow/acero/options.h>
#include <parquet/arrow/reader.h>
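
For reference, TPC-H Query 06 (the query the Acero gists above target) sums `l_extendedprice * l_discount` over `lineitem` rows that pass three filters. The sketch below restates that logic in plain Python rather than Acero. It assumes the default substitution parameters from the TPC-H specification (ship date in 1994, discount 0.06 plus or minus 0.01, quantity below 24); the sample rows and the `tpch_q6` helper are hypothetical and exist only for illustration.

```python
from datetime import date
from decimal import Decimal

# Hypothetical sample of lineitem rows:
# (l_quantity, l_extendedprice, l_discount, l_shipdate)
LINEITEM = [
    (10, Decimal("1000.00"), Decimal("0.06"), date(1994, 3, 15)),   # passes all filters
    (30, Decimal("2000.00"), Decimal("0.06"), date(1994, 3, 15)),   # quantity too high
    (10, Decimal("1500.00"), Decimal("0.10"), date(1994, 6, 1)),    # discount out of range
    (5, Decimal("500.00"), Decimal("0.05"), date(1995, 1, 1)),      # shipped too late
    (23, Decimal("200.00"), Decimal("0.07"), date(1994, 12, 31)),   # passes all filters
]


def tpch_q6(rows, year=1994, discount=Decimal("0.06"), quantity=24):
    """Sum price * discount over rows matching the Q6 predicates."""
    start, end = date(year, 1, 1), date(year + 1, 1, 1)
    lo, hi = discount - Decimal("0.01"), discount + Decimal("0.01")
    return sum(
        (
            price * disc
            for qty, price, disc, shipped in rows
            if start <= shipped < end and lo <= disc <= hi and qty < quantity
        ),
        Decimal("0"),
    )


revenue = tpch_q6(LINEITEM)
print(revenue)
```

`Decimal` is used here so that the inclusive discount bounds compare exactly; with binary floats, a boundary value such as 0.07 may fall on either side of the filter.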
ianmcook / pyarrow_read_write_order_test.py
Created November 7, 2023 20:04
Write and read Parquet files, combine columns together into an Arrow table, and check if order was preserved
import pyarrow as pa
import pyarrow.parquet as pq
import random
import string
# write parquet files
original = []
for i in range(3):
    data = [[random.uniform(0, 1) for _ in range(1000000)]]
    original.extend(data)
ianmcook / 1-write_parquet_float16.cpp
Last active October 13, 2023 18:04
Test writing and reading a Parquet file with a float16 column
#include <iostream>
#include <arrow/api.h>
#include <arrow/io/api.h>
#include <arrow/util/float16.h>
#include <parquet/arrow/writer.h>
arrow::Status WriteTableToParquetFile() {
  std::shared_ptr<arrow::Array> array;
  arrow::HalfFloatBuilder builder;
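
A float16 round trip is lossy for most values regardless of the file format, which matters when comparing what was written with what was read back. The following is a minimal sketch in plain Python (not Arrow or Parquet) that uses the standard library's `struct` half-precision format code `e`; the helper name and the sample values are hypothetical.

```python
import struct


def to_float16_and_back(x: float) -> float:
    """Round-trip x through IEEE 754 binary16 via struct's 'e' format."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


samples = [1.5, 0.1, 1000.1]  # hypothetical test values
round_tripped = [to_float16_and_back(x) for x in samples]

for before, after in zip(samples, round_tripped):
    # Values exactly representable in 16 bits survive; others are rounded.
    print(f"{before!r} -> {after!r} (exact: {before == after})")
```

A read-back check on a float16 column would therefore compare against the half-precision value that was stored, not against the original double.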
ianmcook / write_parquet_float.cpp
Last active October 13, 2023 18:10
Write Parquet file with float32 column
#include <iostream>
#include <random>
#include <arrow/api.h>
#include <arrow/io/api.h>
#include <parquet/arrow/writer.h>
float GetRandomFloat()
{
  static std::default_random_engine e;
ianmcook / write_wide_parquet.cpp
Created October 11, 2023 21:02
Write a very wide Parquet file
#include <iostream>
#include <random>
#include <vector>
#include <string>
#include <arrow/api.h>
#include <arrow/io/api.h>
#include <parquet/arrow/writer.h>
std::vector<std::string> GenerateUniqueStrings() {
  // generates 26^4 = 456,976 unique 4-letter combinations
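
The body of `GenerateUniqueStrings` is cut off in this preview. As a rough illustration of the count in the comment, the Python sketch below enumerates every 4-letter string; it assumes a lowercase `a`-`z` alphabet, which the preview does not confirm, and `generate_unique_strings` is a hypothetical stand-in rather than a port of the C++ function.

```python
from itertools import product
from string import ascii_lowercase


def generate_unique_strings(length: int = 4) -> list[str]:
    """Return every string of `length` lowercase letters, in lexicographic order."""
    return ["".join(letters) for letters in product(ascii_lowercase, repeat=length)]


names = generate_unique_strings()
# 26 choices per position, 4 positions: 26**4 distinct strings.
print(len(names), names[0], names[-1])
```

Strings built this way are unique by construction, so they can serve directly as column names for a very wide table.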
ianmcook / arrow_is_in.cpp
Created September 19, 2023 15:07
Standalone test of the Arrow C++ `is_in` kernel
#include <iostream>
#include <arrow/api.h>
#include <arrow/compute/api.h>
int main(int, char**) {
  // lookup set
  std::shared_ptr<arrow::Array> array;
  arrow::Int32Builder builder;
  if (!builder.Append(5).ok()) return 1;
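
The preview stops while the lookup set is still being built. Conceptually, an `is_in` operation returns one boolean per input value saying whether that value appears in the lookup set. The plain-Python sketch below shows that semantics only; it is not the Arrow kernel, it ignores null handling, and apart from the `5` appended above, all values are hypothetical.

```python
def is_in(values, value_set):
    """Return a boolean mask: True where the value occurs in value_set."""
    lookup = set(value_set)  # hash set gives O(1) average membership tests
    return [v in lookup for v in values]


value_set = [5]          # 5 comes from the preview; further entries are not shown
values = [1, 5, 7, 5]    # hypothetical input array
mask = is_in(values, value_set)
print(mask)
```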