Kendall Willets KWillets

Estimating NDV from Parquet metadata

Data lakes built on Parquet files are increasingly popular, but the data often lacks detailed statistics such as the number of distinct values (NDV) in a column, so estimating these from metadata alone has become an interesting problem. The diversity of data types, encodings, and compression methods in these files offers a number of possibilities, but here we will focus on min/max values.

Parquet files break their rows into row groups and record metadata for each one. Minimum and maximum values for each column are included to allow data skipping, but they also offer some insight into the NDV.
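
As one small illustration of what min/max values alone can reveal, the sketch below computes a simple lower bound on the NDV. It is not the estimator this note goes on to develop. It relies only on the observation that every recorded min or max is a value that occurs in the column, so the count of distinct values among them cannot exceed the NDV. The `RowGroupStats` struct and `ndv_lower_bound` function are hypothetical names, not part of any Parquet library, and the sketch assumes an integer column whose recorded statistics are exact (truncated string statistics would break the bound).

```cpp
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical holder for one row group's min/max on a single column.
// Assumption: the values are exact, i.e. they really occur in the row group.
struct RowGroupStats {
    int64_t min;
    int64_t max;
};

// Lower bound on the column's NDV: each min and each max is a real value,
// so the number of distinct values among them is at most the true NDV.
std::size_t ndv_lower_bound(const std::vector<RowGroupStats>& groups) {
    std::set<int64_t> seen;
    for (const RowGroupStats& g : groups) {
        seen.insert(g.min);
        seen.insert(g.max);
    }
    return seen.size();
}
```

For example, row groups with (min, max) pairs of (1, 10), (10, 25), (3, 25) and (7, 7) give a bound of 5, since the distinct endpoints are 1, 3, 7, 10 and 25. The bound is loose when row groups overlap heavily and share endpoints, and it says nothing about values strictly between a min and a max.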

For a Sorted Column

KWillets / divless.cpp
Created August 5, 2019 22:14
Lemire's almost divisionless random int in a range, extended with some fancy 32/64 extensible arithmetic
#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>
#include <random>
typedef uint32_t (*randfnc32)(void);
/*
To generate a random integer within a range, we start
with a random fractional number (in [0,1)) and multiply it
#include <stdio.h>
#include <x86intrin.h>
#include <inttypes.h>
#include "streamvbyte_shuffle_tables_decode.h"
void dump(const __m128i x, char * tag) {
printf( "%6s: ", tag);
char * xc = (char *) &x;
for( int i =0; i < 16; i++)
KWillets / jacc.c
Created October 8, 2018 16:15
Jaccard index count benchmarks
#include <stdlib.h>
#include <stdio.h>
#include <x86intrin.h>
#include <inttypes.h>
#include "benchmark.h"
// multiple of 4:
#define N (2048)