cpcloud / lexer.py
Created January 8, 2018 13:42
Simple Arithmetic Lexer
import collections
import enum
import re
from sre_parse import Pattern, SubPattern, parse
from sre_compile import compile as sre_compile
from sre_constants import BRANCH, SUBPATTERN
class Tokens(enum.Enum):
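    # (The preview cuts off here.) A hedged guess at how such a lexer is
    # typically completed -- the token names and patterns below are
    # illustrative assumptions, not the gist's actual contents, and this
    # stand-in uses re's named groups rather than the sre internals
    # imported above.
    NUMBER = r'\d+'
    PLUS = r'\+'
    MINUS = r'-'
    TIMES = r'\*'
    DIVIDE = r'/'
    LPAREN = r'\('
    RPAREN = r'\)'
    WS = r'\s+'

# Combine every token pattern into one alternation of named groups and
# dispatch on lastgroup to recover the token type.
MASTER = re.compile('|'.join(
    '(?P<{}>{})'.format(tok.name, tok.value) for tok in Tokens
))

def tokenize(text):
    for match in MASTER.finditer(text):
        if match.lastgroup != 'WS':  # drop whitespace tokens
            yield Tokens[match.lastgroup], match.group()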
cpcloud / decimalz.md
Created November 14, 2017 22:10
Decimals

Decimal Values in SQL-on-Hadoop

This document lays out how a few prominent SQL-on-Hadoop systems read decimal values from and write them to Parquet files, and describes their respective in-memory formats.

Parquet's logical DECIMAL type can be represented by the following physical types.
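Per the Parquet format specification, those physical types are:

  • INT32 (precision 1 to 9)
  • INT64 (precision 1 to 18)
  • FIXED_LEN_BYTE_ARRAY
  • BINARY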

cpcloud / sparkimalz.py
Last active October 9, 2017 14:42
Sparkimalz
from pyspark.sql import Row
spark.conf.set('spark.sql.parquet.writeLegacyFormat', 'false')
spark.conf.set('spark.sql.parquet.compression.codec', 'uncompressed')
sc = spark.sparkContext
df = spark.createDataFrame(
    sc.parallelize(range(1, 100)).map(lambda i: Row(value=i))
)
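# Hypothetical continuation (an assumption, not the gist's actual code):
# cast to a decimal column and write it out with the Parquet settings
# configured above.
from pyspark.sql import functions as F

decimals = df.select(F.col('value').cast('decimal(10, 2)').alias('value'))
decimals.write.mode('overwrite').parquet('/tmp/sparkimalz.parquet')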
cpcloud / restart_docker_impala.sh
Created July 26, 2017 03:12
Run Impala Docker Image
#!/usr/bin/env zsh
export IBIS_TEST_NN_HOST=impalalive
export IBIS_TEST_IMPALA_HOST=$IBIS_TEST_NN_HOST
export IBIS_TEST_IMPALA_PORT=21050
export IBIS_TEST_WEBHDFS_PORT=50070
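# The preview ends with the exports above. A hypothetical run command
# consistent with them -- the image name and flag choices are
# assumptions, not the gist's actual contents:
docker run -d \
  --name "$IBIS_TEST_NN_HOST" \
  --hostname "$IBIS_TEST_NN_HOST" \
  -p "${IBIS_TEST_IMPALA_PORT}:21050" \
  -p "${IBIS_TEST_WEBHDFS_PORT}:50070" \
  impala-image:latest  # placeholder image name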
cpcloud / foo.patch
Last active July 4, 2017 14:31
Bag
diff --git a/ftplugin/python/slime.vim b/ftplugin/python/slime.vim
index f95e334..6de0b84 100644
--- a/ftplugin/python/slime.vim
+++ b/ftplugin/python/slime.vim
@@ -1,7 +1,7 @@
function! _EscapeText_python(text)
if exists('g:slime_python_ipython') && len(split(a:text,"\n")) > 1
- return ["%cpaste -q\n", a:text, "--\n"]
+ return ["\e[200~", a:text, "\e[201~\n"]
In [19]: df = pd.DataFrame({'a':[1,2,3],'b':[1.0,None,3.0]}, index=list('abc'))
In [20]: t = pa.Table.from_pandas(df)
In [21]: t.column(2).to_pandas()
Out[21]:
0 a
1 b
2 c
Name: __index_level_0__, dtype: object
cpcloud / arrow-build.md
Last active March 13, 2024 20:29
Arrow build instructions

Building arrow, parquet-cpp, and pyarrow

Prerequisites

  • conda
  • Boost (>= 1.54)
  • A recent-ish C/C++ compiler (4.9?)

Create a Conda environment
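
A minimal sketch of this step, assuming conda-forge packages (the package list is an assumption, not the gist's exact one):

conda create -y -n pyarrow-dev -c conda-forge \
    python=3.6 numpy pandas cython cmake boost-cpp
source activate pyarrow-dev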

cpcloud / daskit.py
Last active August 29, 2015 14:26
Do vs Bag + Do
#!/usr/bin/env python
"""
Dask version of
https://hdfgroup.org/wp/2015/04/putting-some-spark-into-hdf-eos/
"""
from __future__ import print_function, division
import os
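
# (The preview cuts off here.) A hedged sketch of the comparison the
# title names; modern dask spells "do" as "delayed", and the paths and
# the process function below are illustrative assumptions.
import dask.bag as db
from dask import delayed

def process(path):
    return os.path.getsize(path)  # placeholder per-file work

paths = ['a.h5', 'b.h5']

# Bag version: build a bag of paths and map the work across it.
bag_result = db.from_sequence(paths).map(process)

# do/delayed version: wrap each call lazily, then compute them together.
delayed_results = [delayed(process)(path) for path in paths]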
cpcloud / scipy-fu.md
Last active August 29, 2015 14:17
blaze + odo SciPy 2015 abstract

Blaze + Odo: Shapeshifting on fire

Brief Description

Blaze separates expressions from computation. Odo moves complex data resources from point A to point B. Together they smooth over many of the complexities of computing with large data warehouse technologies like Redshift, Impala, and HDFS. These libraries were designed with PyData in mind, so they play well with pandas, numpy, and a host of other foundational libraries. We show examples of each in action and discuss the design behind each library.

Blaze

Blaze lets us write down abstract expressions and then run those expressions against a data source. This approach separates computation from data, so the details of the data source's API are mostly hidden from the user. Additionally, blaze is pluggable: users can easily write new backends, which lets other communities hook into the PyData ecosystem. Blaze is also well-integrated with other PyData projects such as numba. We discuss the design of blaze, show off a few backends, and sh…

# currently:
diamonds[(diamonds.cut == 'Ideal') | (diamonds.cut == 'Premium')][['cut', 'price']].sort('price', ascending=False).head(10)
# ideally:
diamonds[diamonds.cut.isin(['Ideal', 'Premium'])][['cut', 'price']].sort('price', ascending=False).head(10)
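
Odo's single-function API is worth a quick sketch as well (the file and table names here are illustrative assumptions):

from odo import odo

# Move a CSV into a SQLite table in one call; odo infers both formats
# from the URIs and builds the conversion path between them.
odo('diamonds.csv', 'sqlite:///diamonds.db::diamonds')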