Phillip Cloud cpcloud

Blaze + Odo: Shapeshifting on fire

Brief Description

Blaze separates expressions from computation. Odo moves complex data resources from point A to point B. Together they smooth over many of the complexities of computing with large data warehouse technologies like Redshift, Impala and HDFS. These libraries were designed with PyData in mind, so they play well with pandas, numpy, and a host of other foundational libraries. We show examples of each in action and discuss the design behind each library.

Blaze

Blaze lets us write down abstract expressions and then run those expressions against a data source. This approach lets users separate computation from data so that the details of the data source's API are mostly hidden. Blaze is also pluggable: users can easily write new backends, which lets other communities hook into the PyData ecosystem. In addition, blaze is well-integrated with other PyData projects such as numba. We discuss the design of blaze, show off a few backends and sh
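To make the idea of separating an expression from its computation concrete, here is a minimal sketch in plain Python. It is not Blaze's API: `Column`, `GreaterThan`, `compute_python`, and `compute_sqlite` are hypothetical names invented for this illustration, and the sample data is made up. The same expression object is built once and then evaluated by two different backends.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    """Hypothetical abstract reference to a named column (no data attached)."""
    name: str

    def __gt__(self, value):
        # Comparing a column builds an expression node instead of a bool.
        return GreaterThan(self, value)


@dataclass(frozen=True)
class GreaterThan:
    """Hypothetical expression node: column > value."""
    column: Column
    value: float


def compute_python(expr, rows):
    """Backend 1: evaluate the expression against a list of dicts."""
    return [row for row in rows if row[expr.column.name] > expr.value]


def compute_sqlite(expr, conn, table):
    """Backend 2: translate the expression to SQL and run it in SQLite.

    Identifiers are interpolated directly, which is acceptable only because
    this sketch controls them; the comparison value is passed as a parameter.
    """
    query = f"SELECT * FROM {table} WHERE {expr.column.name} > ? ORDER BY rowid"
    cur = conn.execute(query, (expr.value,))
    names = [d[0] for d in cur.description]
    return [dict(zip(names, row)) for row in cur]


# Made-up sample data for the illustration.
rows = [
    {"name": "Alice", "amount": 100},
    {"name": "Bob", "amount": 250},
    {"name": "Carol", "amount": 300},
]

# The expression is written once, with no data source in sight.
expr = Column("amount") > 200

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT, amount INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (:name, :amount)", rows)

from_python = compute_python(expr, rows)
from_sqlite = compute_sqlite(expr, conn, "accounts")
```

Adding a backend in this sketch means writing one more `compute_*` function; the expression classes do not change, which is the property the pluggable design relies on.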

Building `arrow`, `parquet-cpp`, and `pyarrow`

Prerequisites

conda
Boost (>= 1.54)
A recent-ish C/C++ compiler (4.9?)

	{
	"metadata": {
	"name": "",
	"signature": "sha256:1428f1fc4d07eeb621c648b728c7feb5f0210656024476c109e86964cb744ea9"
	},
	"nbformat": 3,
	"nbformat_minor": 0,
	"worksheets": [
	{
	"cells": [

	{
	"metadata": {
	"name": "",
	"signature": "sha256:1efaafda434deed4c4582e9da622b8c45fff909cb71c32dbc68a4422b72d3e43"
	},
	"nbformat": 3,
	"nbformat_minor": 0,
	"worksheets": [
	{
	"cells": [

	wget https://www.python.org/ftp/python/2.7.8/Python-2.7.8.tar.xz
	tar xvf Python-2.7.8.tar.xz
	cd Python-2.7.8
	./configure --enable-shared --enable-ipv6 --enable-unicode=ucs4 --prefix=/usr
	make -j `nproc`
	find . -name 'readline.so'

	# currently:
	diamonds[(diamonds.cut == 'Ideal') | (diamonds.cut == 'Premium')][['cut', 'price']].sort('price', ascending=False).head(10)

	# ideally:
	diamonds[diamonds.cut.isin(['Ideal', 'Premium'])][['cut', 'price']].sort('price', ascending=False).head(10)

	#!/usr/bin/env python

	"""
	Dask version of
	https://hdfgroup.org/wp/2015/04/putting-some-spark-into-hdf-eos/
	"""

	from __future__ import print_function, division

	import os

	In [19]: df = pd.DataFrame({'a':[1,2,3],'b':[1.0,None,3.0]}, index=list('abc'))

	In [20]: t = pa.Table.from_pandas(df)

	In [21]: t.column(2).to_pandas()
	Out[21]:
	0    a
	1    b
	2    c
	Name: _index_level_0, dtype: object