@cpcloud
Last active August 29, 2015 14:17
blaze + odo SciPy 2015 abstract

Blaze + Odo: Shapeshifting on fire

Brief Description

Blaze separates expressions from computation. Odo moves complex data resources from point A to point B. Together they smooth over many of the complexities of computing with large data warehouse technologies like Redshift, Impala and HDFS. These libraries were designed with PyData in mind, so they play well with pandas, numpy, and a host of other foundational libraries. We show examples of each in action and discuss the design behind each library.

Blaze

Blaze lets us write down abstract expressions and then run those expressions against a data source. Because the expression is kept separate from the data, the details of the data source's API stay mostly hidden from the user. Blaze is also pluggable: users can write their own backends, which lets other communities hook into the PyData ecosystem. Blaze integrates well with other PyData projects such as numba. We discuss the design of blaze, show off a few backends, and show how users can hook other data-oriented systems into the library.
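To make the separation of expression and computation concrete, here is a minimal pure-Python sketch. It is not Blaze's API: the names `Symbol`, `Field`, `Sum`, `register`, and `compute` are hypothetical and chosen only for illustration. The same expression is evaluated against two toy "backends", a list of row dicts and a dict of column lists.

```python
from dataclasses import dataclass


# Expression nodes: they describe a computation and hold no data.
@dataclass(frozen=True)
class Symbol:
    name: str


@dataclass(frozen=True)
class Field:
    child: object
    name: str


@dataclass(frozen=True)
class Sum:
    child: object


# Hypothetical backend registry: (expression type, data type) -> function.
_BACKENDS = {}


def register(expr_type, data_type):
    """Register an implementation of one expression node for one data type."""
    def decorator(func):
        _BACKENDS[(expr_type, data_type)] = func
        return func
    return decorator


def compute(expr, data):
    """Evaluate the expression tree bottom-up against a concrete data source."""
    if isinstance(expr, Symbol):
        return data
    child_result = compute(expr.child, data)
    for (expr_type, data_type), func in _BACKENDS.items():
        if isinstance(expr, expr_type) and isinstance(child_result, data_type):
            return func(expr, child_result)
    raise NotImplementedError(
        f"no backend for {type(expr).__name__} on {type(child_result).__name__}"
    )


# Backend 1: row-oriented data, a list of dicts.
@register(Field, list)
def field_from_rows(expr, rows):
    return [row[expr.name] for row in rows]


# Backend 2: column-oriented data, a dict of lists.
@register(Field, dict)
def field_from_columns(expr, columns):
    return columns[expr.name]


# Both backends hand back a plain list of values, so one Sum rule covers both.
@register(Sum, list)
def sum_values(expr, values):
    return sum(values)


# One expression, written once, with no reference to any data source.
accounts = Symbol("accounts")
total = Sum(Field(accounts, "amount"))

rows = [{"name": "Alice", "amount": 100}, {"name": "Bob", "amount": 200}]
columns = {"name": ["Alice", "Bob"], "amount": [100, 200]}

print(compute(total, rows))
print(compute(total, columns))
```

Supporting a new data source in this sketch means registering more functions; the expression `total` never changes.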

Odo

Odo is the library behind blaze that handles data movement, and it also stands well on its own. We often need to convert one kind of data into another. For example, we may want to load a pandas DataFrame into a Parquet file on a Hive cluster, or read a JSON file from an Amazon S3 bucket into a DataFrame. Each step in such a process is easy on its own, but tying the steps together in a robust way has traditionally been cumbersome. Odo does the tying for you: it builds up a graph of types connected by conversion functions. This approach allows robust conversion between many disparate types of data without having to write down a path for every conversion that you may want. We discuss the effectiveness of this approach alongside some examples that illustrate how it integrates with blaze.
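The idea of a graph of types connected by conversion functions can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not odo's implementation: `register`, `find_path`, and `convert` are hypothetical names, the built-in types stand in for real data resources, and breadth-first search is a stand-in for however a real library chooses among paths.

```python
from collections import deque

# Hypothetical conversion graph: source type -> {target type: function}.
_CONVERSIONS = {}


def register(source, target):
    """Add one edge (a direct conversion function) to the graph."""
    def decorator(func):
        _CONVERSIONS.setdefault(source, {})[target] = func
        return func
    return decorator


def find_path(source, target):
    """Breadth-first search for the shortest chain of conversion functions."""
    queue = deque([(source, [])])
    seen = {source}
    while queue:
        node, path = queue.popleft()
        if node is target:
            return path
        for next_type, func in _CONVERSIONS.get(node, {}).items():
            if next_type not in seen:
                seen.add(next_type)
                queue.append((next_type, path + [func]))
    raise NotImplementedError(
        f"no conversion path from {source.__name__} to {target.__name__}"
    )


def convert(target, obj):
    """Convert obj to the target type by applying each step on the path."""
    result = obj
    for func in find_path(type(obj), target):
        result = func(result)
    return result


# Only three direct conversions are written by hand.
@register(str, list)
def text_to_list(text):
    return text.split(",")


@register(list, tuple)
def list_to_tuple(items):
    return tuple(items)


@register(tuple, frozenset)
def tuple_to_frozenset(items):
    return frozenset(items)


# str -> frozenset was never registered; it is found by chaining the edges.
print(convert(frozenset, "a,b,a"))
```

With N types, a handful of direct edges can cover many of the N × (N − 1) possible conversions, which is why no path has to be written down per pair.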
