Last active
January 18, 2016 21:00
-
-
Save lgautier/a76839ba9914863c368d to your computer and use it in GitHub Desktop.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# dplyr in Python\n", | |
"Note: a more recent version of this is in the rpy2 documentation.\n", | |
"\n", | |
"We need 2 things for this:\n", | |
"\n", | |
"1- A data frame (using one of R's demo datasets)." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 1, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"from rpy2.robjects.packages import importr, data\n", | |
"datasets = importr('datasets')\n", | |
"mtcars_env = data(datasets).fetch('mtcars')\n", | |
"mtcars = mtcars_env['mtcars']" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"2- dplyr" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 2, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"from rpy2.robjects.lib.dplyr import DataFrame" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"With this we have the choice of chaining (D3-style)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 3, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Source: local data frame [2 x 2]\n", | |
"\n", | |
" gear mean_ptw\n", | |
"1 4 1237.127\n", | |
"2 5 2574.033\n", | |
"\n" | |
] | |
} | |
], | |
"source": [ | |
"dataf = (DataFrame(mtcars).\n", | |
" filter('gear>3').\n", | |
" mutate(powertoweight='hp*36/wt').\n", | |
" group_by('gear').\n", | |
" summarize(mean_ptw='mean(powertoweight)'))\n", | |
"\n", | |
"print(dataf)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"or piping (magrittr style)." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 4, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Source: local data frame [2 x 2]\n", | |
"\n", | |
" gear mean_ptw\n", | |
"1 4 1237.127\n", | |
"2 5 2574.033\n", | |
"\n" | |
] | |
} | |
], | |
"source": [ | |
"from rpy2.robjects.lib.dplyr import (filter,\n", | |
" mutate,\n", | |
" group_by,\n", | |
" summarize)\n", | |
"\n", | |
"dataf = (DataFrame(mtcars) >>\n", | |
" filter('gear>3') >>\n", | |
" mutate(powertoweight='hp*36/wt') >>\n", | |
" group_by('gear') >>\n", | |
" summarize(mean_ptw='mean(powertoweight)'))\n", | |
"\n", | |
"print(dataf)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"The strings passed to the dplyr function are evaluated as expression,\n", | |
"just like this is happening when using dplyr in R. This means that\n", | |
"when writing `mean(powertoweight)` the R function `mean()` is used.\n", | |
"\n", | |
"Using an Python function is not too difficult though. We can just\n", | |
"call Python back from R:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 5, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Source: local data frame [2 x 3]\n", | |
"\n", | |
" gear mean_np_ptw mean_ptw\n", | |
"1 4 1237.127 1237.127\n", | |
"2 5 2574.033 2574.033\n", | |
"\n" | |
] | |
} | |
], | |
"source": [ | |
"from rpy2.rinterface import rternalize\n", | |
"@rternalize\n", | |
"def mean_np(x):\n", | |
" import numpy\n", | |
" return numpy.mean(x)\n", | |
"\n", | |
"from rpy2.robjects import globalenv\n", | |
"globalenv['mean_np'] = mean_np\n", | |
"\n", | |
"dataf = (DataFrame(mtcars) >>\n", | |
" filter('gear>3') >>\n", | |
" mutate(powertoweight='hp*36/wt') >>\n", | |
" group_by('gear') >>\n", | |
" summarize(mean_ptw='mean(powertoweight)',\n", | |
" mean_np_ptw='mean_np(powertoweight)'))\n", | |
"\n", | |
"print(dataf)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"It is also possible to carry this out without having to\n", | |
"place the custom function in R's global environment." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 6, | |
"metadata": { | |
"collapsed": true | |
}, | |
"outputs": [], | |
"source": [ | |
"del(globalenv['mean_np'])" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 7, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Source: local data frame [2 x 3]\n", | |
"\n", | |
" gear mean_np_ptw mean_ptw\n", | |
"1 4 1237.127 1237.127\n", | |
"2 5 2574.033 2574.033\n", | |
"\n" | |
] | |
} | |
], | |
"source": [ | |
"from rpy2.robjects.lib.dplyr import StringInEnv\n", | |
"from rpy2.robjects import Environment\n", | |
"my_env = Environment()\n", | |
"my_env['mean_np'] = mean_np\n", | |
"\n", | |
"dataf = (DataFrame(mtcars) >>\n", | |
" filter('gear>3') >>\n", | |
" mutate(powertoweight='hp*36/wt') >>\n", | |
" group_by('gear') >>\n", | |
" summarize(mean_ptw='mean(powertoweight)',\n", | |
" mean_np_ptw=StringInEnv('mean_np(powertoweight)',\n", | |
" my_env)))\n", | |
"\n", | |
"print(dataf)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"**note**: rpy2's interface to dplyr is implementing a fix to the (non-?)issue 1323\n", | |
"(https://github.com/hadley/dplyr/issues/1323)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"The seamless translation of transformations to SQL whenever the\n", | |
"data are in a table can be used directly. Since we are lifting\n", | |
"the original implementation of `dplyr`, it *just works*." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 8, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Source: sqlite 3.8.6 [/tmp/tmpwd488fg1]\n", | |
"From: <derived table> [?? x 2]\n", | |
"\n", | |
" gear mean_ptw\n", | |
"1 4 1237.127\n", | |
"2 5 2574.033\n", | |
".. ... ...\n", | |
"\n" | |
] | |
} | |
], | |
"source": [ | |
"from rpy2.robjects.lib.dplyr import dplyr\n", | |
"# in-memory SQLite database broken in dplyr's src_sqlite\n", | |
"# db = dplyr.src_sqlite(\":memory:\")\n", | |
"import tempfile\n", | |
"with tempfile.NamedTemporaryFile() as db_fh:\n", | |
" db = dplyr.src_sqlite(db_fh.name)\n", | |
" # copy the table to that database\n", | |
" dataf_db = DataFrame(mtcars).copy_to(db, name=\"mtcars\")\n", | |
" res = (dataf_db >>\n", | |
" filter('gear>3') >>\n", | |
" mutate(powertoweight='hp*36/wt') >>\n", | |
" group_by('gear') >>\n", | |
" summarize(mean_ptw='mean(powertoweight)'))\n", | |
" print(res)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Since we are manipulating R objects, anything available to R is also available\n", | |
"to us. If we want to see the SQL code generated that's:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 9, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"<SQL> SELECT \"gear\", \"mean_ptw\"\n", | |
"FROM (SELECT \"gear\", AVG(\"powertoweight\") AS \"mean_ptw\"\n", | |
"FROM (SELECT \"mpg\", \"cyl\", \"disp\", \"hp\", \"drat\", \"wt\", \"qsec\", \"vs\", \"am\", \"gear\", \"carb\", \"hp\" * 36.0 / \"wt\" AS \"powertoweight\"\n", | |
"FROM \"mtcars\"\n", | |
"WHERE \"gear\" > 3.0) AS \"_W1\"\n", | |
"GROUP BY \"gear\") AS \"_W2\"\n", | |
"\n" | |
] | |
} | |
], | |
"source": [ | |
"print(res.rx2(\"query\")[\"sql\"])" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"And if the starting point is a pandas data frame,\n", | |
"do the following and start over again." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 10, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"<class 'pandas.core.frame.DataFrame'>\n" | |
] | |
} | |
], | |
"source": [ | |
"from rpy2.robjects import pandas2ri\n", | |
"from rpy2.robjects import default_converter\n", | |
"from rpy2.robjects.conversion import localconverter\n", | |
"with localconverter(default_converter + pandas2ri.converter) as cv:\n", | |
" mtcars = mtcars_env['mtcars']\n", | |
" mtcars = pandas2ri.ri2py(mtcars)\n", | |
"print(type(mtcars))" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Using a local converter let's us also go from the pandas data frame to our dplyr-augmented R data frame." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 11, | |
"metadata": { | |
"collapsed": false | |
}, | |
"outputs": [ | |
{ | |
"name": "stdout", | |
"output_type": "stream", | |
"text": [ | |
"Source: local data frame [3 x 2]\n", | |
"\n", | |
" gear mean_ptw\n", | |
"1 3 1633.990\n", | |
"2 4 1237.127\n", | |
"3 5 2574.033\n", | |
"\n" | |
] | |
} | |
], | |
"source": [ | |
"with localconverter(default_converter + pandas2ri.converter) as cv:\n", | |
" dataf = (DataFrame(mtcars).\n", | |
" filter('gear>=3').\n", | |
" mutate(powertoweight='hp*36/wt').\n", | |
" group_by('gear').\n", | |
" summarize(mean_ptw='mean(powertoweight)'))\n", | |
"\n", | |
"print(dataf)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": { | |
"collapsed": false | |
}, | |
"source": [ | |
"**Reuse. Get things done. Don't reimplement.**" | |
] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python 3", | |
"language": "python", | |
"name": "python3" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.4.3" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 0 | |
} |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment