Skip to content

Instantly share code, notes, and snippets.

@lschwetlick
Last active September 15, 2021 20:52
Show Gist options
  • Save lschwetlick/76d256e8f8e9c6fa499fd6cdfbe8954c to your computer and use it in GitHub Desktop.
Save lschwetlick/76d256e8f8e9c6fa499fd6cdfbe8954c to your computer and use it in GitHub Desktop.
Multi-Column Explode idea
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "code",
"source": [
"import pandas as pd"
],
"outputs": [],
"execution_count": 65,
"metadata": {
"collapsed": true,
"jupyter": {
"source_hidden": false,
"outputs_hidden": false
},
"nteract": {
"transient": {
"deleting": false
}
},
"execution": {
"iopub.status.busy": "2021-09-15T20:34:51.572Z",
"iopub.execute_input": "2021-09-15T20:34:51.577Z",
"iopub.status.idle": "2021-09-15T20:34:51.586Z",
"shell.execute_reply": "2021-09-15T20:34:51.592Z"
}
}
},
{
"cell_type": "markdown",
"source": [
"Toy Expample"
],
"metadata": {
"nteract": {
"transient": {
"deleting": false
}
}
}
},
{
"cell_type": "code",
"source": [
"d = {\"A\":[1,2,3],\n",
" \"B\":[\"a\",\"b\",\"c\"],\n",
" \"C\":[[1,2],[3,4],[5,6]],\n",
" \"D\":[[4,5],[6,7],[8,9]], }\n",
"df = pd.DataFrame.from_dict(d)\n",
"df"
],
"outputs": [
{
"output_type": "execute_result",
"execution_count": 66,
"data": {
"text/plain": " A B C D\n0 1 a [1, 2] [4, 5]\n1 2 b [3, 4] [6, 7]\n2 3 c [5, 6] [8, 9]",
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>A</th>\n <th>B</th>\n <th>C</th>\n <th>D</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>1</td>\n <td>a</td>\n <td>[1, 2]</td>\n <td>[4, 5]</td>\n </tr>\n <tr>\n <th>1</th>\n <td>2</td>\n <td>b</td>\n <td>[3, 4]</td>\n <td>[6, 7]</td>\n </tr>\n <tr>\n <th>2</th>\n <td>3</td>\n <td>c</td>\n <td>[5, 6]</td>\n <td>[8, 9]</td>\n </tr>\n </tbody>\n</table>\n</div>"
},
"metadata": {}
}
],
"execution_count": 66,
"metadata": {
"collapsed": true,
"jupyter": {
"source_hidden": false,
"outputs_hidden": false
},
"nteract": {
"transient": {
"deleting": false
}
},
"execution": {
"iopub.status.busy": "2021-09-15T20:34:51.979Z",
"iopub.execute_input": "2021-09-15T20:34:51.987Z",
"iopub.status.idle": "2021-09-15T20:34:52.001Z",
"shell.execute_reply": "2021-09-15T20:34:52.008Z"
}
}
},
{
"cell_type": "markdown",
"source": [
"Exploding one after the other does not yield the correct result"
],
"metadata": {
"nteract": {
"transient": {
"deleting": false
}
}
}
},
{
"cell_type": "code",
"source": [
"df.explode(\"C\").explode(\"D\")"
],
"outputs": [
{
"output_type": "execute_result",
"execution_count": 67,
"data": {
"text/plain": " A B C D\n0 1 a 1 4\n0 1 a 1 5\n0 1 a 2 4\n0 1 a 2 5\n1 2 b 3 6\n1 2 b 3 7\n1 2 b 4 6\n1 2 b 4 7\n2 3 c 5 8\n2 3 c 5 9\n2 3 c 6 8\n2 3 c 6 9",
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>A</th>\n <th>B</th>\n <th>C</th>\n <th>D</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>1</td>\n <td>a</td>\n <td>1</td>\n <td>4</td>\n </tr>\n <tr>\n <th>0</th>\n <td>1</td>\n <td>a</td>\n <td>1</td>\n <td>5</td>\n </tr>\n <tr>\n <th>0</th>\n <td>1</td>\n <td>a</td>\n <td>2</td>\n <td>4</td>\n </tr>\n <tr>\n <th>0</th>\n <td>1</td>\n <td>a</td>\n <td>2</td>\n <td>5</td>\n </tr>\n <tr>\n <th>1</th>\n <td>2</td>\n <td>b</td>\n <td>3</td>\n <td>6</td>\n </tr>\n <tr>\n <th>1</th>\n <td>2</td>\n <td>b</td>\n <td>3</td>\n <td>7</td>\n </tr>\n <tr>\n <th>1</th>\n <td>2</td>\n <td>b</td>\n <td>4</td>\n <td>6</td>\n </tr>\n <tr>\n <th>1</th>\n <td>2</td>\n <td>b</td>\n <td>4</td>\n <td>7</td>\n </tr>\n <tr>\n <th>2</th>\n <td>3</td>\n <td>c</td>\n <td>5</td>\n <td>8</td>\n </tr>\n <tr>\n <th>2</th>\n <td>3</td>\n <td>c</td>\n <td>5</td>\n <td>9</td>\n </tr>\n <tr>\n <th>2</th>\n <td>3</td>\n <td>c</td>\n <td>6</td>\n <td>8</td>\n </tr>\n <tr>\n <th>2</th>\n <td>3</td>\n <td>c</td>\n <td>6</td>\n <td>9</td>\n </tr>\n </tbody>\n</table>\n</div>"
},
"metadata": {}
}
],
"execution_count": 67,
"metadata": {
"collapsed": true,
"jupyter": {
"source_hidden": false,
"outputs_hidden": false
},
"nteract": {
"transient": {
"deleting": false
}
},
"execution": {
"iopub.status.busy": "2021-09-15T20:34:52.339Z",
"iopub.execute_input": "2021-09-15T20:34:52.346Z",
"iopub.status.idle": "2021-09-15T20:34:52.359Z",
"shell.execute_reply": "2021-09-15T20:34:52.364Z"
}
}
},
{
"cell_type": "markdown",
"source": [
"This code snippet mirrors the solution in the existing explode"
],
"metadata": {
"nteract": {
"transient": {
"deleting": false
}
}
}
},
{
"cell_type": "code",
"source": [
"def explode_multiple_columns(df, cols_to_explode):\n",
" result = pd.DataFrame({col: df[col].explode() for col in df[cols_to_explode]})\n",
" return (\n",
" df.drop(cols_to_explode, axis=1)\n",
" .join(result)\n",
" )"
],
"outputs": [],
"execution_count": 68,
"metadata": {
"collapsed": true,
"jupyter": {
"source_hidden": false,
"outputs_hidden": false
},
"nteract": {
"transient": {
"deleting": false
}
},
"execution": {
"iopub.status.busy": "2021-09-15T20:34:52.721Z",
"iopub.execute_input": "2021-09-15T20:34:52.727Z",
"iopub.status.idle": "2021-09-15T20:34:52.737Z",
"shell.execute_reply": "2021-09-15T20:34:52.742Z"
}
}
},
{
"cell_type": "markdown",
"source": [
"And it yields the correct result"
],
"metadata": {
"nteract": {
"transient": {
"deleting": false
}
}
}
},
{
"cell_type": "code",
"source": [
"cols_to_explode = [\"C\", \"D\"]\n",
"explode_multiple_columns(df, cols_to_explode )"
],
"outputs": [
{
"output_type": "execute_result",
"execution_count": 69,
"data": {
"text/plain": " A B C D\n0 1 a 1 4\n0 1 a 2 5\n1 2 b 3 6\n1 2 b 4 7\n2 3 c 5 8\n2 3 c 6 9",
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>A</th>\n <th>B</th>\n <th>C</th>\n <th>D</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>1</td>\n <td>a</td>\n <td>1</td>\n <td>4</td>\n </tr>\n <tr>\n <th>0</th>\n <td>1</td>\n <td>a</td>\n <td>2</td>\n <td>5</td>\n </tr>\n <tr>\n <th>1</th>\n <td>2</td>\n <td>b</td>\n <td>3</td>\n <td>6</td>\n </tr>\n <tr>\n <th>1</th>\n <td>2</td>\n <td>b</td>\n <td>4</td>\n <td>7</td>\n </tr>\n <tr>\n <th>2</th>\n <td>3</td>\n <td>c</td>\n <td>5</td>\n <td>8</td>\n </tr>\n <tr>\n <th>2</th>\n <td>3</td>\n <td>c</td>\n <td>6</td>\n <td>9</td>\n </tr>\n </tbody>\n</table>\n</div>"
},
"metadata": {}
}
],
"execution_count": 69,
"metadata": {
"collapsed": true,
"jupyter": {
"source_hidden": false,
"outputs_hidden": false
},
"nteract": {
"transient": {
"deleting": false
}
},
"execution": {
"iopub.status.busy": "2021-09-15T20:34:53.103Z",
"iopub.execute_input": "2021-09-15T20:34:53.108Z",
"iopub.status.idle": "2021-09-15T20:34:53.120Z",
"shell.execute_reply": "2021-09-15T20:34:53.125Z"
}
}
},
{
"cell_type": "markdown",
"source": [
"So what if the lists are not the same length?"
],
"metadata": {
"nteract": {
"transient": {
"deleting": false
}
}
}
},
{
"cell_type": "code",
"source": [
"d = {\"A\":[1,2,3],\n",
" \"B\":[\"a\",\"b\",\"c\"],\n",
" \"C\":[[1,2],[3,4],[5,6]],\n",
" \"D\":[[4,5,9],[6,7],[8]], }\n",
"df = pd.DataFrame.from_dict(d)\n",
"df"
],
"outputs": [
{
"output_type": "execute_result",
"execution_count": 70,
"data": {
"text/plain": " A B C D\n0 1 a [1, 2] [4, 5, 9]\n1 2 b [3, 4] [6, 7]\n2 3 c [5, 6] [8]",
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>A</th>\n <th>B</th>\n <th>C</th>\n <th>D</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>1</td>\n <td>a</td>\n <td>[1, 2]</td>\n <td>[4, 5, 9]</td>\n </tr>\n <tr>\n <th>1</th>\n <td>2</td>\n <td>b</td>\n <td>[3, 4]</td>\n <td>[6, 7]</td>\n </tr>\n <tr>\n <th>2</th>\n <td>3</td>\n <td>c</td>\n <td>[5, 6]</td>\n <td>[8]</td>\n </tr>\n </tbody>\n</table>\n</div>"
},
"metadata": {}
}
],
"execution_count": 70,
"metadata": {
"collapsed": true,
"jupyter": {
"source_hidden": false,
"outputs_hidden": false
},
"nteract": {
"transient": {
"deleting": false
}
},
"execution": {
"iopub.status.busy": "2021-09-15T20:34:53.569Z",
"iopub.execute_input": "2021-09-15T20:34:53.576Z",
"iopub.status.idle": "2021-09-15T20:34:53.588Z",
"shell.execute_reply": "2021-09-15T20:34:53.594Z"
}
}
},
{
"cell_type": "markdown",
"source": [
"We fill them up..."
],
"metadata": {
"nteract": {
"transient": {
"deleting": false
}
}
}
},
{
"cell_type": "code",
"source": [
"def explode_multiple_columns2(df, cols_to_explode):\n",
" exploded_list = []\n",
" for col in df[cols_to_explode]:\n",
" df2 = df[col].explode().reset_index()\n",
" df2[\"ixix\"] = df2.groupby(['index']).cumcount()\n",
" exploded_list.append(df2)\n",
" \n",
" #exploded_list = [df[col].explode().reset_index().reset_index() for col in df[cols_to_explode]]\n",
" from functools import reduce\n",
" # this is a bit ugly because I'm using the default names as a hack...\n",
" result = reduce(lambda x, y: pd.merge_ordered(x, y, on = ['index', \"ixix\"], how=\"outer\"), exploded_list).set_index(\"index\").drop(\"ixix\", axis=1)\n",
" return df.drop(cols_to_explode, axis=1).join(result)"
],
"outputs": [],
"execution_count": 71,
"metadata": {
"collapsed": true,
"jupyter": {
"source_hidden": false,
"outputs_hidden": false
},
"nteract": {
"transient": {
"deleting": false
}
},
"execution": {
"iopub.status.busy": "2021-09-15T20:34:54.241Z",
"iopub.execute_input": "2021-09-15T20:34:54.246Z",
"iopub.status.idle": "2021-09-15T20:34:54.254Z",
"shell.execute_reply": "2021-09-15T20:34:54.260Z"
}
}
},
{
"cell_type": "code",
"source": [
"explode_multiple_columns2(df, cols_to_explode)"
],
"outputs": [
{
"output_type": "execute_result",
"execution_count": 72,
"data": {
"text/plain": " A B C D\n0 1 a 1 4\n0 1 a 2 5\n0 1 a NaN 9\n1 2 b 3 6\n1 2 b 4 7\n2 3 c 5 8\n2 3 c 6 NaN",
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>A</th>\n <th>B</th>\n <th>C</th>\n <th>D</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>1</td>\n <td>a</td>\n <td>1</td>\n <td>4</td>\n </tr>\n <tr>\n <th>0</th>\n <td>1</td>\n <td>a</td>\n <td>2</td>\n <td>5</td>\n </tr>\n <tr>\n <th>0</th>\n <td>1</td>\n <td>a</td>\n <td>NaN</td>\n <td>9</td>\n </tr>\n <tr>\n <th>1</th>\n <td>2</td>\n <td>b</td>\n <td>3</td>\n <td>6</td>\n </tr>\n <tr>\n <th>1</th>\n <td>2</td>\n <td>b</td>\n <td>4</td>\n <td>7</td>\n </tr>\n <tr>\n <th>2</th>\n <td>3</td>\n <td>c</td>\n <td>5</td>\n <td>8</td>\n </tr>\n <tr>\n <th>2</th>\n <td>3</td>\n <td>c</td>\n <td>6</td>\n <td>NaN</td>\n </tr>\n </tbody>\n</table>\n</div>"
},
"metadata": {}
}
],
"execution_count": 72,
"metadata": {
"collapsed": true,
"jupyter": {
"source_hidden": false,
"outputs_hidden": false
},
"nteract": {
"transient": {
"deleting": false
}
},
"execution": {
"iopub.status.busy": "2021-09-15T20:34:54.569Z",
"iopub.execute_input": "2021-09-15T20:34:54.577Z",
"iopub.status.idle": "2021-09-15T20:34:54.589Z",
"shell.execute_reply": "2021-09-15T20:34:54.594Z"
}
}
},
{
"cell_type": "markdown",
"source": [
"Works for 3 Columns..."
],
"metadata": {
"nteract": {
"transient": {
"deleting": false
}
}
}
},
{
"cell_type": "code",
"source": [
"d = {\"A\":[1,2,3],\n",
" \"B\":[\"a\",\"b\",\"c\"],\n",
" \"C\":[[1,2],[3,4],[5,6]],\n",
" \"D\":[[4,5,9],[6,7],[8]],\n",
" \"E\":[[49,59,99],[79],[89, 99]], }\n",
"df = pd.DataFrame.from_dict(d)\n",
"df"
],
"outputs": [
{
"output_type": "execute_result",
"execution_count": 73,
"data": {
"text/plain": " A B C D E\n0 1 a [1, 2] [4, 5, 9] [49, 59, 99]\n1 2 b [3, 4] [6, 7] [79]\n2 3 c [5, 6] [8] [89, 99]",
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>A</th>\n <th>B</th>\n <th>C</th>\n <th>D</th>\n <th>E</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>1</td>\n <td>a</td>\n <td>[1, 2]</td>\n <td>[4, 5, 9]</td>\n <td>[49, 59, 99]</td>\n </tr>\n <tr>\n <th>1</th>\n <td>2</td>\n <td>b</td>\n <td>[3, 4]</td>\n <td>[6, 7]</td>\n <td>[79]</td>\n </tr>\n <tr>\n <th>2</th>\n <td>3</td>\n <td>c</td>\n <td>[5, 6]</td>\n <td>[8]</td>\n <td>[89, 99]</td>\n </tr>\n </tbody>\n</table>\n</div>"
},
"metadata": {}
}
],
"execution_count": 73,
"metadata": {
"collapsed": true,
"jupyter": {
"source_hidden": false,
"outputs_hidden": false
},
"nteract": {
"transient": {
"deleting": false
}
},
"execution": {
"iopub.status.busy": "2021-09-15T20:34:55.276Z",
"iopub.execute_input": "2021-09-15T20:34:55.282Z",
"iopub.status.idle": "2021-09-15T20:34:55.293Z",
"shell.execute_reply": "2021-09-15T20:34:55.299Z"
}
}
},
{
"cell_type": "code",
"source": [
"explode_multiple_columns2(df, [\"C\", \"D\", \"E\"])"
],
"outputs": [
{
"output_type": "execute_result",
"execution_count": 74,
"data": {
"text/plain": " A B C D E\n0 1 a 1 4 49\n0 1 a 2 5 59\n0 1 a NaN 9 99\n1 2 b 3 6 79\n1 2 b 4 7 NaN\n2 3 c 5 8 89\n2 3 c 6 NaN 99",
"text/html": "<div>\n<style scoped>\n .dataframe tbody tr th:only-of-type {\n vertical-align: middle;\n }\n\n .dataframe tbody tr th {\n vertical-align: top;\n }\n\n .dataframe thead th {\n text-align: right;\n }\n</style>\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>A</th>\n <th>B</th>\n <th>C</th>\n <th>D</th>\n <th>E</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>1</td>\n <td>a</td>\n <td>1</td>\n <td>4</td>\n <td>49</td>\n </tr>\n <tr>\n <th>0</th>\n <td>1</td>\n <td>a</td>\n <td>2</td>\n <td>5</td>\n <td>59</td>\n </tr>\n <tr>\n <th>0</th>\n <td>1</td>\n <td>a</td>\n <td>NaN</td>\n <td>9</td>\n <td>99</td>\n </tr>\n <tr>\n <th>1</th>\n <td>2</td>\n <td>b</td>\n <td>3</td>\n <td>6</td>\n <td>79</td>\n </tr>\n <tr>\n <th>1</th>\n <td>2</td>\n <td>b</td>\n <td>4</td>\n <td>7</td>\n <td>NaN</td>\n </tr>\n <tr>\n <th>2</th>\n <td>3</td>\n <td>c</td>\n <td>5</td>\n <td>8</td>\n <td>89</td>\n </tr>\n <tr>\n <th>2</th>\n <td>3</td>\n <td>c</td>\n <td>6</td>\n <td>NaN</td>\n <td>99</td>\n </tr>\n </tbody>\n</table>\n</div>"
},
"metadata": {}
}
],
"execution_count": 74,
"metadata": {
"collapsed": true,
"jupyter": {
"source_hidden": false,
"outputs_hidden": false
},
"nteract": {
"transient": {
"deleting": false
}
},
"execution": {
"iopub.status.busy": "2021-09-15T20:34:55.643Z",
"iopub.execute_input": "2021-09-15T20:34:55.648Z",
"iopub.status.idle": "2021-09-15T20:34:55.661Z",
"shell.execute_reply": "2021-09-15T20:34:55.667Z"
}
}
},
{
"cell_type": "code",
"source": [
" "
],
"outputs": [],
"execution_count": 74,
"metadata": {
"collapsed": true,
"jupyter": {
"source_hidden": false,
"outputs_hidden": false
},
"nteract": {
"transient": {
"deleting": false
}
},
"execution": {
"iopub.status.busy": "2021-09-15T20:34:56.023Z",
"iopub.execute_input": "2021-09-15T20:34:56.030Z",
"iopub.status.idle": "2021-09-15T20:34:56.038Z",
"shell.execute_reply": "2021-09-15T20:34:56.043Z"
}
}
},
{
"cell_type": "code",
"source": [],
"outputs": [],
"execution_count": null,
"metadata": {
"collapsed": true,
"jupyter": {
"source_hidden": false,
"outputs_hidden": false
},
"nteract": {
"transient": {
"deleting": false
}
}
}
}
],
"metadata": {
"kernel_info": {
"name": "scenewalk"
},
"language_info": {
"name": "python",
"version": "3.8.0",
"mimetype": "text/x-python",
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"pygments_lexer": "ipython3",
"nbconvert_exporter": "python",
"file_extension": ".py"
},
"kernelspec": {
"argv": [
"/Users/lisa/Documents/virtual_envs/scenewalk/bin/python3",
"-m",
"ipykernel_launcher",
"-f",
"{connection_file}"
],
"display_name": "scenewalk",
"language": "python",
"name": "scenewalk"
},
"nteract": {
"version": "0.28.0"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
@KKarc
Copy link

KKarc commented Sep 9, 2021

Lists with different length don't seem to explode and fill up NaN correctly
explode_df

@lschwetlick
Copy link
Author

lschwetlick commented Sep 15, 2021

Damn, you're right! I can't figure out how to make it work without a loop then though...
I updated the notebook so that it does the thing now!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment