Last active
May 28, 2018 18:09
{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# Pandas and Series of Bool\n", | |
"\n", | |
"I cannot find an acceptable way to make pandas recognize a Series of booleans as such when it contains some `None` values." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 1, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"import numpy as np\n", | |
"import pandas as pd" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 2, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"def random_bool(freq_true, freq_missing=0, missing_value=None):\n", | |
"    # Return missing_value with probability freq_missing, otherwise True with probability freq_true.\n", | |
"    if np.random.random() > freq_missing:\n", | |
"        return np.random.random() < freq_true\n", | |
"    else:\n", | |
"        return missing_value" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Context\n", | |
"\n", | |
"First, we create a Series of booleans in which some rows are deliberately filled with missing values:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 3, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"sb = pd.Series([random_bool(0.4, 0.2) for _ in range(10)])" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 4, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"0 True\n", | |
"1 True\n", | |
"2 None\n", | |
"3 False\n", | |
"4 None\n", | |
"5 False\n", | |
"6 False\n", | |
"7 None\n", | |
"8 False\n", | |
"9 True\n", | |
"dtype: object" | |
] | |
}, | |
"execution_count": 4, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"sb" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Issue\n", | |
"Pandas assigns the `object` dtype to this Series because it contains some `None` values. <br>(Note that `int` and `float` data do not behave this way: there, `None` is converted to `NaN` and the Series gets dtype `float64`.)\n", | |
"\n", | |
"**I would like pandas to find out by itself that the Series is of \"`dtype: bool`\" instead of \"`dtype: object`\".**\n", | |
"\n", | |
"I would like to achieve that **only by looking at the data in the Series**, hence **without using `astype`**!\n", | |
"\n", | |
"\n", | |
"\n", | |
"### Naive solution\n", | |
"\n", | |
"So far, the only solution I have thought of is to filter out the `None` values:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 5, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"0 True\n", | |
"1 True\n", | |
"3 False\n", | |
"5 False\n", | |
"6 False\n", | |
"8 False\n", | |
"9 True\n", | |
"dtype: object" | |
] | |
}, | |
"execution_count": 5, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"sb[sb.notna()]" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"In fact, filtering alone is not sufficient, because the `dtype` of the Series is not updated by that step.\n", | |
"\n", | |
"So I have to rebuild the Series from a list in order to finally reach \"`dtype: bool`\":" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 6, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"sb2 = pd.Series(sb[sb.notna()].tolist())" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 7, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"0 True\n", | |
"1 True\n", | |
"2 False\n", | |
"3 False\n", | |
"4 False\n", | |
"5 False\n", | |
"6 True\n", | |
"dtype: bool" | |
] | |
}, | |
"execution_count": 7, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"sb2" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 8, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"dtype('bool')" | |
] | |
}, | |
"execution_count": 8, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"sb2.dtype" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Which I find **terrible**!\n", | |
"\n", | |
"### Help needed: another solution?\n", | |
"\n", | |
"Would you have another idea? :)\n", | |
"\n", | |
"A solution that requires neither building a new Series by hand nor generating a list would be perfect!" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python 2", | |
"language": "python", | |
"name": "python2" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 2 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython2", | |
"version": "2.7.6" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 2 | |
} |
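
One possible direction, offered as a sketch rather than a definitive answer: after dropping the missing values, `Series.infer_objects()` asks pandas to re-infer the dtype of an `object` Series from the values it actually holds, so no `astype` call and no round trip through a list is needed. The fixed `sb` below is a stand-in for the randomly generated Series in the notebook (its values are copied from the displayed output), and the variable names `numeric_dtype`, `mixed_dtype` and `sb_bool` are only illustrative. This assumes a pandas version that provides `infer_objects` (0.21 or later).

```python
import pandas as pd

# Stand-in for the notebook's random `sb`, fixed here for reproducibility.
sb = pd.Series([True, True, None, False, None, False, False, None, False, True])

# With numbers, None is stored as NaN and the Series gets a float dtype ...
numeric_dtype = pd.Series([1, None, 3]).dtype
# ... whereas booleans mixed with None fall back to the generic object dtype.
mixed_dtype = sb.dtype

# Drop the missing rows, then let pandas re-infer the dtype from the data.
# No astype() and no intermediate Python list are involved.
sb_bool = sb.dropna().infer_objects()

print(numeric_dtype)
print(mixed_dtype)
print(sb_bool.dtype)
print(list(sb_bool.index))  # the original index labels are preserved
```

Note that `infer_objects()` still returns a new Series object, so this only removes the list round trip, and the missing rows are gone from the result. If the missing rows need to be kept, more recent pandas releases (1.0 and later, which do not run on the Python 2 kernel used here) offer a nullable `boolean` dtype; `sb.convert_dtypes()` should select it from the data alone, though that is untested against this notebook.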