Created
July 15, 2019 12:49
Correlation
{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# Correlation between drowning by falling and swimming in a pool\n", | |
"\n", | |
"Is the correlation between drowning by falling and swimming in a pool surprisingly low?\n", | |
"\n", | |
"https://www.tylervigen.com/spurious-correlations" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 1, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"import numpy as np" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Original datasets:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 2, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"swimming = 421, 465, 494, 538, 430, 530, 511, 600, 582, 605, 603\n", | |
"falling = 109, 102, 102, 98, 85, 95, 96, 98, 123, 94, 102" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Assuming a normal distribution, calculate mean and standard deviation:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 3, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"falling_mean = np.mean(falling)\n", | |
"falling_std = np.std(falling, ddof=1)\n", | |
"\n", | |
"swimming_std = np.std(swimming, ddof=1)\n", | |
"swimming_mean = np.mean(swimming)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Note that we can safely assume that `swimming` is normally distributed, since it is the sum of two normally distributed variables: `falling` and (the unknown) `only_swimming`." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Using the [formula for the sum of independent, normally distributed random variables](https://en.wikipedia.org/wiki/Sum_of_normally_distributed_random_variables#Independent_random_variables), the standard deviation of `only_swimming` is given by:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 4, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"only_swimming_std = np.sqrt(swimming_std**2 - falling_std**2)" | |
] | |
}, | |
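{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"As a brief sketch of where this comes from: write $\\sigma_S$, $\\sigma_F$ and $\\sigma_O$ for the standard deviations of `swimming`, `falling` and `only_swimming` (symbols introduced here for illustration only). Assuming `swimming` is the sum of the other two and that those two are independent, the variances add:\n", | |
"\n", | |
"$$\\sigma_S^2 = \\sigma_F^2 + \\sigma_O^2 \\quad\\Rightarrow\\quad \\sigma_O = \\sqrt{\\sigma_S^2 - \\sigma_F^2}$$\n", | |
"\n", | |
"The square root is only real if `swimming_std` is larger than `falling_std`." | |
] | |
}, | |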
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"We can now generate new time series:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 5, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"N = 100_000  # number of samples to draw\n", | |
"\n", | |
"# only_swimming: the mean is the difference of the two observed means\n", | |
"only_swimming_new = np.random.normal(swimming_mean - falling_mean,\n", | |
"                                     only_swimming_std,\n", | |
"                                     size=N)\n", | |
"\n", | |
"falling_new = np.random.normal(falling_mean,\n", | |
"                               falling_std,\n", | |
"                               size=N)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"The correlation between the two generated series should be close to 0; otherwise we would have done something wrong:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 6, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"array([[ 1.00000000e+00, -3.92608199e-04],\n", | |
" [-3.92608199e-04, 1.00000000e+00]])" | |
] | |
}, | |
"execution_count": 6, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"\n", | |
"np.corrcoef(falling_new, only_swimming_new)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"In our artificial data set, the correlation between `falling_new` and the sum `only_swimming_new + falling_new` is not very high:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 7, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"array([[1. , 0.14170122],\n", | |
" [0.14170122, 1. ]])" | |
] | |
}, | |
"execution_count": 7, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"np.corrcoef(falling_new, only_swimming_new + falling_new)" | |
] | |
}, | |
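{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"A low value is what this model predicts. As a sketch under the same independence assumption, write $F$ for `falling`, $O$ for `only_swimming` and $S = F + O$ (symbols introduced here for illustration only). Since $\\operatorname{Cov}(F, O) = 0$:\n", | |
"\n", | |
"$$\\operatorname{Corr}(F, S) = \\frac{\\operatorname{Cov}(F, F + O)}{\\sigma_F \\sigma_S} = \\frac{\\sigma_F^2}{\\sigma_F \\sigma_S} = \\frac{\\sigma_F}{\\sigma_S}$$\n", | |
"\n", | |
"So the expected correlation is simply `falling_std / swimming_std`, which is small whenever `falling` varies much less than `swimming`." | |
] | |
}, | |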
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"The correlation in the real data is similar:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 8, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"data": { | |
"text/plain": [ | |
"array([[1. , 0.17511449],\n", | |
" [0.17511449, 1. ]])" | |
] | |
}, | |
"execution_count": 8, | |
"metadata": {}, | |
"output_type": "execute_result" | |
} | |
], | |
"source": [ | |
"np.corrcoef(falling, swimming)" | |
] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python 3", | |
"language": "python", | |
"name": "python3" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.7.3" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 2 | |
} |