lumbric · July 15, 2019 12:49
diff --git a/correlation.ipynb b/correlation.ipynb
 {
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Correlation between drowning by falling and swimming in a pool\n",
    "\n",
    "Is the correlation between drowning by falling and swimming in a pool surprisingly low?\n",
    "\n",
    "https://www.tylervigen.com/spurious-correlations"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Original datasets:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "swimming = 421, 465, 494, 538, 430, 530, 511, 600, 582, 605, 603\n",
    "falling = 109, 102, 102, 98, 85, 95, 96, 98, 123, 94, 102"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Assuming a normal distribution, calculate mean and standard deviation:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "falling_mean = np.mean(falling)\n",
    "falling_std = np.std(falling, ddof=1)\n",
    "\n",
    "swimming_std = np.std(swimming, ddof=1)\n",
    "swimming_mean = np.mean(swimming)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that we can safely assume that `swimming` is normally distributed, since it is the sum of two independent, normally distributed variables: `falling` and (the unknown) `only_swimming`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Using the [formula for the sum of normally distributed random variables](https://en.wikipedia.org/wiki/Sum_of_normally_distributed_random_variables#Independent_random_variables), the standard deviation of `only_swimming` is given by:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "only_swimming_std = np.sqrt(swimming_std**2 - falling_std**2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can now generate new time series:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "N = 100_000  # number of synthetic samples per series\n",
    "\n",
    "# only_swimming has mean (swimming_mean - falling_mean) and the std derived above\n",
    "only_swimming_new = np.random.normal(swimming_mean - falling_mean,\n",
    "                                     only_swimming_std,\n",
    "                                     size=N)\n",
    "\n",
    "falling_new = np.random.normal(falling_mean,\n",
    "                               falling_std,\n",
    "                               size=N)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The correlation between the two generated series should be close to 0, otherwise we would have done something wrong:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[ 1.00000000e+00, -3.92608199e-04],\n",
       "       [-3.92608199e-04,  1.00000000e+00]])"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "\n",
    "np.corrcoef(falling_new, only_swimming_new)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In our artificial data set, the correlation between `falling_new` and the sum `only_swimming_new + falling_new` is not very high:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[1.        , 0.14170122],\n",
       "       [0.14170122, 1.        ]])"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "np.corrcoef(falling_new, only_swimming_new + falling_new)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is very similar to the correlation in the real data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[1.        , 0.17511449],\n",
       "       [0.17511449, 1.        ]])"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "np.corrcoef(falling, swimming)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
 }
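
As a cross-check on the simulation in the notebook, the expected correlation can also be worked out directly. This is only a sketch under the notebook's own assumption that `swimming = falling + only_swimming` with the two terms independent. In that case cov(falling, swimming) = var(falling), so the correlation reduces to `falling_std / swimming_std`, roughly 0.14 for the data above. The names `expected_corr`, `simulated_corr` and `rng`, and the use of a seeded `np.random.default_rng` (NumPy 1.17 or newer), are my additions and not part of the notebook.

```python
import numpy as np

# Data copied from the notebook.
swimming = 421, 465, 494, 538, 430, 530, 511, 600, 582, 605, 603
falling = 109, 102, 102, 98, 85, 95, 96, 98, 123, 94, 102

falling_mean = np.mean(falling)
swimming_mean = np.mean(swimming)
falling_std = np.std(falling, ddof=1)
swimming_std = np.std(swimming, ddof=1)

# Assumption: swimming = falling + only_swimming, with independent terms.
# Then cov(falling, swimming) = var(falling), and therefore
# corr(falling, swimming) = var(falling) / (falling_std * swimming_std)
#                         = falling_std / swimming_std.
expected_corr = falling_std / swimming_std

# Seeded simulation, mirroring the notebook, to compare against the formula.
rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility
only_swimming_std = np.sqrt(swimming_std**2 - falling_std**2)
n = 100_000
falling_new = rng.normal(falling_mean, falling_std, size=n)
only_swimming_new = rng.normal(swimming_mean - falling_mean,
                               only_swimming_std, size=n)
simulated_corr = np.corrcoef(falling_new, falling_new + only_swimming_new)[0, 1]

print(f"expected:  {expected_corr:.4f}")
print(f"simulated: {simulated_corr:.4f}")
```

If the two printed values agree to a couple of decimal places, the simulated correlation is simply the ratio of the two standard deviations and does not depend on the normality assumption beyond the sampling step.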