Hypothesis_Testing_Demo.ipynb
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"name": "Hypothesis_Testing_Demo_DataLit_Week_2.ipynb",
"version": "0.3.2",
"provenance": [],
"include_colab_link": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/adityajn105/1004491c3e5cde543890bada61665717/hypothesis_testing_demo_datalit_week_2.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"metadata": {
"id": "wWNVqI9ImPfj",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"## Hypothesis Testing Demo\n",
"\n",
"### School of AI - DataLit Week 2\n",
"\n",
"#### Any mistakes made here are the sole property of me, Carson Bentley"
]
},
{
"metadata": {
"id": "DwKEURLNkBGB",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"This notebook is based on an article written by Gaël Varoquaux as part of the [Scipy Lecture Notes](http://scipy-lectures.org/packages/statistics/index.html#hypothesis-testing-comparing-two-groups)."
]
},
{
"metadata": {
"id": "REyIYsJCLcc3",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"See the following for a description of how the data was collected:\n",
"\n",
"[Brain Size and Intelligence. Willerman et al. (1991)](https://www3.nd.edu/~busiforc/handouts/Data%20and%20Stories/correlation/Brain%20Size/brainsize.html)"
]
},
{
"metadata": {
"id": "_y8aM6Ow2kHV",
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
"import pandas as pd"
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
"id": "BmaMZKG12Ej6",
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
"URL = 'http://scipy-lectures.org/_downloads/brain_size.csv'"
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
"id": "jw9ml0n52F0F",
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
"df = pd.read_csv(URL, sep=';', na_values=\".\", index_col=0)"
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
"id": "0JH1qEG32ruO",
"colab_type": "code",
"outputId": "f751f301-b043-47fa-b916-e81dfc6b277d",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 407
}
},
"cell_type": "code",
"source": [
"df.head(12)"
],
"execution_count": 0,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Gender</th>\n",
" <th>FSIQ</th>\n",
" <th>VIQ</th>\n",
" <th>PIQ</th>\n",
" <th>Weight</th>\n",
" <th>Height</th>\n",
" <th>MRI_Count</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Female</td>\n",
" <td>133</td>\n",
" <td>132</td>\n",
" <td>124</td>\n",
" <td>118.0</td>\n",
" <td>64.5</td>\n",
" <td>816932</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Male</td>\n",
" <td>140</td>\n",
" <td>150</td>\n",
" <td>124</td>\n",
" <td>NaN</td>\n",
" <td>72.5</td>\n",
" <td>1001121</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Male</td>\n",
" <td>139</td>\n",
" <td>123</td>\n",
" <td>150</td>\n",
" <td>143.0</td>\n",
" <td>73.3</td>\n",
" <td>1038437</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Male</td>\n",
" <td>133</td>\n",
" <td>129</td>\n",
" <td>128</td>\n",
" <td>172.0</td>\n",
" <td>68.8</td>\n",
" <td>965353</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>Female</td>\n",
" <td>137</td>\n",
" <td>132</td>\n",
" <td>134</td>\n",
" <td>147.0</td>\n",
" <td>65.0</td>\n",
" <td>951545</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>Female</td>\n",
" <td>99</td>\n",
" <td>90</td>\n",
" <td>110</td>\n",
" <td>146.0</td>\n",
" <td>69.0</td>\n",
" <td>928799</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>Female</td>\n",
" <td>138</td>\n",
" <td>136</td>\n",
" <td>131</td>\n",
" <td>138.0</td>\n",
" <td>64.5</td>\n",
" <td>991305</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>Female</td>\n",
" <td>92</td>\n",
" <td>90</td>\n",
" <td>98</td>\n",
" <td>175.0</td>\n",
" <td>66.0</td>\n",
" <td>854258</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9</th>\n",
" <td>Male</td>\n",
" <td>89</td>\n",
" <td>93</td>\n",
" <td>84</td>\n",
" <td>134.0</td>\n",
" <td>66.3</td>\n",
" <td>904858</td>\n",
" </tr>\n",
" <tr>\n",
" <th>10</th>\n",
" <td>Male</td>\n",
" <td>133</td>\n",
" <td>114</td>\n",
" <td>147</td>\n",
" <td>172.0</td>\n",
" <td>68.8</td>\n",
" <td>955466</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11</th>\n",
" <td>Female</td>\n",
" <td>132</td>\n",
" <td>129</td>\n",
" <td>124</td>\n",
" <td>118.0</td>\n",
" <td>64.5</td>\n",
" <td>833868</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12</th>\n",
" <td>Male</td>\n",
" <td>141</td>\n",
" <td>150</td>\n",
" <td>128</td>\n",
" <td>151.0</td>\n",
" <td>70.0</td>\n",
" <td>1079549</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Gender FSIQ VIQ PIQ Weight Height MRI_Count\n",
"1 Female 133 132 124 118.0 64.5 816932\n",
"2 Male 140 150 124 NaN 72.5 1001121\n",
"3 Male 139 123 150 143.0 73.3 1038437\n",
"4 Male 133 129 128 172.0 68.8 965353\n",
"5 Female 137 132 134 147.0 65.0 951545\n",
"6 Female 99 90 110 146.0 69.0 928799\n",
"7 Female 138 136 131 138.0 64.5 991305\n",
"8 Female 92 90 98 175.0 66.0 854258\n",
"9 Male 89 93 84 134.0 66.3 904858\n",
"10 Male 133 114 147 172.0 68.8 955466\n",
"11 Female 132 129 124 118.0 64.5 833868\n",
"12 Male 141 150 128 151.0 70.0 1079549"
]
},
"metadata": {
"tags": []
},
"execution_count": 23
}
]
},
{
"metadata": {
"id": "vMOb2RmA2sal",
"colab_type": "code",
"outputId": "556de9a0-2cb0-4f17-c31a-8859cead749c",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 287
}
},
"cell_type": "code",
"source": [
"df.describe()"
],
"execution_count": 0,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>FSIQ</th>\n",
" <th>VIQ</th>\n",
" <th>PIQ</th>\n",
" <th>Weight</th>\n",
" <th>Height</th>\n",
" <th>MRI_Count</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>count</th>\n",
" <td>40.000000</td>\n",
" <td>40.000000</td>\n",
" <td>40.00000</td>\n",
" <td>38.000000</td>\n",
" <td>39.000000</td>\n",
" <td>4.000000e+01</td>\n",
" </tr>\n",
" <tr>\n",
" <th>mean</th>\n",
" <td>113.450000</td>\n",
" <td>112.350000</td>\n",
" <td>111.02500</td>\n",
" <td>151.052632</td>\n",
" <td>68.525641</td>\n",
" <td>9.087550e+05</td>\n",
" </tr>\n",
" <tr>\n",
" <th>std</th>\n",
" <td>24.082071</td>\n",
" <td>23.616107</td>\n",
" <td>22.47105</td>\n",
" <td>23.478509</td>\n",
" <td>3.994649</td>\n",
" <td>7.228205e+04</td>\n",
" </tr>\n",
" <tr>\n",
" <th>min</th>\n",
" <td>77.000000</td>\n",
" <td>71.000000</td>\n",
" <td>72.00000</td>\n",
" <td>106.000000</td>\n",
" <td>62.000000</td>\n",
" <td>7.906190e+05</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25%</th>\n",
" <td>89.750000</td>\n",
" <td>90.000000</td>\n",
" <td>88.25000</td>\n",
" <td>135.250000</td>\n",
" <td>66.000000</td>\n",
" <td>8.559185e+05</td>\n",
" </tr>\n",
" <tr>\n",
" <th>50%</th>\n",
" <td>116.500000</td>\n",
" <td>113.000000</td>\n",
" <td>115.00000</td>\n",
" <td>146.500000</td>\n",
" <td>68.000000</td>\n",
" <td>9.053990e+05</td>\n",
" </tr>\n",
" <tr>\n",
" <th>75%</th>\n",
" <td>135.500000</td>\n",
" <td>129.750000</td>\n",
" <td>128.00000</td>\n",
" <td>172.000000</td>\n",
" <td>70.500000</td>\n",
" <td>9.500780e+05</td>\n",
" </tr>\n",
" <tr>\n",
" <th>max</th>\n",
" <td>144.000000</td>\n",
" <td>150.000000</td>\n",
" <td>150.00000</td>\n",
" <td>192.000000</td>\n",
" <td>77.000000</td>\n",
" <td>1.079549e+06</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" FSIQ VIQ PIQ Weight Height MRI_Count\n",
"count 40.000000 40.000000 40.00000 38.000000 39.000000 4.000000e+01\n",
"mean 113.450000 112.350000 111.02500 151.052632 68.525641 9.087550e+05\n",
"std 24.082071 23.616107 22.47105 23.478509 3.994649 7.228205e+04\n",
"min 77.000000 71.000000 72.00000 106.000000 62.000000 7.906190e+05\n",
"25% 89.750000 90.000000 88.25000 135.250000 66.000000 8.559185e+05\n",
"50% 116.500000 113.000000 115.00000 146.500000 68.000000 9.053990e+05\n",
"75% 135.500000 129.750000 128.00000 172.000000 70.500000 9.500780e+05\n",
"max 144.000000 150.000000 150.00000 192.000000 77.000000 1.079549e+06"
]
},
"metadata": {
"tags": []
},
"execution_count": 12
}
]
},
{
"metadata": {
"id": "ZDFRr3ARDYXd",
"colab_type": "code",
"colab": {}
},
"cell_type": "code",
"source": [
"from scipy import stats"
],
"execution_count": 0,
"outputs": []
},
{
"metadata": {
"id": "99VbahxAQDYw",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"Check out the source code for scipy.stats [here](https://github.com/scipy/scipy/blob/master/scipy/stats/stats.py). Under the hood, scipy uses numpy for the math calculations."
]
},
{
"metadata": {
"id": "u9GCZu0mQyEO",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"### One Sample T-Test"
]
},
{
"metadata": {
"id": "C27UjHZyGqH1",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"I'm curious whether the average IQ scores in this sample differ from the standard average IQ, which I happen to know is 100. In this experiment the null hypothesis is that the mean IQ of the population from which this sample is drawn is 100, and the alternative hypothesis is that it is not.\n",
"\n",
"Let's use 5% as our significance level, alpha."
]
},
{
"metadata": {
"id": "xs5pNzQo26_t",
"colab_type": "code",
"outputId": "65eea3bd-4e90-40b3-9fce-11101a571680",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 70
}
},
"cell_type": "code",
"source": [
"IQ_column_names = ['FSIQ', 'VIQ', 'PIQ']\n",
"\n",
"# One-sample t-test of each IQ score against a hypothesized population mean of 100\n",
"for IQ_column in IQ_column_names:\n",
"    print(stats.ttest_1samp(df[IQ_column], 100))"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"Ttest_1sampResult(statistic=3.532307014238269, pvalue=0.0010766792736967715)\n",
"Ttest_1sampResult(statistic=3.3074146385401786, pvalue=0.002030117404781822)\n",
"Ttest_1sampResult(statistic=3.1030246997178783, pvalue=0.0035555593418294417)\n"
],
"name": "stdout"
}
]
},
{
"metadata": {
"id": "1o0O9d_dI8om",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"Since all three p-values are smaller than alpha, we reject the null hypothesis for each score. That means that sample means of about 113, 112, and 111 (for FSIQ, VIQ, and PIQ) would be unlikely to arise from random variation alone if the population mean were really 100.\n",
"\n",
"Speculating as to why these IQs are above average, I imagined a scenario in which subjects are recruited for a study at a university. Many of these subjects would naturally be students.\n",
"\n",
"As it turns out, this speculation was correct, as you can confirm by looking at the article linked at the top of this notebook."
]
},
{
"metadata": {
"id": "uuU4nxGdQ9NT",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"### Two Sample T-Test"
]
},
{
"metadata": {
"id": "alt_xNJzc5Iu",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"Suppose we want to compare the IQs of men and women."
]
},
{
"metadata": {
"id": "BOLNeiN9drMG",
"colab_type": "code",
"outputId": "747f342e-16bf-42f6-a2b6-931014b487ab",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 123
}
},
"cell_type": "code",
"source": [
"groupby_gender = df.groupby('Gender')\n",
"# Print the mean of each IQ score separately for each gender\n",
"for IQ_column in IQ_column_names:\n",
"    for gender, value in groupby_gender[IQ_column]:\n",
"        print((gender, value.mean()))"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"('Female', 111.9)\n",
"('Male', 115.0)\n",
"('Female', 109.45)\n",
"('Male', 115.25)\n",
"('Female', 110.45)\n",
"('Male', 111.6)\n"
],
"name": "stdout"
}
]
},
{
"metadata": {
"id": "wwi7RZ4beaCn",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"Although the males in this sample have higher average IQs in the three areas tested, these differences are not necessarily statistically significant.\n",
"\n",
"Let's do a two sample t-test to see if any of these differences are significant at a 5% alpha."
]
},
{
"metadata": {
"id": "yG3Bfa9nURwC",
"colab_type": "code",
"outputId": "bb7c604c-7235-4f9b-c1f9-5696e42d799a",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 70
}
},
"cell_type": "code",
"source": [
"female = df[df['Gender'] == 'Female']\n",
"male = df[df['Gender'] == 'Male']\n",
"# Independent two-sample t-test (equal variances assumed by default) for each IQ score\n",
"for IQ_column in IQ_column_names:\n",
"    print(stats.ttest_ind(female[IQ_column], male[IQ_column]))"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"Ttest_indResult(statistic=-0.4026724743703011, pvalue=0.6894456253897778)\n",
"Ttest_indResult(statistic=-0.7726161723275011, pvalue=0.44452876778583217)\n",
"Ttest_indResult(statistic=-0.15980113150762698, pvalue=0.8738841403250049)\n"
],
"name": "stdout"
}
]
},
{
"metadata": {
"id": "WPQ94WdMfWlD",
"colab_type": "text"
},
"cell_type": "markdown",
"source": [
"Since all three p-values are greater than 5%, we fail to reject the null hypothesis in each case. In other words, the differences in mean IQ between females and males in the observed data are not statistically significant."
]
}
]
}