@steven-tey
Last active October 22, 2019 12:25
CS146 Session 7.1 Pre-Class Work
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# CS146 Session 7.1 Pre-Class Work\n",
"\n",
"In the reading on <a href=\"https://ropercenter.cornell.edu/support/polling-fundamentals-total-survey-error/\">Total Survey Error</a> by the Roper Center, there is a table of 95% confidence intervals for sampling error in percentage values in survey results. This margin of error depends on both the number of people surveyed (the sampling size) and the observed outcome for a particular candidate (as a percentage). It turns out there is an error in this table.\n",
"\n",
"1. Using the normal approximation to the binomial distribution, confirm that the 95% confidence interval for the sampling error for sample size 1000 and percentage outcome 10% is 2% (rounded to the nearest integer). Motivate why it is appropriate to use the binomial distribution here.\n",
"\n",
"For this question, we will need to refer to the formula for the central limit theorem:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<center>$\\frac{\\bar{X}-\\mu}{\\sigma/\\sqrt{n}}$</center>\n",
"<br>\n",
"As well as the formula for the margin of error:\n",
"<br>\n",
"<center>$Z\\times\\sqrt{\\frac{p(1-p)}{n}}$</center>\n",
"<br>\n",
"Where:\n",
"\n",
"- $p$ = percentage outcome.\n",
"- $n$ = sample size\n",
"- $z$ = z-score"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[-1.8598530771521182, 1.8617609969851259]\n"
]
}
],
"source": [
"# Defining the parameters the binomial distribution for the sampling error \n",
"# for sample size 1000 and percentage outcome 10%\n",
"\n",
"n = 1000 # sample size - the amount of people sampled\n",
"p = 0.1 # percentage outcome - number of people who gave a positive response\n",
"\n",
"\n",
"# Using normal distribution to approximate the binomial distribution\n",
"# First, we will have to derive the parameters for the normal distribution from the params for the binom dist.\n",
"mu = n*p # mean/expected outcome of the normal distribution is the sample size multiplied by the % outcome \n",
"stdev = np.sqrt(n*p*(1-p)) # std. deviation is just the square root of the variance\n",
"\n",
"# Now, we will draw 1,000,000 random samples from the normal distribution.\n",
"norm_dist = np.random.normal(mu, stdev, 1000000)\n",
"\n",
"# Here, to calculate the 95% confidence intervals, we will need to first obtain the 95% c.i. for the normal \n",
"# distribution, and then use the margin of error formula as well as the Central Limit Theorem\n",
"# to convert it to the 95% c.i. for the binom. dist.\n",
"ci_95 = [(np.percentile(norm_dist, 2.5)-mu)/n*100, (np.percentile(norm_dist, 97.5)-mu)/n*100]\n",
"\n",
"print(ci_95)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"2. Write a Python function for calculating the 95% confidence interval given any sample size and any percentage outcome. Use your function to calculate all the values in the Total Survey Error table rounded to the nearest integer. For which entries does your margin of error differ from the value in the table?"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(100, 0.1, ['-6', '6'])\n",
"(100, 0.2, ['-8', '8'])\n",
"(100, 0.3, ['-9', '9'])\n",
"(100, 0.4, ['-10', '10'])\n",
"(100, 0.5, ['-10', '10'])\n",
"(100, 0.6, ['-10', '10'])\n",
"(100, 0.7, ['-9', '9'])\n",
"(100, 0.8, ['-8', '8'])\n",
"(100, 0.9, ['-6', '6'])\n",
"(250, 0.1, ['-4', '4'])\n",
"(250, 0.2, ['-5', '5'])\n",
"(250, 0.3, ['-6', '6'])\n",
"(250, 0.4, ['-6', '6'])\n",
"(250, 0.5, ['-6', '6'])\n",
"(250, 0.6, ['-6', '6'])\n",
"(250, 0.7, ['-6', '6'])\n",
"(250, 0.8, ['-5', '5'])\n",
"(250, 0.9, ['-4', '4'])\n",
"(500, 0.1, ['-3', '3'])\n",
"(500, 0.2, ['-3', '4'])\n",
"(500, 0.3, ['-4', '4'])\n",
"(500, 0.4, ['-4', '4'])\n",
"(500, 0.5, ['-4', '4'])\n",
"(500, 0.6, ['-4', '4'])\n",
"(500, 0.7, ['-4', '4'])\n",
"(500, 0.8, ['-4', '4'])\n",
"(500, 0.9, ['-3', '3'])\n",
"(750, 0.1, ['-2', '2'])\n",
"(750, 0.2, ['-3', '3'])\n",
"(750, 0.3, ['-3', '3'])\n",
"(750, 0.4, ['-4', '4'])\n",
"(750, 0.5, ['-4', '4'])\n",
"(750, 0.6, ['-3', '4'])\n",
"(750, 0.7, ['-3', '3'])\n",
"(750, 0.8, ['-3', '3'])\n",
"(750, 0.9, ['-2', '2'])\n",
"(1000, 0.1, ['-2', '2'])\n",
"(1000, 0.2, ['-2', '2'])\n",
"(1000, 0.3, ['-3', '3'])\n",
"(1000, 0.4, ['-3', '3'])\n",
"(1000, 0.5, ['-3', '3'])\n",
"(1000, 0.6, ['-3', '3'])\n",
"(1000, 0.7, ['-3', '3'])\n",
"(1000, 0.8, ['-2', '2'])\n",
"(1000, 0.9, ['-2', '2'])\n"
]
}
],
"source": [
"percentage_list = []\n",
"\n",
"def binom_to_norm(n, p):\n",
" mu = n*p \n",
" stdev = np.sqrt(n*p*(1-p))\n",
" norm_dist = np.random.normal(mu, stdev, 1000000)\n",
" ci_95 = [(np.percentile(norm_dist, 2.5)-mu)/n*100, (np.percentile(norm_dist, 97.5)-mu)/n*100]\n",
" percentage_list.append(ci_95[1])\n",
" return(n, p, ['%.0f' % elem for elem in ci_95])\n",
"\n",
"sample_size = [100,250,500,750,1000]\n",
"percentage_outcome = [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]\n",
"\n",
"for n in sample_size:\n",
" for p in percentage_outcome:\n",
" print(binom_to_norm(n,p))"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Percentage Outcome</th>\n",
" <th>n = 1000</th>\n",
" <th>n = 750</th>\n",
" <th>n = 500</th>\n",
" <th>n = 250</th>\n",
" <th>n = 100</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0.1</td>\n",
" <td>2.0</td>\n",
" <td>2.0</td>\n",
" <td>3.0</td>\n",
" <td>4.0</td>\n",
" <td>6.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0.2</td>\n",
" <td>2.0</td>\n",
" <td>3.0</td>\n",
" <td>4.0</td>\n",
" <td>5.0</td>\n",
" <td>8.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>0.3</td>\n",
" <td>3.0</td>\n",
" <td>3.0</td>\n",
" <td>4.0</td>\n",
" <td>6.0</td>\n",
" <td>9.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>0.4</td>\n",
" <td>3.0</td>\n",
" <td>4.0</td>\n",
" <td>4.0</td>\n",
" <td>6.0</td>\n",
" <td>10.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0.5</td>\n",
" <td>3.0</td>\n",
" <td>4.0</td>\n",
" <td>4.0</td>\n",
" <td>6.0</td>\n",
" <td>10.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>0.6</td>\n",
" <td>3.0</td>\n",
" <td>4.0</td>\n",
" <td>4.0</td>\n",
" <td>6.0</td>\n",
" <td>10.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>0.7</td>\n",
" <td>3.0</td>\n",
" <td>3.0</td>\n",
" <td>4.0</td>\n",
" <td>6.0</td>\n",
" <td>9.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>0.8</td>\n",
" <td>2.0</td>\n",
" <td>3.0</td>\n",
" <td>4.0</td>\n",
" <td>5.0</td>\n",
" <td>8.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>8</th>\n",
" <td>0.9</td>\n",
" <td>2.0</td>\n",
" <td>2.0</td>\n",
" <td>3.0</td>\n",
" <td>4.0</td>\n",
" <td>6.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Percentage Outcome n = 1000 n = 750 n = 500 n = 250 n = 100\n",
"0 0.1 2.0 2.0 3.0 4.0 6.0\n",
"1 0.2 2.0 3.0 4.0 5.0 8.0\n",
"2 0.3 3.0 3.0 4.0 6.0 9.0\n",
"3 0.4 3.0 4.0 4.0 6.0 10.0\n",
"4 0.5 3.0 4.0 4.0 6.0 10.0\n",
"5 0.6 3.0 4.0 4.0 6.0 10.0\n",
"6 0.7 3.0 3.0 4.0 6.0 9.0\n",
"7 0.8 2.0 3.0 4.0 5.0 8.0\n",
"8 0.9 2.0 2.0 3.0 4.0 6.0"
]
},
"execution_count": 49,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.DataFrame({ 'Percentage Outcome' : np.array([0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]),\n",
" 'n = 1000' : np.array([round(i) for i in percentage_list[-9:]]),\n",
" 'n = 750' : np.array([round(i) for i in percentage_list[27:36]]),\n",
" 'n = 500' : np.array([round(i) for i in percentage_list[18:27]]),\n",
" 'n = 250' : np.array([round(i) for i in percentage_list[9:18]]),\n",
" 'n = 100' : np.array([round(i) for i in percentage_list[0:9]]), })\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"3. Can you identify where these errors come from?\n",
"\n",
"Not exactly sure..."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
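
The notebook states the margin of error formula $Z\times\sqrt{\frac{p(1-p)}{n}}$ but estimates the intervals by drawing random samples. As a cross-check, here is a minimal sketch that evaluates the formula directly, so the result does not change between runs. The function name `margin_of_error` and the `confidence` parameter are illustrative and do not appear in the notebook. It uses only the standard library and assumes Python 3.8+ for `statistics.NormalDist`.

```python
from math import sqrt
from statistics import NormalDist


def margin_of_error(n, p, confidence=0.95):
    """Half-width of the normal-approximation interval, in percentage points.

    n is the sample size and p the observed proportion (0.1 for 10%).
    """
    # Two-sided z-score for the requested confidence level (about 1.96 for 95%).
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return 100 * z * sqrt(p * (1 - p) / n)


sample_sizes = [1000, 750, 500, 250, 100]
proportions = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

# Unrounded margins, keyed by (sample size, proportion).
table = {(n, p): margin_of_error(n, p) for n in sample_sizes for p in proportions}

# One row per proportion, one column per sample size, shown to two decimals.
for p in proportions:
    row = "  ".join(f"{table[(n, p)]:5.2f}" for n in sample_sizes)
    print(f"{p:.1f}  {row}")
```

Printing the unrounded values shows which entries sit close to a rounding boundary. For example, (500, 0.2) and (750, 0.6) both come out at roughly 3.51, which is probably why the simulated output above reports `['-3', '4']` for those two rows: sampling noise alone can move the rounded integer in either direction.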