Math
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"# Expected Value and Arithmetic mean\n",
"\n",
"The expected value of a random variable is the probability-weighted average of all possible values.\n",
"When these probabilities are equal, the expected value is the same as arithmetic mean, defined as the sum of the observations divided by the number of observations:\n",
"$$\\mu = \\frac{\\sum_{i=1}^N X_i}{N}$$\n",
"\n",
"where $X_1, X_2, \\ldots , X_N$ are our observations.\n",
"\n",
"For example, if a dice is rolled repeatedly many times, we expect all numbers from 1 - 6 to show up an equal number of times. So the expected value in rolling a six-sided die is 3.5.\n"
]
},
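{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick check of the die example, here is the calculation written out, assuming a fair die so that each face has probability $1/6$:\n",
"$$E[X] = \\sum_{i=1}^{6} \\frac{1}{6}\\, i = \\frac{1+2+3+4+5+6}{6} = \\frac{21}{6} = 3.5$$"
]
},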
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Mean of x1: 54 / 9 = 6.0\n",
"Mean of x2: 154 / 10 = 15.4\n"
]
}
],
"source": [
"from __future__ import print_function\n",
"from auquanToolbox.dataloader import load_data_nologs\n",
"import numpy as np\n",
"import scipy.stats as stats\n",
"\n",
"# Let's say the random variables x1 and x2 have the following values\n",
"x1 = [10,9,8,5,6,7,4,3,2]\n",
"x2 = x1 + [100]\n",
"\n",
"print ('Mean of x1:', sum(x1), '/', len(x1), '=', np.mean(x1))\n",
"print ('Mean of x2:', sum(x2), '/', len(x2), '=', np.mean(x2))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When the probabilities of different observations are not equal, i.e a random variable $X$ can take value $X_1$ with probability $p_1$, $X_2$ with probability $p_2$, and so on, the expected value of X is the same as <i>weighted</i> arithmetic mean.\n",
"The weighted arithmetic mean is defined as\n",
"$$\\sum_{i=1}^n p_i X_i $$\n",
"\n",
"where $\\sum_{i=1}^n p_i = 1$"
]
},
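{
"cell_type": "markdown",
"metadata": {},
"source": [
"To illustrate the weighted mean, consider a hypothetical random variable (the values and probabilities here are made up for illustration) that takes the value 1 with probability 0.5, 2 with probability 0.3, and 3 with probability 0.2. The probabilities sum to 1, and\n",
"$$E[X] = 0.5 \\cdot 1 + 0.3 \\cdot 2 + 0.2 \\cdot 3 = 1.7$$\n",
"\n",
"The unweighted mean of the three values would be 2; the weighted mean is lower because the smallest value is the most likely."
]
},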
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Therefore, the expected value is the average of all values obtained you perform the experiment it represents many times. This follows from the law of large numbers - the average of the results obtained from a large number of repetitions of an experiment should be close to the expected value, and will tend to become closer as more trials are performed.\n",
"\n",
"### Some properties of expected values that are handy:\n",
"* The expected value of a constant is equal to the constant itself $E[c] = c$\n",
"* The expected value is linear, i.e $E[aX+bY] = aE[X]+bE[Y]$ \n",
"* If $X \\leq Y$ , then $E[X] \\leq E[Y]$\n",
"* The expected value not multiplicative, i.e. $E[XY]$ is not necessarily equal to $E[X]E[Y]$. \n",
" The amount by which they differ is called the covariance, covered in a later notebook.\n",
" $Cov(X,Y)=E[XY]-E[X]E[Y]$\n",
" If X and Y are uncorrelated, $Cov(X,Y)=0$\n"
]
},
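{
"cell_type": "markdown",
"metadata": {},
"source": [
"A small worked example may help show why the expected value is not multiplicative. Assume, purely for illustration, that $X$ is the outcome of a fair coin coded as 0 or 1, and that $Y = X$. Then $E[X] = E[Y] = \\frac{1}{2}$, while $XY = X^2 = X$, so\n",
"$$E[XY] = \\frac{1}{2} \\neq \\frac{1}{4} = E[X]E[Y]$$\n",
"\n",
"and the difference is the covariance:\n",
"$$Cov(X,Y) = E[XY] - E[X]E[Y] = \\frac{1}{2} - \\frac{1}{4} = \\frac{1}{4}$$"
]
},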
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Other measures of centrality that are commonly used are:"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"* Median\n",
"\n",
"Number which appears in the middle of the list when it is sorted in increasing or decreasing order, i.e. the value in $(n+1)/2$ when $n$ is odd and the average of the values in $n/2$ and $(n+2)/2$ positions when $n$ is even. One advantage of using median in describing data compared to the mean is that it is not skewed so much by extremely large or small values\n",
"\n",
"The median us the value that splits the data set in half, but not how much smaller or larger the other values are."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Median of x1: 6.0\n",
"Median of x2: 6.5\n"
]
}
],
"source": [
"print('Median of x1:', np.median(x1))\n",
"print('Median of x2:', np.median(x2))"
]
},
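{
"cell_type": "markdown",
"metadata": {},
"source": [
"The position rule can be checked by hand on the two lists used in the code cells. Sorted, $x_1$ is $2, 3, 4, 5, 6, 7, 8, 9, 10$ with $n = 9$, so the median is the value in position $(9+1)/2 = 5$, which is 6. Adding 100 gives $n = 10$, so the median of $x_2$ is the average of the values in positions 5 and 6:\n",
"$$\\frac{6 + 7}{2} = 6.5$$\n",
"\n",
"The single extreme observation moves the mean from 6 to 15.4, but the median only from 6 to 6.5."
]
},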
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* Mode\n",
"\n",
"Most frequently occuring value in a data set. The mode of a probability distribution is the value x at which its probability distribution function takes its maximum value."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"All of the modes of x1: No mode\n"
]
}
],
"source": [
"def mode(l):\n",
" # Count the number of times each element appears in the list\n",
" counts = {}\n",
" for e in l:\n",
" if e in counts:\n",
" counts[e] += 1\n",
" else:\n",
" counts[e] = 1\n",
" \n",
" # Return the elements that appear the most times\n",
" maxcount = 0\n",
" modes = {}\n",
" for key in counts:\n",
" if counts[key] > maxcount:\n",
" maxcount = counts[key]\n",
" modes = {key}\n",
" elif counts[key] == maxcount:\n",
" modes.add(key)\n",
" \n",
" if maxcount > 1 or len(l) == 1:\n",
" return list(modes)\n",
" return 'No mode'\n",
" \n",
"print ('All of the modes of x1:', mode(x1))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* Geometric mean\n",
"\n",
"It is the central tendency of a set of numbers by using the product of their values (as opposed to the arithmetic mean which uses their sum). The geometric mean is defined as the nth root of the product of n numbers:\n",
"$$ G = \\sqrt[n]{X_1X_1\\ldots X_n} $$\n",
"\n",
"for observations $X_i \\geq 0$. We can also rewrite it as an arithmetic mean using logarithms:\n",
"$$ \\ln G = \\frac{\\sum_{i=1}^n \\ln X_i}{n} $$\n",
"\n",
"The geometric mean is always less than or equal to the arithmetic mean (when working with nonnegative observations), with equality only when all of the observations are the same."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Geometric mean of x1: 5.35627121246\n",
"Geometric mean of x2: 7.1775512683\n"
]
}
],
"source": [
"# Use scipy's gmean function to compute the geometric mean\n",
"print ('Geometric mean of x1:', stats.gmean(x1))\n",
"print ('Geometric mean of x2:', stats.gmean(x2))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If we have stocks returns $R_1, \\ldots, R_T$ over different times, we use the geometric mean to calculate average return $R_G$ so that if the rate of return over the whole time period were constant and equal to $R_G$, the final price of the security would be the same as in the case of returns $R_1, \\ldots, R_T$.\n",
"$$ R_G = \\sqrt[T]{(1 + R_1)\\ldots (1 + R_T)} - 1$$"
]
},
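{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough illustration with made-up numbers, suppose a stock gains 10% in one period and loses 10% in the next, so $R_1 = 0.10$ and $R_2 = -0.10$. The arithmetic mean of the two returns is 0, but the geometric mean return is\n",
"$$ R_G = \\sqrt{(1 + 0.10)(1 - 0.10)} - 1 = \\sqrt{0.99} - 1 \\approx -0.005$$\n",
"\n",
"This matches what happens to the price: 100 grows to 110 and then falls to 99, a small overall loss that the arithmetic mean hides."
]
},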
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* Harmonic mean\n",
"\n",
"The harmonic mean is less commonly used than the other types of means. It is defined as\n",
"$$ H = \\frac{n}{\\sum_{i=1}^n \\frac{1}{X_i}} $$\n",
"\n",
"As with the geometric mean, we can rewrite the harmonic mean to look like an arithmetic mean. The reciprocal of the harmonic mean is the arithmetic mean of the reciprocals of the observations:\n",
"$$ \\frac{1}{H} = \\frac{\\sum_{i=1}^n \\frac{1}{X_i}}{n} $$\n",
"\n",
"The harmonic mean for nonnegative numbers $X_i$ is always at most the geometric mean (which is at most the arithmetic mean), and they are equal only when all of the observations are equal."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Harmonic mean of x1: 4.66570664472\n",
"Harmonic mean of x2: 5.15738201465\n"
]
}
],
"source": [
"print ('Harmonic mean of x1:', stats.hmean(x1))\n",
"print ('Harmonic mean of x2:', stats.hmean(x2))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The harmonic mean can be used when the data can be naturally phrased in terms of ratios. "
]
},
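{
"cell_type": "markdown",
"metadata": {},
"source": [
"One standard example of such ratio data, with numbers chosen only for illustration, is average speed. If a trip covers two legs of equal distance, one at 60 km/h and one at 40 km/h, the average speed over the whole trip is the harmonic mean of the two speeds:\n",
"$$ H = \\frac{2}{\\frac{1}{60} + \\frac{1}{40}} = \\frac{2}{\\frac{5}{120}} = 48 $$\n",
"\n",
"The arithmetic mean, 50 km/h, overstates the result because more time is spent on the slower leg."
]
},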
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Variance and Standard Deviation\n",
"\n",
"Variance and Standard Deviation are measures of dispersion of dataset from the mean.\n",
"\n",
"We can define the mean absolute deviation as the average of the distances of observations from the arithmetic mean. We use the absolute value of the deviation, so that 5 above the mean and 5 below the mean both contribute 5, because otherwise the deviations always sum to 0.\n",
"\n",
"$$ MAD = \\frac{\\sum_{i=1}^n |X_i - \\mu|}{n} $$\n",
"\n",
"where $n$ is the number of observations and $\\mu$ is their mean.\n",
"\n",
"Instead of using absolute deviations, we can use the squared deviations, this is called **variance** $\\sigma^2$ : the average of the squared deviations around the mean:\n",
"$$ \\sigma^2 = \\frac{\\sum_{i=1}^n (X_i - \\mu)^2}{n} $$\n",
"\n",
"**Standard deviation** is simply the square root of the variance, $\\sigma$, and it is the easier of the two to interpret because it is in the same units as the observations.\n",
"\n",
"Note that variance is additive while standard deviation is not."
]
},
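{
"cell_type": "markdown",
"metadata": {},
"source": [
"These definitions can be worked through by hand for the list $x_1 = [10,9,8,5,6,7,4,3,2]$ used in the code cells, whose mean is $\\mu = 6$. The deviations from the mean are $4, 3, 2, -1, 0, 1, -2, -3, -4$, which sum to 0 as noted above. Taking absolute values and squares instead gives\n",
"$$ MAD = \\frac{4+3+2+1+0+1+2+3+4}{9} = \\frac{20}{9} \\approx 2.22 $$\n",
"$$ \\sigma^2 = \\frac{16+9+4+1+0+1+4+9+16}{9} = \\frac{60}{9} \\approx 6.67, \\qquad \\sigma = \\sqrt{\\frac{60}{9}} \\approx 2.58 $$"
]
},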
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Variance of x1: 6.66666666667\n",
"Standard deviation of x1: 2.58198889747\n",
"Variance of x2: 801.24\n",
"Standard deviation of x2: 28.3061830701\n"
]
}
],
"source": [
"print('Variance of x1:', np.var(x1))\n",
"print('Standard deviation of x1:', np.std(x1))\n",
"print('Variance of x2:', np.var(x2))\n",
"print('Standard deviation of x2:', np.std(x2))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Standard deviation indicates the amount of variation in a set of data values. A low standard deviation indicates that the data points tend to be close to the expected value, while a high standard deviation indicates that the data points are spread out over a wider range of values.\n",
" \n",
"### Some properties of standard deviation that are handy:\n",
"\n",
"* The standard deviation of a constant is equal to 0\n",
"* Standard deviations cannot be added. Therefore, $\\sigma(X+Y)\\neq \\sigma(X) + \\sigma(Y)$\n",
"* However, variance, can be added. Infact, $\\sigma^2(X+Y) = \\sigma^2(X) + \\sigma^2(Y) + Cov(X,Y)$\n",
"* If X and Y are uncorrelated, $Cov(X,Y)=0$ and $\\sigma^2(X+Y) = \\sigma^2(X) + \\sigma^2(Y)$\n",
"\n",
"## Volatility\n",
"\n",
"If an experiment is performed daily and the results of an experiment on one day do not affect the on their results any other day, daily observation are uncorrelated. If we measure daily standard deviation as $\\sigma_i$ then we can calculate the standard deviation for an year, also called annualized standard deviation as:\n",
"$$\\sigma_{ann} = \\sqrt{\\sum_{i=1}^T \\sigma_i^2}$$\n",
"\n",
"In finance, we sum over all trading days and this annualized standard deviation is called **Volatility**."
]
},
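{
"cell_type": "markdown",
"metadata": {},
"source": [
"A common special case, sketched here under the extra assumption that every day has the same standard deviation $\\sigma_i = \\sigma_d$, is that the sum collapses to a square-root-of-time rule:\n",
"$$\\sigma_{ann} = \\sqrt{\\sum_{i=1}^T \\sigma_d^2} = \\sqrt{T\\,\\sigma_d^2} = \\sigma_d\\sqrt{T}$$\n",
"\n",
"For instance, taking $T = 252$ trading days (a widely used convention, not something fixed by this notebook) and an illustrative daily standard deviation of 1%, the annualized figure is about $0.01 \\times \\sqrt{252} \\approx 0.159$, or roughly 15.9%."
]
},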
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# These are Only Estimates\n",
"\n",
"It is important to remember that when we are working with a subset of actuale data, these computations will only give you sample statistics, that is mean and standard deviation of a sample of data. Whether or not this reflects the current true population mean and standard deviation is not always obvious, and more effort has to be put into determining that. This is especially problematic in finance because all data are time series and the mean and variance may change over time. In general do not assume that because something is true of your sample, it will remain true going forward."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.13"
}
},
"nbformat": 4,
"nbformat_minor": 0
}