Lecture notes for Mathematical Biostatistics Boot Camp 1 by Brian Caffo, PhD, at Coursera: https://www.coursera.org/course/biostats

Plotting Data

Histogram

Histograms display a sample estimate of the density or mass function by plotting a bar graph of the frequency or proportion of times that a variable takes specific values, or a range of values for continuous data, within a sample

Pros and Cons

  • Histograms are useful and easy, and apply to continuous, discrete and even unordered data
  • They use a lot of ink and space to display very little information
  • It's difficult to display several at the same time for comparisons
  • Also, for this data it's probably preferable to consider log base 10, since the raw histogram simply says that most islands are small (see the sketch below)
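A minimal sketch, assuming the data in question is R's built-in islands areas:

data(islands)
hist(islands)          # raw scale: most islands are small, so the display says little
hist(log10(islands))   # log base 10 spreads the distribution out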

Stem and leaf plot

  • Stem-and-leaf plots are extremely useful for getting distribution information on the fly
  • Read the text about creating them
  • They display the complete data set and so waste very little ink
  • Two data sets' stem and leaf plots can be shown back-to-back for comparisons
  • Created by John Tukey, a leading figure in the development of statistical science and signal processing (see the sketch below)
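R's built-in stem function produces one directly; a quick sketch, assuming the same islands data as in the histogram sketch above:

data(islands)
stem(log10(islands))   # stem-and-leaf display of the log10 island areas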

Dotchart

  • Dotcharts simply display a data set, one point per dot
  • Ordering of the dots and labeling of the axes can then display additional information
  • Dotcharts show a complete data set and so have high data density
  • May be impossible to construct/difficult to interpret for data sets with lots of points

Dotplots comparing grouped data

  • For data sets in groups, you often want to display density information by group
  • If the size of the data permits, displaying the whole data set is preferable
  • Add horizontal lines to depict means or medians
  • Add vertical lines to depict variation, such as confidence intervals or interquartile ranges
  • Jitter the points to avoid overplotting (see `jitter`)

Example

  • The InsectSprays dataset contains counts of insect deaths by insecticide type
    (A,B,C,D,E,F)
  • You can obtain the data set with the command
    data(InsectSprays)
attach(InsectSprays)
plot(c(.5, 6.5), range(count), type = "n",
     xlab = "Spray type", ylab = "Insect count")   # empty frame sized to the data
sprayTypes <- unique(spray)
for (i in 1 : length(sprayTypes)) {
  y <- count[spray == sprayTypes[i]]
  n <- sum(spray == sprayTypes[i])
  # jittered points for the group, a thick horizontal line at the group mean,
  # and a vertical line spanning an approximate 95% CI for the mean
  points(jitter(rep(i, n), amount = .1), y)
  lines(i + c(.12, .28), rep(mean(y), 2), lwd = 3)
  lines(rep(i + .2, 2),
        mean(y) + c(-1.96, 1.96) * sd(y) / sqrt(n)
        )
}

Boxplots

  • Boxplots are useful for the same sort of display as the dot chart but in instances where displaying the whole data set is not possible
  • The centerline of the box represents the median, while the box edges correspond to the quartiles
  • Whiskers extend out to a constant times the IQR or the max value
  • Sometimes potential outliers are denoted by points beyond the whiskers
  • Skewness is indicated by the centerline being near one of the box edges (see the sketch below)
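A minimal sketch using the InsectSprays data from the dotplot example above:

data(InsectSprays)
boxplot(count ~ spray, data = InsectSprays,
        xlab = "Spray type", ylab = "Insect count")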

# The Jackknife

  • The jackknife is a tool for estimating standard errors and the bias of estimators
  • As its name suggests, the jackknife is a small, handy tool; in contrast, the bootstrap is the moral equivalent of a giant workshop full of tools
  • Both the jackknife and the bootstrap involve resampling data; that is, repeatedly creating new data sets from the original data
  • The jackknife deletes each observation and calculates an estimate based on the remaining n - 1 of them
  • It uses this collection of estimates to do things like estimate the bias and the standard error
  • Note that estimating the bias and having a standard error are not needed for quantities like the sample mean, which we know is an unbiased estimate of the population mean and whose standard error we know
  • We'll consider the jackknife for univariate data
  • Let X1,...,Xn be a collection of data used to estimate a parameter θ
  • Let θ̂ be the estimate based on the full data set
  • Let θ̂i be the estimate of θ obtained by deleting observation i
  • Let θ̄ be the average of the delete-one estimates, θ̄ = (1/n) ∑ θ̂i
  • Then, the jackknife estimate of the bias is

      (n - 1)(θ̄ - θ̂)

    (how far the average delete-one estimate is from the actual estimate)
  • The jackknife estimate of the standard error is

      [ ((n - 1)/n) ∑ (θ̂i - θ̄)^2 ]^(1/2)

    (the deviance of the delete-one estimates from the average delete-one estimate)

Example

  • Consider the data set of 630 measurements of gray matter volumes for workers from a lead manufacturing plant
  • The median gray matter volume is around 589 cubic centimeters
  • We want to estimate the bias and standard error of the median

The gist of the code

n <- length(gmVol)
theta <- median(gmVol)                       # estimate from the full data set
jk <- sapply(1 : n,
             function(i) median(gmVol[-i])   # delete-one estimates
             )
thetaBar <- mean(jk)                         # average of the delete-one estimates
biasEst <- (n - 1) * (thetaBar - theta)
seEst <- sqrt((n - 1) * mean((jk - thetaBar)^2))

or, using the bootstrap package

library(bootstrap)
out <- jackknife(gmVol, median)
out$jack.se
out$jack.bias
  • Both methods (of course) yield an estimated bias of 0 and an SE of 9.94
  • Odd little fact: the jackknife estimate of the bias for the median is always 0 when the number of observations is even
  • It has been shown that the jackknife is a linear approximation to the bootstrap
  • Generally do not use the jackknife for sample quantiles like the median
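Since the jackknife is discouraged for the median, a minimal nonparametric bootstrap sketch of the same standard error (assuming the gmVol vector from above; B = 1000 resamples is an arbitrary choice):

B <- 1000
bootMedians <- sapply(1 : B,
                      function(b) median(sample(gmVol, replace = TRUE))
                      )
sd(bootMedians)   # bootstrap estimate of the standard error of the median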

# Expected values

  • The expected value or mean of a random variable is the center of its distribution
  • For a discrete random variable X with PMF p(x), it is defined as follows:
      E[X] = sum(x * p(x))
    where the sum is taken over the possible values of x
  • E[X] represents the center of mass of a collection of locations and weights, {x, p(x)} (see the die-roll sketch below)
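A small sketch of the center-of-mass idea for a fair die roll:

x <- 1 : 6
p <- rep(1 / 6, 6)
sum(x * p)   # E[X] = 3.5, the center of mass of the locations x with weights p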

## Rules about expected values

  • The expected value is a linear operator
  • If a and b are not random and X and Y are two random variables then
    • E[aX + b] = aE[X] + b
    • E[X + Y] = E[X] + E[Y]
  • In general if g is a function that is not linear,
        E[g(X)]≠g(E[X])
  • For example, in general, E[X**2]≠E[X]**2

### Remarks

  • Note that E[X̄] = E[(1/n) ∑ Xi] = (1/n) ∑ E[Xi] = μ; therefore, the expected value of the sample mean is the population mean that it's trying to estimate
  • When the expected value of an estimator is what it's trying to estimate, we say that the estimator is unbiased

The variance

  • The variance of a random variable is a measure of spread
  • If X is a random variable with mean μ, the variance of X is defined as
      Var(X) = E[(X - μ)**2]
    the expected (squared) distance from the mean
  • Densities with a higher variance are more spread out than densities with a lower variance
  • Convenient computational form
      Var(X) = E[X**2] - E[X]**2
  • If a is constant then Var(aX) = a**2 Var(X)
  • The square root of the variance is called the standard deviation
  • The standard deviation has the same units as X
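Continuing the die-roll sketch from above, the computational form gives the variance directly:

x <- 1 : 6
p <- rep(1 / 6, 6)
EX <- sum(x * p)       # E[X]
EX2 <- sum(x^2 * p)    # E[X**2]
EX2 - EX^2             # Var(X) = 35/12, about 2.92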

Interpreting variances

  • Chebyshev's inequality is useful for interpreting variances
  • This inequality states that P(|X - μ| ≥ kσ) ≤ 1/k**2
  • For example, the probability that a random variable lies beyond k standard deviations from its mean is less than 1/k**2
      2σ → 25%
      3σ → 11%
      4σ → 6%
  • Note this is only a bound; the actual probability might be quite a bit smaller

Proof of Chebyshev's inequality
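A sketch of the standard argument (for a random variable X with mean μ and variance σ2): the indicator of the event {|X - μ| ≥ kσ} is never larger than (X - μ)²/(k²σ²), so taking expectations of both sides gives

$$
P(|X - \mu| \ge k\sigma) = E\left[\mathbf{1}\{|X - \mu| \ge k\sigma\}\right]
\le E\left[\frac{(X - \mu)^2}{k^2\sigma^2}\right]
= \frac{\mathrm{Var}(X)}{k^2\sigma^2} = \frac{1}{k^2}.
$$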

Random vectors

  • Random vectors are simply random variables collected into a vector
    • For example, if X and Y are random variables, then (X, Y) is a random vector
  • Joint density f(x,y) satisfies f > 0 and ∫∫f(x,y)dxdy=1
  • For discrete random variables ∑∑f(x,y) = 1
  • In this lecture we focus on independent random variables where f(x,y) = f(x)g(y)

Independent events

  • Two events A and B are independent if P(A ∩ B) = P(A)P(B)
  • Two random variables, X and Y are independent if for any two sets A and B
      P([X∈A]∩[Y∈B]) = P(X∈A)P(Y∈B)
  • If A is independent of B then
    • Aᶜ is independent of B
    • A is independent of Bᶜ
    • Aᶜ is independent of Bᶜ

Useful fact

  • We will use the following fact extensively in this class:
    If a collection of random variables X1, X2, ..., Xn are independent, then their joint distribution is the product of their individual densities or mass functions
    That is, if fi is the density for random variable Xi, we have that
      f(x1, ..., xn) = f1(x1) × f2(x2) × ... × fn(xn) = ∏ fi(xi)

IID random variables

  • In the instance where f1 = f2 = ... = fn we say that the Xi are iid, for independent and identically distributed
  • iid random variables are the default model for random samples
  • Many of the important theories of statistics are founded on assuming that variables are iid

Example

  • Suppose that we flip a biased coin with success probability p a total of n times; what is the joint density of the collection of outcomes?
  • These random variables are iid with densities f(xi) = p^xi (1 - p)^(1 - xi)
  • Therefore
      f(x1, ..., xn) = ∏ p^xi (1 - p)^(1 - xi) = p^∑xi (1 - p)^(n - ∑xi)
    (see the numerical sketch below)
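A small numerical sketch (the outcome vector x and success probability p below are made up for illustration):

p <- .3
x <- c(1, 0, 1, 1, 0)
prod(p^x * (1 - p)^(1 - x))               # product of the individual densities
p^sum(x) * (1 - p)^(length(x) - sum(x))   # the same value via the sum of the x's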

Correlation

  • The covariance between two random variables X and Y is defined as
      Cov(X, Y) = E[(X - μx)(Y - μy)] = E[XY] - E[X]E[Y]
  • The following are useful facts about covariance
    1. Cov(X, Y) = Cov(Y, X)
    2. Cov(X, Y) can be negative or positive
  • The correlation between X and Y is
      Cor(X, Y) = Cov(X, Y) / sqrt(Var(X) Var(Y))
  • The following are useful facts about correlation
    1. Cor(X, Y) = ±1 if and only if X = a + bY for some constants a and b
    2. Cor(X, Y) is unitless
    3. X and Y are uncorrelated if Cor(X, Y) = 0
    4. X and Y are more positively correlated, the closer Cor(X, Y) is to 1
    5. X and Y are more negatively correlated, the closer Cor(X, Y) is to -1
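A quick simulated illustration of facts 1 and 3-5 (the numbers below are arbitrary):

set.seed(1)
x <- rnorm(100)
cor(x, 3 + 2 * x)    # 1: Y is an exact linear function of X with positive slope
cor(x, 3 - 2 * x)    # -1: exact linear function with negative slope
cor(x, rnorm(100))   # near 0 for independently generated data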

Some useful results

  • Let {Xi} be a collection of random variables
    • When the {Xi} are uncorrelated, Var(∑ ai Xi + b) = ∑ ai^2 Var(Xi)
    • Otherwise, Var(∑ ai Xi + b) = ∑ ai^2 Var(Xi) + 2 ∑∑(i < j) ai aj Cov(Xi, Xj)
    • If the Xi are iid with variance σ2, then Var(∑ Xi) = nσ2 and Var(X̄) = σ2/n

Conditional Probabilities and Densities

Conditional probability, motivation

  • The probability of a one when rolling a (standard) die is usually assumed to be one sixth
  • Suppose you were given the extra information that the die roll was an odd number (hence 1, 3, 5)
  • Conditional on this new information, the probability of a one is now one third

Conditional probability, definition

  • Let B be an event so that P(B) > 0
  • Then the conditional probability of an event A given that B has occurred is
      P(A | B) = P(A ∩ B) / P(B)
  • Notice that if A and B are independent, then
      P(A | B) = P(A)P(B) / P(B) = P(A)

Example

  • Consider our die roll example
  • B = {1, 3, 5}
  • A = {1}
  • P(A | B) = P(A ∩ B) / P(B) = P({1}) / P({1, 3, 5}) = (1/6) / (3/6) = 1/3

Conditional densities and mass functions

  • Conditional densities or mass functions describe the distribution of one variable conditional on the value of another
  • Let f(x, y) be a bivariate density or mass function for random variables X and Y
  • Let f(x) and f(y) be the associated marginal mass functions or densities, disregarding the other variable
  • Then the conditional density or mass function given that Y = y is
      f(x | y) = f(x, y) / f(y)

The Bernoulli distribution

  • The Bernoulli distribution arises as the result of a binary outcome
  • Bernoulli random variables take (only) the values 1 and 0, with probabilities of (say) p and 1 - p respectively
  • The PMF for a Bernoulli random variable X is
      P(X = x) = p^x (1 - p)^(1 - x)  for x = 0, 1
  • The mean of a Bernoulli random variable is p and the variance is p(1 - p)
  • If we let X be a Bernoulli random variable, it is typical to call X = 1 a "success" and X = 0 a "failure"

iid Bernoulli trials

  • If several iid Bernoulli observations, say x1, ..., xn, are observed, the likelihood is
      L(p) = ∏ p^xi (1 - p)^(1 - xi) = p^∑xi (1 - p)^(n - ∑xi)
  • Notice that the likelihood depends only on the sum of the xi
  • Because n is fixed and assumed known, this implies that the sample proportion
      p̂ = ∑ xi / n
    contains all of the relevant information about p
  • We can maximize the Bernoulli likelihood over p to obtain that the sample proportion p̂ is also the maximum likelihood estimator for p (see the sketch below)
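A minimal sketch of this likelihood as a function of p (the 0/1 data vector is made up; the vertical line marks the sample proportion):

x <- c(1, 1, 0, 1, 0, 1, 1, 0)
pGrid <- seq(.01, .99, by = .01)
lik <- pGrid^sum(x) * (1 - pGrid)^(length(x) - sum(x))
plot(pGrid, lik / max(lik), type = "l", xlab = "p", ylab = "scaled likelihood")
abline(v = mean(x))   # the sample proportion, which maximizes the likelihood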

Binomial trials

  • The binomial random variables are obtained as the sum of iid Bernoulli trials
  • Specifically, let X1, ..., Xn be iid Bernoulli(p); then X = ∑ Xi is a binomial random variable
  • The binomial mass function is
      P(X = x) = (n choose x) p^x (1 - p)^(n - x)
    for x = 0, ..., n
  • Recall that the notation
      (n choose x) = n! / (x! (n - x)!)
    (read "n choose x") counts the number of ways of selecting x items out of n without replacement, disregarding the order of the items

Examples

  • Suppose a friend has 8 children, 7 of which are girls and none are twins
  • If each gender has an independent 50% probability for each birth, what's the probability of getting 7 or more girls out of 8 births?
      P(X ≥ 7) = (8 choose 7) (1/2)^7 (1/2)^1 + (8 choose 8) (1/2)^8 ≈ .04
  • This calculation is an example of a P value: the probability, under a null hypothesis, of getting a result as extreme or more extreme than the one actually obtained (the R sketch below does the arithmetic)
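The same calculation in R, as a quick sketch (both lines compute P(X ≥ 7) for a Binomial(8, .5)):

choose(8, 7) * .5^7 * .5^1 + choose(8, 8) * .5^8    # about 0.035
pbinom(6, size = 8, prob = .5, lower.tail = FALSE)  # same tail probability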

The normal distribution

  • A random variable is said to follow a normal or Gaussian distribution with mean μ and variance σ2 if the associated density is
      f(x) = (2πσ2)^(-1/2) exp( -(x - μ)^2 / (2σ2) )
    If X is a RV with this density, then E[X] = μ and Var(X) = σ2
  • We write X ~ N(μ, σ2)
  • When μ = 0 and σ = 1 the resulting distribution is called the standard normal distribution
  • The standard normal density function is labeled φ
  • Standard normal RVs are often labeled Z

Facts about the normal density

  • If X ~ N(μ, σ2), then Z = (X - μ) / σ is standard normal
  • If Z is standard normal, then X = μ + σZ ~ N(μ, σ2)
  • The non-standard normal density is
      f(x) = φ((x - μ) / σ) / σ

More facts about the normal density

  • Approximately 68%, 95% and 99% of the normal density lies within 1, 2 and 3 standard deviations from the mean, respectively
  • -1.28, -1.645, -1.96 and -2.33 are the 10th, 5th, 2.5th and 1st percentiles of the standard normal distribution respectively
  • By symmetry, 1.28, 1.645, 1.96 and 2.33 are the 90th, 95th, 97.5th and 99th percentiles of the standard normal distribution respectively
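These quantiles can be verified with qnorm:

qnorm(c(.10, .05, .025, .01))   # about -1.28, -1.645, -1.96, -2.33
qnorm(c(.90, .95, .975, .99))   # the corresponding upper percentiles, by symmetry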

Question

  • What is the 95th percentile of a N(μ, σ2) distribution?
    • We want the point x0 so that P(X < x0) = .95
    • Therefore, P(X < x0) = P( (X - μ)/σ < (x0 - μ)/σ ) = P( Z < (x0 - μ)/σ ) = .95, so (x0 - μ)/σ = 1.645 and x0 = μ + 1.645σ
    • In general, x0 = μ + σ z0 where z0 is the appropriate standard normal quantile
  • What is the probability that a N(μ, σ2) RV is 2 standard deviations above the mean?
    • We want to know P(X > μ + 2σ) = P( (X - μ)/σ > 2 ) = P(Z > 2) ≈ 2.5%
  • If the Xi are iid N(μ, σ2) with known variance, what's the ML estimate of μ?
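A short sketch of the first two answers (the values of mu and sigma are made up for illustration):

mu <- 100; sigma <- 15
qnorm(.95, mean = mu, sd = sigma)   # 95th percentile of N(mu, sigma^2) directly
mu + sigma * qnorm(.95)             # the same point via the standard normal quantile
pnorm(2, lower.tail = FALSE)        # P(Z > 2), about 2.5%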

Other properties

  • The normal distribution is symmetric and peaked about its mean (therefore the mean, median and mode are all equal)
  • A constant times a normally distributed random variable is also normally distributed (what is the mean and variance?)
  • Sums of normally distributed random variables are again normally distributed even if the variables are dependent (what is the mean and variance?)
  • Sample means of normally distributed random variables are again normally distributed (with what mean and variance?)
  • The square of a standard normal random variable follows what is called the chi-squared distribution
  • The exponential of a normally distributed random variable follows what is called the log-normal distribution
  • As we will see later, many random variables, properly normalized, limit to a normal distribution

The Law of Large Numbers

  • Establishing that a random sequence converges to a limit is hard
  • Fortunately, we have a theorem that does all the work for us, called the Law of Large Numbers
  • The law of large numbers states that if X1, ..., Xn are iid from a population with mean μ and variance σ2, then the sample mean X̄n converges in probability to μ
  • (there are many variations on the LLN; we are using a particularly lazy one)
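A minimal simulation sketch of the LLN for coin flips with p = 0.5 (the running sample mean settles down to 0.5 as n grows):

set.seed(1)
n <- 10000
x <- rbinom(n, size = 1, prob = .5)
runningMean <- cumsum(x) / (1 : n)
plot(1 : n, runningMean, type = "l", log = "x", xlab = "n", ylab = "running sample mean")
abline(h = .5)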

Useful facts

  • Functions of convergent random sequences converge to the function evaluated at the limit
  • This includes sums, products, differences, ...
  • Example: (X̄n)^2 converges to μ^2

Confidence intervals

Confidence intervals

  • Previously, we discussed creating a confidence interval using the CLT
  • Now we discuss the creation of better confidence intervals for small samples using Gosset's t distribution
  • To discuss the t distribution we must discuss the Chi-squared distribution
  • Throughout we use the following general procedure for creating CIs
    • Create a Pivot or statistic that does not depend on the parameter of interest
    • Solve the probability that the pivot lies between bounds for the parameter

The Chi-squared distribution

  • Suppose that S2 is the sample variance from a collection of iid N(μ, σ2) data; then

      (n - 1)S2 / σ2 ~ χ2(n - 1)

    which reads: (n - 1)S2 / σ2 follows a Chi-squared distribution with n - 1 degrees of freedom
  • The Chi-squared distribution is skewed and has support on 0 to ∞
  • The mean of the Chi-squared is its degrees of freedom
  • The variance of the Chi-squared distribution is twice the degrees of freedom (see the simulation check below)
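A quick simulation check of the last two facts (the degrees of freedom are chosen arbitrarily):

set.seed(1)
x <- rchisq(100000, df = 10)
mean(x)   # close to 10, the degrees of freedom
var(x)    # close to 20, twice the degrees of freedom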

Confidence interval for the variance

  • Note that if χ2(n - 1, α) is the α quantile of the Chi-squared distribution with n - 1 degrees of freedom, then

      1 - α = P( χ2(n - 1, α/2) ≤ (n - 1)S2/σ2 ≤ χ2(n - 1, 1 - α/2) )

    So that

      [ (n - 1)S2 / χ2(n - 1, 1 - α/2) ,  (n - 1)S2 / χ2(n - 1, α/2) ]

    is a 100(1 - α)% confidence interval for σ2

Note about this interval

  • This interval relies heavily on the assumed normality
  • Square-rooting the endpoints yields a CI for σ
  • It turns out that

      (n - 1)S2 ~ Gamma( (n - 1)/2 , 2σ2 )

    which reads: (n - 1)S2 follows a gamma distribution with shape (n - 1)/2 and scale 2σ2
  • Therefore, this can be used to plot a likelihood function for σ2 (a sketch follows)
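A sketch of such a likelihood plot, reusing the sample size and standard deviation from the example that follows:

n <- 513
s2 <- 105.977^2
sigmaGrid <- seq(90, 125, length = 1000)   # candidate values of sigma
lik <- dgamma((n - 1) * s2, shape = (n - 1) / 2, scale = 2 * sigmaGrid^2)
plot(sigmaGrid, lik / max(lik), type = "l", xlab = "sigma", ylab = "scaled likelihood")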

Example

  • A recent study of 513 organo-lead manufacturing workers reported an average total brain volume of 1,150.315 cm3 with a standard deviation of 105.977 cm3.
    Assuming normality of the underlying measurements, calculate a confidence interval for the population variation in total brain volume.
#CI for the variance
s2 <- 105.977^2
n <- 513
alpha <- .05
qtiles <- qchisq(c(alpha/2, 1 - alpha/2), n-1)
ival <- rev((n-1) * s2 / qtiles)
#interval for the sd
sqrt(ival)
[1]  99.86484 112.98216