### poly
How does poly work?
```{r}
a <- 1:10

# Let's start easy
p <- poly(a, 3, raw=TRUE)
p

# This is easy to reproduce
data.frame('1'=a, '2'=a^2, '3'=a^3, check.names=FALSE)

# So what about:
p <- poly(a, 3, raw=FALSE)  # raw=FALSE is the default

# Can I reproduce this?
# First let's define a couple of functions to make this easier

# Vector length, like Octave's norm() function
o.norm <- function(v) sqrt(sum(v * v))

# Center, then scale to unit length
# (the sd() step is redundant: dividing by the norm afterwards
# rescales to unit length regardless)
v.normalize <- function(v) {
  v <- v - mean(v)
  v <- v / sd(v)
  v / o.norm(v)
}

a1 <- v.normalize(a)
# If I got it right, the next line should print: [1] TRUE
all(round(p[, 1], 4) == round(a1, 4))

# What about the higher degrees?
a2 <- v.normalize(a^2)
all(round(p[, 2], 4) == round(a2, 4))

# That's not right. Let's see what they look like:
plot(p[, 2], pch=19)
points(a2, pch=19, col='blue')
lines(p[, 2])
lines(a2, col='blue')
# I don't know how to make sense of this
```
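Following up on that last question: if I'm reading the source of stats::poly correctly, the orthogonal columns come from a QR decomposition of the matrix of raw powers (including the constant column), not from normalizing each power on its own. Here is a minimal sketch of that idea; qr() fixes each column only up to sign, so I compare absolute values:

```{r}
# Sketch: reproduce poly(a, 3) from a QR decomposition of the raw powers
X <- outer(a, 0:3, "^")   # columns: 1, a, a^2, a^3
Q <- qr.Q(qr(X))[, -1]    # orthonormal basis; drop the constant column
# Column signs from qr() are arbitrary, so compare magnitudes
all(round(abs(Q), 4) == round(abs(poly(a, 3)), 4))
```

In other words, each column is made orthogonal to all the lower-degree columns (and to the constant), then scaled to unit length. That explains why v.normalize worked for degree 1 but not degree 2: centering already makes a orthogonal to the constant, but a^2 also needs its projection onto the degree-1 column removed, which v.normalize never does.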
I think what we're doing is creating a higher-order polynomial fit by adding features (columns) to the data set and then fitting a linear model to the expanded data set, so the raw=TRUE format makes sense to me.
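A quick check of that reading (with a made-up response y, purely for illustration): fitting against raw and orthogonal polynomials should give identical fitted values; only the coefficient parameterization differs.

```{r}
set.seed(1)
y <- a^3 - 2 * a + rnorm(10)   # made-up response, for illustration only
fit.raw  <- lm(y ~ poly(a, 3, raw=TRUE))
fit.orth <- lm(y ~ poly(a, 3))
all.equal(fitted(fit.raw), fitted(fit.orth))  # should be TRUE
```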
Reading through ?poly and other resources, the raw=FALSE option creates orthogonal polynomials. This is meant to reduce multicollinearity:
From Wikipedia (http://en.wikipedia.org/wiki/Multicollinearity):

> Multicollinearity is a statistical phenomenon in which two or more predictor variables in a multiple regression model are highly correlated, meaning that one can be linearly predicted from the others with a non-trivial degree of accuracy. In this situation the coefficient estimates of the multiple regression may change erratically in response to small changes in the model or the data. Multicollinearity does not reduce the predictive power or reliability of the model as a whole, at least within the sample data themselves; it only affects calculations regarding individual predictors. That is, a multiple regression model with correlated predictors can indicate how well the entire bundle of predictors predicts the outcome variable, but it may not give valid results about any individual predictor, or about which predictors are redundant with respect to others.
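That is easy to see here: the raw powers of a are nearly perfectly correlated with one another, while the orthogonal columns are uncorrelated by construction.

```{r}
round(cor(poly(a, 3, raw=TRUE)), 3)  # raw powers: correlations close to 1
round(cor(poly(a, 3)), 3)            # orthogonal columns: identity matrix
```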
But looking at the graph, it still feels like we're just making up data points. What do these transformed columns actually represent?