gavinwhyte · June 13, 2017 04:49
diff --git a/pcainr.txt b/pcainr.txt
 Principal component analysis (PCA) is a dimensionality reduction technique that is widely used in data analysis. 

 Reducing the dimensionality of a dataset is useful in several ways. For example, our ability to visualize data is limited to 2 or 3 dimensions.

 Working in fewer dimensions can also significantly reduce the computation time of some numerical algorithms.

 In addition, many statistical models suffer from high correlation between covariates, and PCA can be used to produce linear combinations of the covariates that are uncorrelated with one another.


 Computing PCA

 I will use the classic iris dataset for the demonstration.
 The data contain four continuous variables, which correspond to physical measurements of flowers, and a
 categorical variable describing the flowers' species.

 # Load data
 > data(iris)
 > head(iris, 3)
 
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
 1          5.1         3.5          1.4         0.2  setosa
 2          4.9         3.0          1.4         0.2  setosa
 3          4.7         3.2          1.3         0.2  setosa


 We will apply PCA to the four continuous variables and use the categorical variable to visualize the PCs later.
 Notice that in the following code we apply a log transformation to the continuous variables
 and set center and scale. equal to TRUE in the call to prcomp,
 which standardizes the variables prior to the application of PCA:


 # log transform 
 > log.ir <- log(iris[, 1:4])
 > ir.species <- iris[, 5]
 
 # apply PCA - scale. = TRUE is highly 
 # advisable, but default is FALSE. 
 > ir.pca <- prcomp(log.ir,
                 center = TRUE,
                 scale. = TRUE) 
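
 As a rough check of what center and scale. do, the sketch below compares the call above with a PCA run on data standardized by hand using scale(). The names pca.internal, pca.manual and pca.unscaled are illustrative only and do not appear elsewhere in this post.

```r
# Rebuilt here so the snippet runs on its own
log.ir <- log(iris[, 1:4])

# Standardization done inside prcomp, as in the call above
pca.internal <- prcomp(log.ir, center = TRUE, scale. = TRUE)

# Standardization done by hand with scale(); prcomp then only re-centers,
# which changes nothing because the columns already have mean zero
pca.manual <- prcomp(scale(log.ir), center = TRUE, scale. = FALSE)

# Both routes should give the same standard deviations
print(all.equal(pca.internal$sdev, pca.manual$sdev))

# For comparison: without scaling, the variables with the largest
# variance carry more weight in the leading components
pca.unscaled <- prcomp(log.ir, center = TRUE, scale. = FALSE)
print(round(pca.unscaled$sdev^2 / sum(pca.unscaled$sdev^2), 3))
```

 With standardized variables, the component variances sum to the number of variables (four here), which is a quick sanity check on any scaled PCA.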


 Since skewness and the magnitude of the variables influence the resulting PCs,
 it is good practice to apply a skewness transformation
 and to center and scale the variables prior to the application of PCA.
 In the example above, we applied a log transformation to the variables, but we could
 have been more general and applied a Box-Cox transformation.
 The end of this post shows how to perform all of these transformations and then apply
 PCA with a single call to the preProcess function of the caret package.

 Analyzing the results

 The prcomp function returns an object of class prcomp, which has several methods available.
 The print method shows the standard deviation of each of the four PCs and their rotation (or loadings),
 which are the coefficients of the linear combinations of the continuous variables.

 # print method
 > print(ir.pca)
 
 Standard deviations:
 [1] 1.7124583 0.9523797 0.3647029 0.1656840
 
 Rotation:
                    PC1         PC2        PC3         PC4
 Sepal.Length  0.5038236 -0.45499872  0.7088547  0.19147575
 Sepal.Width  -0.3023682 -0.88914419 -0.3311628 -0.09125405
 Petal.Length  0.5767881 -0.03378802 -0.2192793 -0.78618732
 Petal.Width   0.5674952 -0.03545628 -0.5829003  0.58044745
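
 To make the role of the rotation matrix concrete, here is a minimal sketch that rebuilds the PC scores by multiplying the standardized data by the loadings, and checks that the resulting components are uncorrelated. The variable scores is an illustrative name; the rest reuses the objects defined above.

```r
# Rebuilt here so the snippet runs on its own
log.ir <- log(iris[, 1:4])
ir.pca <- prcomp(log.ir, center = TRUE, scale. = TRUE)

# The loading vectors are orthonormal, so their cross-product
# should be the 4 x 4 identity matrix (up to rounding error)
print(round(crossprod(ir.pca$rotation), 10))

# Each PC score is a linear combination of the standardized variables,
# with the coefficients taken from the matching column of the rotation
scores <- scale(log.ir) %*% ir.pca$rotation
print(all.equal(scores, ir.pca$x, check.attributes = FALSE))

# The PCs are uncorrelated with one another
print(round(cor(ir.pca$x), 10))
```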


 The summary method describes the importance of the PCs.
 The first row again gives the standard deviation associated with each PC.
 The second row shows the proportion of the variance in the data explained by each component,
 while the third row gives the cumulative proportion of explained variance.
 We can see that the first two PCs account for more than 95% of the variance of the data.

 # summary method

 > summary(ir.pca)
 
 Importance of components:
                          PC1    PC2     PC3     PC4
 Standard deviation     1.7125 0.9524 0.36470 0.16568
 Proportion of Variance 0.7331 0.2268 0.03325 0.00686
 Cumulative Proportion  0.7331 0.9599 0.99314 1.00000
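
 The rows of this table can be derived directly from the standard deviations stored in the fitted object. The sketch below recomputes them; the names pc.var, prop.var, cum.var and n.keep are illustrative and not part of prcomp.

```r
# Rebuilt here so the snippet runs on its own
log.ir <- log(iris[, 1:4])
ir.pca <- prcomp(log.ir, center = TRUE, scale. = TRUE)

# Variance of each PC is the square of its standard deviation
pc.var <- ir.pca$sdev^2

# Proportion of total variance explained by each PC, and its running total
prop.var <- pc.var / sum(pc.var)
cum.var <- cumsum(prop.var)
print(round(rbind(prop.var, cum.var), 5))

# Smallest number of PCs whose cumulative proportion reaches 95%
n.keep <- which(cum.var >= 0.95)[1]
print(n.keep)
```

 The 95% threshold is only an example; the appropriate cutoff depends on the application.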

 We can use the predict function if we observe new data and want to compute their PC values.
 Just for illustration, pretend the last two rows of the iris data
 have just arrived and we want to see what their PC values are:

 # Predict PCs
 > predict(ir.pca, 
        newdata=tail(log.ir, 2))
 
          PC1         PC2        PC3         PC4
 149 1.0809930 -1.01155751 -0.7082289 -0.06811063
 150 0.9712116 -0.06158655 -0.5008674 -0.12411524
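
 For a prcomp object, predict amounts to standardizing the new rows with the center and scale learned from the original data and then multiplying by the rotation matrix. The sketch below reproduces that by hand; new.obs, new.std and manual are illustrative names.

```r
# Rebuilt here so the snippet runs on its own
log.ir <- log(iris[, 1:4])
ir.pca <- prcomp(log.ir, center = TRUE, scale. = TRUE)

# Pretend these two rows are newly observed data
new.obs <- tail(log.ir, 2)

# Standardize with the center and scale stored in the fitted object,
# not with statistics computed from the new rows themselves
new.std <- scale(new.obs, center = ir.pca$center, scale = ir.pca$scale)

# Project onto the principal components
manual <- new.std %*% ir.pca$rotation

# Should agree with the output of predict()
print(all.equal(manual, predict(ir.pca, newdata = new.obs),
                check.attributes = FALSE))
```

 Note that new data must go through the same log transformation as the original data before being passed to predict.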