tim-fan · March 4, 2018 07:25
diff --git a/regressionModelsProjectReport.Rmd b/regressionModelsProjectReport.Rmd
 ---
 title: 'Regression Models Project: mtcars'
 author: "Tim F"
 date: "4 March 2018"
 output: pdf_document
 ---

 ```{r setup, include=FALSE}
 knitr::opts_chunk$set(echo = TRUE)
 knitr::opts_chunk$set()
 ```

 ## Executive summary
 This analysis seeks to answer the following two questions:

 1. Is an automatic or manual transmission better for fuel economy (measured in miles per gallon)?
 2. What is the miles-per-gallon difference between automatic and manual transmissions?

 The questions are addressed via the application of linear regression over the mtcars[^1] dataset.
 The analysis concludes that, after accounting for the weight of the car, the dataset does not show any significant
 influence of transmission type on fuel economy. Hence the answers to the above questions, with reference to the fitted model, are:

 1. Neither transmission is seen to be significantly better for fuel economy in the mtcars dataset
 2. The estimated miles-per-gallon difference between automatic and manual transmissions is not statistically distinguishable from zero once weight is accounted for (the null hypothesis of no difference is not rejected)

 The following sections outline the analysis conducted to reach these conclusions.

 ## Exploratory Analysis
 To gain an appreciation for the variables present in the mtcars dataset, their pairwise relations are plotted as follows:
 ```{r explore, eval=FALSE}
 library(GGally)
 ggpairs(mtcars)
 ```
 See appendix 1 for the resultant figure.

 From the generated pairwise plots, we see that transmission (am) is correlated with fuel economy (mpg) in this dataset.
 Of all variables, weight (wt) shows the highest correlation with fuel economy.
 As described in [^1], it is expected from physical principles that weight should be proportional to 
 gallons-per-mile, so inversely proportional to miles-per-gallon. This is supported by the strong 
 negative correlation between mpg and wt.
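
 As a quick numeric check on the pairwise plots, the following sketch ranks the other variables by the absolute value of their correlation with mpg. The object name `mpg_cor` is introduced here for illustration only.

 ```{r corCheck}
 # Correlation of every variable with mpg (the mpg row of the correlation matrix)
 mpg_cor <- cor(mtcars)["mpg", ]
 # Drop the trivial self-correlation and sort by absolute strength
 mpg_cor <- mpg_cor[names(mpg_cor) != "mpg"]
 mpg_cor <- mpg_cor[order(abs(mpg_cor), decreasing = TRUE)]
 round(mpg_cor, 2)
 ```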

 [^1]: Henderson and Velleman (1981), Building multiple regression models interactively. Biometrics, 37, 391–411.

 ## Naive Model
 As a first step, the following simple model is fitted between the two variables of interest:
 ```{r naive}
 fit1 <- lm(mpg ~ am, mtcars)
 summary(fit1)$coefficients
 ```
 The resulting model shows automatic cars (`am = 0`) have average fuel economy of `r round((coef(fit1)[1]),2)` mpg
 while manual cars (`am = 1`) have average fuel economy of `r round(coef(fit1)[1] + coef(fit1)[2],2)` mpg.
 This suggests manual cars travel `r round(coef(fit1)[2],2)` miles further per gallon of fuel.
 This difference in fuel economy is statistically significant (p = 0.0003). A full summary of the model is provided in appendix 2.
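
 Because am is a 0/1 indicator, the fitted coefficients can be cross-checked against the raw group means. The sketch below assumes the coding given in the mtcars help page (0 = automatic, 1 = manual); the names `group_means` and `fit_check` are illustrative and not part of the original analysis.

 ```{r groupMeans}
 # Mean mpg per transmission group; am is coded 0 = automatic, 1 = manual
 group_means <- tapply(mtcars$mpg, mtcars$am, mean)
 names(group_means) <- c("automatic", "manual")
 fit_check <- lm(mpg ~ am, data = mtcars)
 # The intercept is the am = 0 mean and the slope is the difference in means
 rbind(
   from_means = c(intercept = unname(group_means["automatic"]),
                  am = unname(diff(group_means))),
   from_model = coef(fit_check)
 )
 ```

 The two rows should agree, which confirms which group the intercept describes.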

 ## Accounting for Weight 
 From the coursework we know that omitting a variable which affects the outcome and is correlated with the included regressors leads
 to bias in the fitted coefficients.
 From the exploratory analysis we know that fuel economy is strongly correlated with weight, and that weight
 is correlated with transmission type. Has the omission of weight in the simple model led to a bias in the estimated coefficients?
 To address this possibility, a model is fitted which includes weight as a regressor:
 ```{r weightMod}
 fit2 <- lm(mpg ~ wt + am, mtcars)
 summary(fit2)$coefficients
 ```
 The model including weight appears to fit the data much better (R-squared = `r round(summary(fit2)$r.squared,2)` vs `r round(summary(fit1)$r.squared,2)` for the am-only model).
 Furthermore, now that weight is accounted for, the effect of transmission on fuel economy seems to have disappeared.
 The am variable is given a small negative coefficient in the model (`r round(fit2$coefficients[3],2)`), which is consistent with the
 null hypothesis that transmission type has no effect on fuel economy (p = `r round(summary(fit2)$coefficients[3,4],2)`).
 Hence the analysis provides no evidence that transmission type affects fuel economy once weight is accounted for.
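
 The shift in the am coefficient between the two models can be reproduced with the standard omitted-variable decomposition for least squares: the am-only coefficient equals the weight-adjusted coefficient plus the wt coefficient times the slope from regressing wt on am. The sketch below refits the models under illustrative names (`short`, `long`, `aux`) so that it runs on its own.

 ```{r ovbCheck}
 short <- lm(mpg ~ am, data = mtcars)       # omits weight
 long  <- lm(mpg ~ wt + am, data = mtcars)  # includes weight
 aux   <- lm(wt ~ am, data = mtcars)        # how weight differs by transmission
 # Part of the am-only coefficient that is carried by weight
 bias <- unname(coef(long)["wt"] * coef(aux)["am"])
 c(am_only     = unname(coef(short)["am"]),
   am_adjusted = unname(coef(long)["am"]),
   via_weight  = bias)
 ```

 Here `am_only` should equal `am_adjusted + via_weight`, which shows how much of the naive difference is attributable to the manual cars in this dataset being lighter.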

 ## Model Validation
 To investigate the validity of the fitted model, the residuals are plotted:
 ```{r residuals, eval=FALSE}
 par(mfrow = c(2, 2))
 plot(fit2)
 ```
 See appendix 4 for the resultant figure.

 There do not appear to be any major issues in the residual plots: there are no clear
 systematic patterns in the residuals vs. fitted plot, and the Q-Q plot shows the residuals to be approximately
 normally distributed.

 The plots do identify a few outlier points which are not well fitted by the model. The Chrysler Imperial is of particular 
 concern due to relatively high leverage. Future work could focus on understanding and/or mitigating the effect of 
 these outliers.

 Overall, the residual plots do not suggest any major issues with the chosen model.
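
 The leverage and influence of individual cars can also be inspected numerically rather than read off the plots. The following is a minimal sketch, not part of the original analysis: it lists the cars with the largest hat values and Cook's distances, then refits the model without the Chrysler Imperial to see how far the coefficients move. The names `fit_all`, `fit_drop` and `keep` are illustrative.

 ```{r influenceCheck}
 fit_all <- lm(mpg ~ wt + am, data = mtcars)
 # Hat values measure leverage; Cook's distance combines leverage and residual size
 head(sort(hatvalues(fit_all), decreasing = TRUE), 3)
 head(sort(cooks.distance(fit_all), decreasing = TRUE), 3)
 # Refit without the Chrysler Imperial and compare coefficients side by side
 keep <- rownames(mtcars) != "Chrysler Imperial"
 fit_drop <- lm(mpg ~ wt + am, data = mtcars[keep, ])
 rbind(all_cars = coef(fit_all), without_imperial = coef(fit_drop))
 ```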

 ## Conclusion
 The above analysis of the mtcars dataset suggests that after accounting for vehicle weight, transmission type shows no significant effect on fuel economy.
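
 To attach a number to this conclusion, a 95% confidence interval for the am coefficient in the weight-adjusted model can be computed. This is a sketch under the same model as above, refitted here under the illustrative name `fit_wt` so that the chunk is self-contained.

 ```{r amInterval}
 fit_wt <- lm(mpg ~ wt + am, data = mtcars)
 # 95% confidence interval for the manual-vs-automatic difference at fixed weight
 am_ci <- confint(fit_wt, "am", level = 0.95)
 am_ci
 ```

 An interval that spans zero is consistent with the non-significant p-value reported for am.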

 ## Appendix
 ### Appendix 1: mtcars pairwise relations
 ```{r apx1, echo=FALSE, cache=TRUE}
 library(GGally)
 ggpairs(mtcars)
 ```

 ### Appendix 2: summary of model fit1
 ```{r apx2, echo=FALSE}
 summary(fit1)

 ```

 ### Appendix 3: summary of model fit2
 ```{r apx3, echo=FALSE}
 summary(fit2)

 ```

 ### Appendix 4: fit2 residual plots
 ```{r apx4, echo=FALSE}
 par(mfrow = c(2, 2))
 plot(fit2)
 ```