Project report for Coursera course 'Regression Models', March 2018
---
title: 'Regression Models Project: mtcars'
author: "Tim F"
date: "4 March 2018"
output: pdf_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## Executive summary

This analysis seeks to answer the following two questions:

1. Is an automatic or manual transmission better for fuel economy (measured in miles per gallon)?
2. What is the miles-per-gallon difference between automatic and manual transmissions?

The questions are addressed by applying linear regression to the mtcars[^1] dataset.

The analysis concludes that, after accounting for the weight of the car, the dataset does not show any significant
influence of transmission type on fuel economy. Hence the answers to the above questions, with reference to the fitted model, are:

1. Neither transmission is seen to be significantly better for fuel economy in the mtcars dataset.
2. The miles-per-gallon difference between automatic and manual transmissions is not statistically distinguishable from zero (the null hypothesis of no difference is not rejected).

The following sections outline the analysis conducted to reach these conclusions.

## Exploratory Analysis

To gain an appreciation for the variables present in the mtcars dataset, their pairwise relations are plotted as follows:

```{r explore, eval=FALSE}
library(GGally)
ggpairs(mtcars)
```

See appendix 1 for the resultant figure.

From the generated pairwise plots, we see that transmission (am) is correlated with fuel economy (mpg) in this dataset.
Of all variables, weight (wt) shows the strongest correlation with fuel economy.

As described by Henderson and Velleman[^1], it is expected from physical principles that weight should be proportional to
gallons-per-mile, and so inversely proportional to miles-per-gallon. This is supported by the strong
negative correlation between mpg and wt.
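
The correlations read off the pairwise plot can also be checked numerically. The chunk below is a minimal sketch rather than part of the original analysis; the names `mpg_cor`, `mpg_cor_sorted` and `gpm` are introduced here purely for illustration. It ranks the other variables by the absolute value of their correlation with mpg, then correlates weight with gallons-per-mile (taken as 1/mpg) to illustrate the physical argument.

```{r mpgCorrelations}
# Correlation of every other mtcars variable with mpg
mpg_cor <- cor(mtcars)["mpg", ]
mpg_cor <- mpg_cor[names(mpg_cor) != "mpg"]

# Sort by absolute correlation, strongest first
mpg_cor_sorted <- mpg_cor[order(abs(mpg_cor), decreasing = TRUE)]
round(mpg_cor_sorted, 2)

# Physical argument: weight should be roughly proportional to gallons per mile
gpm <- 1 / mtcars$mpg
cor(mtcars$wt, gpm)
```

If the physical argument holds, the correlation between wt and gallons-per-mile should be positive and at least as strong in magnitude as the correlation between wt and mpg.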

[^1]: Henderson and Velleman (1981), Building multiple regression models interactively. Biometrics, 37, 391–411.

## Naive Model

As a first step, the following simple model is fitted between the two variables of interest:

```{r naive}
fit1 <- lm(mpg ~ am, mtcars)
summary(fit1)$coefficients
```

In mtcars, am is coded 0 for automatic and 1 for manual, so the intercept is the mean fuel economy of automatic cars.
The resulting model shows automatic cars have average fuel economy of `r round(coef(fit1)[1], 2)` mpg
while manual cars have average fuel economy of `r round(coef(fit1)[1] + coef(fit1)[2], 2)` mpg.
This suggests manual cars travel `r round(coef(fit1)[2], 2)` miles further per gallon of fuel.
This difference in fuel economy is statistically significant (p = 0.0003). A full summary of the model is provided in appendix 2.
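
Because am is a binary regressor, the coefficients of this model are determined by the two group means. The following chunk is a sketch of that equivalence, assuming the coding documented in `?mtcars` (0 = automatic, 1 = manual); the names `transmission`, `group_means`, `naive_fit` and `pooled` are illustrative and not part of the original analysis.

```{r groupMeans}
# am is coded 0 = automatic, 1 = manual (see ?mtcars)
transmission <- factor(mtcars$am, levels = c(0, 1),
                       labels = c("automatic", "manual"))
group_means <- tapply(mtcars$mpg, transmission, mean)
group_means

naive_fit <- lm(mpg ~ am, data = mtcars)

# Intercept equals the automatic mean; intercept + slope equals the manual mean
all.equal(unname(coef(naive_fit)[1]), unname(group_means["automatic"]))
all.equal(unname(sum(coef(naive_fit))), unname(group_means["manual"]))

# A pooled-variance two-sample t-test gives the same p-value as the am coefficient
pooled <- t.test(mpg ~ am, data = mtcars, var.equal = TRUE)
all.equal(pooled$p.value, summary(naive_fit)$coefficients["am", "Pr(>|t|)"])
```

Labelling the factor levels explicitly is a small safeguard against reading the intercept as the wrong group.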

## Accounting for Weight

From the coursework we know that omitting a variable which affects the outcome and is correlated with the included variables leads
to bias in the fitted model.

From the exploratory analysis we know that fuel economy is strongly correlated with weight, and that weight
is correlated with transmission type. Has the omission of weight in the simple model led to a bias in the estimated coefficients?

To address this possibility, a model is fitted which includes weight as a regressor:

```{r weightMod}
fit2 <- lm(mpg ~ wt + am, mtcars)
summary(fit2)$coefficients
```

The model including weight appears to fit the data much better (R-squared = `r round(summary(fit2)$r.squared,2)` vs `r round(summary(fit1)$r.squared,2)` for the am-only model).
Furthermore, now that weight is accounted for, the effect of transmission on fuel economy seems to have disappeared.
The am variable is given a small negative coefficient in the model (`r round(fit2$coefficients[3],2)`), which is easily explained by the
null hypothesis that transmission type has no effect on fuel economy (p = `r round(summary(fit2)$coefficients[3,4],2)`).
A full summary of the model is provided in appendix 3.

Hence the analysis suggests that, once weight is accounted for, transmission type has no detectable effect on fuel economy in this dataset.
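
The omitted-variable argument in this section can be checked numerically. For least-squares fits, the am coefficient of the am-only model equals the am coefficient of the model with weight plus the weight coefficient multiplied by the slope from regressing wt on am. The chunk below is a sketch of that decomposition; `short_fit`, `long_fit`, `aux_fit` and `bias` are illustrative names and not part of the original analysis.

```{r omittedVariable}
short_fit <- lm(mpg ~ am, data = mtcars)       # weight omitted
long_fit  <- lm(mpg ~ wt + am, data = mtcars)  # weight included
aux_fit   <- lm(wt ~ am, data = mtcars)        # how weight differs by transmission

# Bias term: (effect of wt on mpg) x (difference in wt between transmissions)
bias <- coef(long_fit)["wt"] * coef(aux_fit)["am"]

c(short = unname(coef(short_fit)["am"]),
  long  = unname(coef(long_fit)["am"]),
  bias  = unname(bias))

# The decomposition holds exactly for ordinary least squares
all.equal(unname(coef(short_fit)["am"]),
          unname(coef(long_fit)["am"] + bias))
```

Read this way, nearly all of the transmission effect in the am-only model is attributable to the difference in weight between the two groups of cars.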

## Model Validation

To investigate the validity of the fitted model, the residuals are plotted:

```{r residuals, eval=FALSE}
par(mfrow = c(2, 2))
plot(fit2)
```

See appendix 4 for the resultant figure.

There do not appear to be any major issues in the residual plots: there are no clear
systematic patterns in the residuals vs. fitted plot, and the Q-Q plot shows the residuals to be approximately
normally distributed.

The plots do identify a few outlier points which are not well fitted by the model. The Chrysler Imperial is of particular
concern due to its relatively high leverage. Future work could focus on understanding and/or mitigating the effect of
these outliers.

Overall the residual plots do not suggest any major issues with the chosen model.
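
As a possible starting point for the outlier follow-up, the influence of individual cars can be tabulated rather than read off the plots. The chunk below is a sketch under the assumption that leverage, Cook's distance and standardised residuals are the diagnostics of interest; `wt_am_fit`, `influence_tbl` and `refit` are illustrative names and not part of the original analysis.

```{r influenceCheck}
wt_am_fit <- lm(mpg ~ wt + am, data = mtcars)

# Per-car influence diagnostics; row names are the car names
influence_tbl <- data.frame(
  leverage  = hatvalues(wt_am_fit),
  cooks_d   = cooks.distance(wt_am_fit),
  std_resid = rstandard(wt_am_fit)
)

# The five cars with the largest Cook's distance
head(influence_tbl[order(influence_tbl$cooks_d, decreasing = TRUE), ], 5)

# Refit without the Chrysler Imperial and compare the coefficients
refit <- update(wt_am_fit,
                data = mtcars[rownames(mtcars) != "Chrysler Imperial", ])
cbind(all_cars = coef(wt_am_fit), without_imperial = coef(refit))
```

If the am coefficient stays close to zero after dropping the car, the conclusion does not hinge on that single observation.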

## Conclusion

The above analysis of the mtcars dataset suggests that after accounting for vehicle weight, transmission type shows no significant effect on fuel economy.
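
One way to attach a number to this conclusion is a confidence interval for the am coefficient in the model that includes weight. The chunk below is a minimal sketch, assuming a conventional 95% level; `fit_wt_am` and `am_ci` are illustrative names and not part of the original analysis.

```{r amInterval}
fit_wt_am <- lm(mpg ~ wt + am, data = mtcars)

# 95% confidence interval for the manual-vs-automatic difference in mpg,
# holding weight fixed
am_ci <- confint(fit_wt_am, "am", level = 0.95)
am_ci
```

An interval that spans zero is consistent with the finding that no transmission effect is detected, while its width indicates how large an effect the dataset could still be compatible with.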

## Appendix

### Appendix 1: mtcars pairwise relations

```{r apx1, echo=FALSE, cache=TRUE}
library(GGally)
ggpairs(mtcars)
```

### Appendix 2: summary of model fit1

```{r apx2, echo=FALSE}
summary(fit1)
```

### Appendix 3: summary of model fit2

```{r apx3, echo=FALSE}
summary(fit2)
```

### Appendix 4: residual plots for model fit2

```{r apx4, echo=FALSE}
par(mfrow = c(2, 2))
plot(fit2)
```