Using the correlation matrix and heatmap, we find that:
- 'MedInc' (median income) has the strongest positive correlation with the target variable ('MedHouseVal').
- 'AveRooms' and 'HouseAge' have moderate correlations with the target.
- 'AveOccup' and 'Population' are only weakly correlated with the target.
- From the scatter plot of 'MedInc' vs 'MedHouseVal', we see a clear positive trend.
- Higher median income is associated with higher median house values.
- The relationship appears nonlinear at the high end (suggesting diminishing returns).
- The residuals show some heteroscedasticity: the spread of residuals increases with the predicted value.
- This violates the assumption of constant variance (homoscedasticity).
- This suggests transforming the response (e.g., with a log transform) or using robust regression methods.
- Linearity: Mostly holds, especially for 'MedInc', but some relationships appear nonlinear.
- Homoscedasticity: The residual plot shows variance increasing with the predicted value, so this assumption is violated.
- Multicollinearity: Correlation heatmap suggests moderate multicollinearity between some predictors (e.g., AveRooms and AveOccup).
- Normality of residuals: The histogram shows residuals are roughly normal but with some skew.
- Outliers: A few points with large residuals suggest outliers.
- Engineer features: add polynomial or interaction terms to capture the nonlinear relationships.
- Use regularized regression (e.g., Ridge or Lasso) to reduce the impact of multicollinearity.
- Apply a log transformation to skewed predictors or to the response variable.
- Use tree-based models or nonlinear regression models.
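
The heteroscedasticity finding and the log-transform remedy can be illustrated with a small sketch. It does not use the original dataset: the `income` and `value` arrays are synthetic stand-ins, and the `fit_ols` and `spread_trend` helpers are hypothetical names introduced here. The correlation between absolute residuals and fitted values serves as a rough numeric counterpart to reading the residual plot; it is not a formal test.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (an assumption, not the original dataset):
# a positive response with multiplicative noise, so its spread grows with its mean.
n = 2000
income = rng.uniform(1.0, 10.0, size=n)
value = np.exp(0.15 * income + rng.normal(0.0, 0.3, size=n))


def fit_ols(x, y):
    """Fit y = b0 + b1 * x by least squares; return fitted values and residuals."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta
    return fitted, y - fitted


def spread_trend(fitted, resid):
    """Correlation between |residual| and fitted value.

    Values well above 0 mean the residual spread grows with the prediction.
    """
    return np.corrcoef(fitted, np.abs(resid))[0, 1]


raw_trend = spread_trend(*fit_ols(income, value))
log_trend = spread_trend(*fit_ols(income, np.log(value)))

print(f"raw response: corr(|resid|, fitted) = {raw_trend:.2f}")
print(f"log response: corr(|resid|, fitted) = {log_trend:.2f}")
```

On data like this, the trend is clearly positive for the raw response and close to zero after the log transform. Whether the same holds for the real data would need to be checked against its own residual plot.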
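
A correlation heatmap only shows pairwise relationships, so one common complement for the multicollinearity check is the variance inflation factor (VIF). The sketch below is a minimal NumPy version under stated assumptions: the `vif` helper is a hypothetical name, and the three columns are synthetic rather than the actual predictors.

```python
import numpy as np


def vif(X):
    """Variance inflation factor for each column of X (n_samples, n_features).

    VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing column j
    on all other columns plus an intercept.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1.0 - resid.var() / y.var()
        out[j] = 1.0 / (1.0 - r2)
    return out


# Synthetic predictors (assumption): b is built from a, c is independent.
rng = np.random.default_rng(0)
n = 1000
a = rng.normal(size=n)
b = 0.9 * a + 0.3 * rng.normal(size=n)
c = rng.normal(size=n)

vifs = vif(np.column_stack([a, b, c]))
for name, v in zip(["a", "b", "c"], vifs):
    print(f"VIF({name}) = {v:.1f}")
```

A VIF near 1 indicates little collinearity. Values above roughly 5 to 10 are often treated as a warning sign, although that threshold is a rule of thumb rather than a strict cutoff.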