---
title: About Statistical Learning
tags: Machine Learning, ISL
---

About Statistical Learning

Refs: Chapter 2 - Introduction to Statistical Learning (ISL)

By convention:

  • X: predictors, features, input variable, independent variable or sometimes just variable.
  • y: label, response or dependent variable.

Assuming that there is some relationship between $Y$ and $X = (X_1, X_2, ..., X_n)$, we can write it in the very general form: $$ Y = f(X) + e $$

  • Y: Outcome
  • $f$: Unknown function of $X$, representing the systematic information that $X$ provides about $Y$.
  • X: Predictors
  • e: Error term, which has mean zero and is independent of the predictors $X$

Conclusion: In essence, Statistical Learning refers to a set of approaches for estimating $f$.

There are two main reasons to estimate $f$:

  1. Prediction: We only care about the predictions that $f$ returns. There are 2 types of error in Prediction:
    • Reducible error
    • Irreducible error
  2. Inference: We focus on the relationship between $X$ and $Y$, that is, the way $Y$ is affected when we change the predictors $X_1, X_2, ..., X_n$.

Prediction

Since the error term averages to zero, we can predict $Y$ using: $$ Y' = f'(X) $$

  • $f'$: represents the estimate of $f$, often treated as a black box, which means one is not concerned with the exact form of $f'$, as long as it yields accurate predictions of $Y$.
  • $Y'$: represents the resulting prediction of $Y$

The accuracy of $Y'$ as a prediction of $Y$ depends on two quantities, called the reducible error and the irreducible error. In most cases, $f'$ will not be a perfect estimate of $f$, and this inaccuracy will introduce some error. This error is reducible because we can potentially improve the accuracy of $f'$ by using a more appropriate statistical learning technique. However, even if we could obtain a perfect estimate of $f$, so that $Y' = f(X)$, our prediction would still have some error in it. This is because $Y$ is also a function of the error $e$, which is independent of $X$. This error is called the irreducible error, because no matter how well we estimate $f$, we cannot reduce the error introduced by $e$.
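This decomposition can be made precise (cf. ISL Eq. 2.3, taking $f'$ and $X$ as fixed): the expected squared prediction error splits into a reducible and an irreducible part:

$$ E[(Y - Y')^2] = \underbrace{[f(X) - f'(X)]^2}_{\text{reducible}} + \underbrace{Var(e)}_{\text{irreducible}} $$

The first term can be driven toward zero by estimating $f$ better; the second term, $Var(e)$, is a floor that no model can go below.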

Why is there an Irreducible Error?

This is because the quantity $e$ may contain unmeasured variables that are actually useful for predicting $Y$. However, since they are unmeasured, $f$ cannot use them for making predictions. For example, consider the problem of predicting a patient's risk of a severe adverse reaction to a particular drug. In such a case, the predictors $X = (X_1, X_2, ..., X_n)$ are characteristics of the patient, such as blood type, height, weight, etc. Additionally, the risk of an adverse reaction might vary for a given patient on a given day, depending on manufacturing variation in the particular drug or the patient's general feeling of well-being on that day. These factors could help in prediction, but since they are unmeasured, they end up in $e$.
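A minimal simulation sketch of this point, assuming a toy linear $f$ and Gaussian noise (both made up for illustration): even when we plug in the true $f$, the mean squared error cannot drop below $Var(e)$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical true relationship: Y = f(X) + e, with f(x) = 2x + 1
# and unmeasured noise e ~ N(0, 1), so Var(e) = 1.
def f(x):
    return 2 * x + 1

n = 100_000
X = rng.uniform(0, 10, size=n)
e = rng.normal(0, 1, size=n)   # stands in for the unmeasured variables
Y = f(X) + e

# Predict with the *true* f, i.e. f' = f (the best possible case):
mse = np.mean((Y - f(X)) ** 2)
print(f"MSE using the true f: {mse:.3f}")  # ~1.0 = Var(e), the floor
```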

Inference

In some other cases, we are also interested in understanding the way that $Y$ is affected as $X$ changes. In this situation, we wish to estimate $f$, but the goal is not necessarily to make predictions for $Y$. Instead, we want to understand how $Y$ changes as a function of $X_1, ..., X_n$. Now, $f'$ cannot be treated as a black box, as we need to know its exact form.

FAQs of Inference (p.19):

  1. Which predictors ($X$) are associated with the response ($Y$)? Often only a small fraction of the available predictors are substantially associated with $Y$. Identifying the few important predictors among a large set of possible predictors is extremely useful. This relates to a technique called Feature Selection (a small sketch follows this list).
  2. What is the relationship between the response $Y$ and each predictor? It could be a positive relationship, in the sense that increasing the predictor is associated with increasing the value of $Y$, or a negative one. Depending on the complexity of $f$, the relationship between the response and a given predictor may also depend on the values of the other predictors.
  3. Can the relationship between $Y$ and $X$ be summarized by a linear equation, or is it more complicated? Historically, most methods for estimating $f$ have taken a linear form, which is reasonable in some situations. But often the true relationship is more complicated, in which case a linear model may not provide an accurate enough representation of the relationship between the input and the output.
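A minimal feature-selection sketch for FAQ 1, using the lasso (one standard approach; the synthetic data and the choice of `alpha` are assumptions for illustration): coefficients of unimportant predictors are shrunk to exactly zero, so the non-zero ones are the "few important predictors".

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 10))               # 10 candidate predictors
# By construction, only the first two predictors truly matter:
y = 4 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
selected = np.flatnonzero(lasso.coef_)       # predictors with non-zero weight
print("Selected predictors:", selected)      # typically [0 1]
```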

Depending on whether our ultimate goal is prediction, inference, or both, different ways of estimating $f$ may be appropriate. For example, a linear model allows a simple interpretation but may not yield predictions as accurate as other approaches. On the other hand, some highly non-linear approaches provide quite accurate predictions for $Y$, but at the cost of a less interpretable model.

The trade-off between Prediction Accuracy and Model Interpretability

Why would one want to use a more restrictive model (a simpler model) rather than a more flexible one (a more complex model)? If we are interested in Inference, then restrictive models are much more interpretable. For instance, a linear model is a good choice since it is quite easy to understand the relationship between $Y$ and $X_1, X_2, ..., X_n$. In contrast, very flexible approaches (bagging, boosting, SVMs with non-linear kernels) can lead to such complicated estimates of $f$ that it is difficult to understand how any individual predictor is associated with the response $Y$, as the sketch below illustrates.
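A short sketch of this contrast, assuming scikit-learn and synthetic data (both illustration choices, not from ISL): the linear fit exposes one readable coefficient per predictor, while a boosted ensemble spreads its estimate of $f$ across many trees, with no single number summarizing a predictor's effect.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))                 # 3 toy predictors
y = 3.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=500)

linear = LinearRegression().fit(X, y)
print("Linear coefficients:", linear.coef_)   # directly readable effects

boosted = GradientBoostingRegressor().fit(X, y)
# The flexible estimate of f is distributed across the whole ensemble:
print("Trees in the ensemble:", boosted.n_estimators)
```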

Flexible Model vs Overfitting

Consider a case in which we seek to develop an algorithm to predict the price of a stock, and our sole requirement is that the algorithm predict as accurately as possible; interpretability is not a concern here. In this setting, we might expect it to be best to use the most flexible model available. Surprisingly, this is not always the case. We will often obtain more accurate predictions using a less flexible method. This phenomenon, which may seem counterintuitive at first glance, relates to the problem of Overfitting, which often occurs with highly flexible methods.
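A toy demonstration of this, assuming a made-up sinusoidal $f$ and noisy samples: the very flexible degree-15 polynomial chases the training noise and typically does worse on fresh data than a modest degree-3 fit.

```python
import numpy as np
from numpy.polynomial import polynomial as P

rng = np.random.default_rng(1)
f = lambda x: np.sin(3 * x)                   # hypothetical true f
x_train = rng.uniform(-1, 1, 30)
x_test = rng.uniform(-1, 1, 30)
y_train = f(x_train) + rng.normal(0, 0.3, 30)
y_test = f(x_test) + rng.normal(0, 0.3, 30)

for degree in (3, 15):
    coefs = P.polyfit(x_train, y_train, degree)   # fit on training data
    test_mse = np.mean((P.polyval(x_test, coefs) - y_test) ** 2)
    print(f"degree {degree:2d}: test MSE = {test_mse:.3f}")
```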

Example of Inference and Prediction

ISL - p.20
