Skip to content

Instantly share code, notes, and snippets.

@benjaminmgross
Last active July 25, 2023 23:45
Show Gist options
  • Save benjaminmgross/d71f161d48378d34b6970fa6d7378837 to your computer and use it in GitHub Desktop.
Save benjaminmgross/d71f161d48378d34b6970fa6d7378837 to your computer and use it in GitHub Desktop.
Predicted R-Squared (r2, r^2) Calculation in `python`
def press_statistic(y_true, y_pred, xs):
"""
Calculation of the `Press Statistics <https://www.otexts.org/1580>`_
"""
res = y_pred - y_true
hat = xs.dot(np.linalg.pinv(xs))
den = (1 - np.diagonal(hat))
sqr = np.square(res/den)
return sqr.sum()
def predicted_r2(y_true, y_pred, xs):
"""
Calculation of the `Predicted R-squared <https://rpubs.com/RatherBit/102428>`_
"""
press = press_statistic(y_true=y_true,
y_pred=y_pred,
xs=xs
)
sst = np.square( y_true - y_true.mean() ).sum()
return 1 - press / sst
def r2(y_true, y_pred):
"""
Calculation of the unadjusted r-squared, goodness of fit metric
"""
sse = np.square( y_pred - y_true ).sum()
sst = np.square( y_true - y_true.mean() ).sum()
return 1 - sse/sst
@benjaminmgross
Copy link
Author

@opoyraz:

  • den is just an abbreviation for "denominator" (it's calculated in the function, so it's not an input)
  • xs are the independent variable -- oftentimes referred to in a standard form, like $f(x) = y$
  • hat is just another intermediate calculation in the process -- but a description of the often-referred to "Hat Matrix"

Please see this link for more details on the PRESS Statistic in it linear algebra form.

@kinghendrix10
Copy link

Hello, I'm a bit new to this.

What is the value of y_true and y_pred?

@RafeyIqbalRahman
Copy link

Hello, I'm a bit new to this.

What is the value of y_true and y_pred?

The value of y_true depends on the dataset you are using to test your model. The value of y_pred depends on your model prediction.

@opoyraz
Copy link

opoyraz commented Jul 25, 2020

@opoyraz:

  • den is just an abbreviation for "denominator" (it's calculated in the function, so it's not an input)
  • xs are the independent variable -- oftentimes referred to in a standard form, like $f(x) = y$
  • hat is just another intermediate calculation in the process -- but a description of the often-referred to "Hat Matrix"

Please see this link for more details on the PRESS Statistic in it linear algebra form.

thank you

@SahilSK202
Copy link

I can see function r2 is used nowhere , my guess is we are comparing r2 with predicted_r2 (computed using PRESS), right ? please correct me if I'm getting this wrong.

@xuzhang5788
Copy link

Could you please give us an example to calculate predicted_r2? Many thanks

@dgamero898
Copy link

If xs is the independent variable, it should be a 1-dimensional array, which causes an error in the line hat = xs.dot(np.linalg.pinv(xs)) that reads LinAlgError: 1-dimensional array given. Array must be at least two-dimensional.

Am I missing something here?

@ELHoussineT
Copy link

Not quite sure about press_statistic, can you link how the calculation in press_statistic actually ends up with PRESS?

The link you shared in docstring is now invalid.

Thanks.

@ELHoussineT
Copy link

ELHoussineT commented Apr 19, 2022

Okay, I compared your code:

def press_statistic(y_true, y_pred, xs):
    """
    Calculation of the `Press Statistics <https://www.otexts.org/1580>`_
    """
    res = y_pred - y_true
    hat = xs.dot(np.linalg.pinv(xs))
    den = (1 - np.diagonal(hat))
    sqr = np.square(res/den)
    return sqr.sum()

with the original R implementation:

PRESS <- function(linear.model) {
  #' calculate the predictive residuals
  pr <- residuals(linear.model)/(1-lm.influence(linear.model)$hat)
  #' calculate the PRESS
  PRESS <- sum(pr^2)
  
  return(PRESS)
}

And can confirm the results are the same:

Python:
Screenshot 2022-04-19 at 2 28 14 PM

R:
Screenshot 2022-04-19 at 2 27 43 PM


Thanks.

@ELHoussineT
Copy link

ELHoussineT commented Apr 19, 2022

However, you should make it clear that your code works only for models with an intercept. If you force the intercept to 0 (no intercept), your code will not work due to incorrect SST calculation:

You calculated the centered total sum of squares (SST):

sst  = np.square( y_true - y_true.mean() ).sum()

But the original R implementation calculates the uncentered total sum of squares:

+     #' Use anova() to get the sum of squares for the linear model
+     lm.anova <- anova(linear.model)
+     #' Calculate the total sum of squares
+     sst <- sum(lm.anova$'Sum Sq')

If you use the sample data in my comment above:

  • Your sst (centered) = 640.9
  • Original R implementation sst (uncentered) = 5437

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment