Pseudo \(R^2\)

Categories: code, Stats, R

A discussion on pseudo \(R^2\) as a goodness-of-fit measure

Author: Zhenglei Gao

Published: November 20, 2024

In a discussion with Johannes, I mentioned that goodness-of-fit (GoF) measures only make sense when comparing a model with other models, most of the time with the null or the full model.

General considerations

When used in nonlinear regression, the pseudo \(R^2\) cannot be interpreted as it is in linear regression, namely as the fraction of variance explained by the model. It is best read as a comparison against a reference model.
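For example, McFadden's pseudo \(R^2\) is explicitly a comparison of the fitted model with the intercept-only null model on the log-likelihood scale. A minimal sketch with made-up data:

# McFadden's pseudo R^2: 0 means no improvement over the null model,
# values near 1 mean a near-perfect fit
set.seed(1)
x <- rnorm(200)
y <- rbinom(200, 1, plogis(2 * x))
fit  <- glm(y ~ x, family = binomial)
null <- glm(y ~ 1, family = binomial)
1 - as.numeric(logLik(fit)) / as.numeric(logLik(null))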

\(R^2\) can be misleading

x <- rnorm(100)
y <- 3*x + rnorm(100)
y1 <- y - 0.1                  # y shifted down by a small constant
y2 <- y - 1000 + rnorm(100)    # y shifted down by 1000, plus extra noise
summary(lm(y~x))  ## R^2 = 0.9021

Call:
lm(formula = y ~ x)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.25328 -0.70224  0.08219  0.69038  2.15758 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.02152    0.09338    0.23    0.818    
x            2.96929    0.09883   30.04   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9326 on 98 degrees of freedom
Multiple R-squared:  0.9021,    Adjusted R-squared:  0.9011 
F-statistic: 902.6 on 1 and 98 DF,  p-value: < 2.2e-16
summary(lm(y~y1)) ## R^2 = 1. Is this a good fit? No: y1 is just y shifted, i.e. a biased fit
Warning in summary.lm(lm(y ~ y1)): essentially perfect fit: summary may be
unreliable

Call:
lm(formula = y ~ y1)

Residuals:
       Min         1Q     Median         3Q        Max 
-5.132e-15 -9.190e-17 -9.400e-18  1.499e-16  1.448e-15 

Coefficients:
             Estimate Std. Error   t value Pr(>|t|)    
(Intercept) 1.000e-01  5.710e-17 1.751e+15   <2e-16 ***
y1          1.000e+00  1.935e-17 5.168e+16   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.708e-16 on 98 degrees of freedom
Multiple R-squared:      1, Adjusted R-squared:      1 
F-statistic: 2.671e+33 on 1 and 98 DF,  p-value: < 2.2e-16
summary(lm(y~y2)) ## R^2 = 0.91: still high, even though y2 is off by about 1000; again a biased fit

Call:
lm(formula = y ~ y2)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.22354 -0.57022  0.04311  0.64469  1.97676 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 919.8437    29.5910   31.09   <2e-16 ***
y2            0.9199     0.0296   31.08   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9044 on 98 degrees of freedom
Multiple R-squared:  0.9079,    Adjusted R-squared:  0.907 
F-statistic:   966 on 1 and 98 DF,  p-value: < 2.2e-16
plot(y,y1,ylim=c(-1000,10))
points(y,y2,col="red")
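To see the bias that \(R^2\) ignores, compare agreement directly; rmse below is a small helper of my own:

# R^2 only measures linear association; RMSE against the observations
# exposes the bias
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
rmse(y, y1)  # exactly 0.1: y1 is only slightly shifted
rmse(y, y2)  # about 1000, despite R^2 = 0.91 for lm(y~y2)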

x <- seq(1,100,1)
length(x)
[1] 100
y <- 3*x+rnorm(length(x),0,5)

summary(lm(y~x))

Call:
lm(formula = y ~ x)

Residuals:
     Min       1Q   Median       3Q      Max 
-12.1439  -3.1839  -0.5689   3.1070  17.5196 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.01576    1.05379   1.913   0.0587 .  
x            2.96645    0.01812 163.744   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.229 on 98 degrees of freedom
Multiple R-squared:  0.9964,    Adjusted R-squared:  0.9963 
F-statistic: 2.681e+04 on 1 and 98 DF,  p-value: < 2.2e-16
y1 <- y*0.9 + rnorm(length(x), 0, 0.11) ## systematic error: y1 is 10% too low
plot(x,y)
points(x,y1,col="red")

plot(y,y1)

summary(lm(y~y1))

Call:
lm(formula = y ~ y1)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.35758 -0.08840  0.00361  0.08666  0.29338 

Coefficients:
              Estimate Std. Error  t value Pr(>|t|)    
(Intercept) -0.0112427  0.0249022   -0.451    0.653    
y1           1.1111995  0.0001587 7003.071   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1225 on 98 degrees of freedom
Multiple R-squared:      1, Adjusted R-squared:      1 
F-statistic: 4.904e+07 on 1 and 98 DF,  p-value: < 2.2e-16
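The fitted slope above (about 1.111 = 1/0.9) already betrays the 10% systematic error that the perfect \(R^2\) hides; a direct check (my addition):

# y1 is low by roughly 10% of the mean of y, yet lm(y~y1) reports R^2 = 1
mean(y - y1)
0.1 * mean(y)  # approximately the same value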
summary(lm(y~x))

Call:
lm(formula = y ~ x)

Residuals:
     Min       1Q   Median       3Q      Max 
-12.1439  -3.1839  -0.5689   3.1070  17.5196 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.01576    1.05379   1.913   0.0587 .  
x            2.96645    0.01812 163.744   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.229 on 98 degrees of freedom
Multiple R-squared:  0.9964,    Adjusted R-squared:  0.9963 
F-statistic: 2.681e+04 on 1 and 98 DF,  p-value: < 2.2e-16
ypred <- predict(lm(y~x))
summary(lm(y~ypred))

Call:
lm(formula = y ~ ypred)

Residuals:
     Min       1Q   Median       3Q      Max 
-12.1439  -3.1839  -0.5689   3.1070  17.5196 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.000000   1.064498     0.0        1    
ypred       1.000000   0.006107   163.7   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.229 on 98 degrees of freedom
Multiple R-squared:  0.9964,    Adjusted R-squared:  0.9963 
F-statistic: 2.681e+04 on 1 and 98 DF,  p-value: < 2.2e-16
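This is no accident: for a linear model with an intercept, \(R^2\) is the squared correlation between the response and the fitted values, so regressing y on its own predictions simply returns the model's \(R^2\):

cor(y, ypred)^2  # equals the Multiple R-squared of lm(y~x) above, 0.9964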

Try with some examples

From the pscl package documentation (each quantity is reproduced by hand in the sketch after this list):

  • llh: The log-likelihood from the fitted model
  • llhNull: The log-likelihood from the intercept-only restricted model
  • G2: Minus two times the difference in the log-likelihoods
  • McFadden: McFadden’s pseudo r-squared
  • r2ML: Maximum likelihood pseudo r-squared
  • r2CU: Cragg and Uhler’s pseudo r-squared
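All of these can be recomputed from the two log-likelihoods. The helper below (pr2_by_hand is my own name, not part of pscl) is a minimal sketch that matches pR2's output for a binomial glm:

# Recompute pR2's quantities from the fitted and null log-likelihoods
pr2_by_hand <- function(fit) {
  ll1 <- as.numeric(logLik(fit))                 # llh
  ll0 <- as.numeric(logLik(update(fit, . ~ 1)))  # llhNull (intercept-only refit)
  n   <- nobs(fit)
  r2ml <- 1 - exp(2 * (ll0 - ll1) / n)           # maximum likelihood (Cox & Snell)
  c(llh = ll1, llhNull = ll0,
    G2       = -2 * (ll0 - ll1),
    McFadden = 1 - ll1 / ll0,
    r2ML     = r2ml,
    r2CU     = r2ml / (1 - exp(2 * ll0 / n)))    # Cragg-Uhler / Nagelkerke rescaling
}

Applied to the model fitted in the next example, this returns the same numbers that pR2 prints.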

Example with random data

# Load necessary libraries
library(MASS)
## install.packages("pscl")
library(pscl)  # For the pseudo R-squared function
# Simulate some data
set.seed(123)
n <- 100
x <- rnorm(n)
y <- rbinom(n, 1, prob = 0.5)  # Random binary outcome that is not related to x

# Fit a logistic regression model
model <- glm(y ~ x, family = binomial)

# Calculate pseudo R-squared
pseudo_r2 <- pR2(model)
fitting null model for pseudo-r2
print(pseudo_r2)
          llh       llhNull            G2      McFadden          r2ML 
-68.920601482 -69.234696709   0.628190454   0.004536674   0.006262215 
         r2CU 
  0.008354080 
summary(lm(y~predict(model)))

Call:
lm(formula = y ~ predict(model))

Residuals:
    Min      1Q  Median      3Q     Max 
-0.5428 -0.4866 -0.3965  0.5260  0.5735 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)     0.49998    0.05637   8.870 3.42e-14 ***
predict(model)  0.24817    0.31575   0.786    0.434    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5031 on 98 degrees of freedom
Multiple R-squared:  0.006264,  Adjusted R-squared:  -0.003876 
F-statistic: 0.6177 on 1 and 98 DF,  p-value: 0.4338
# Check the model summary
summary(model)

Call:
glm(formula = y ~ x, family = binomial)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.09639    0.20191  -0.477    0.633
x            0.17543    0.22231   0.789    0.430

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 138.47  on 99  degrees of freedom
Residual deviance: 137.84  on 98  degrees of freedom
AIC: 141.84

Number of Fisher Scoring iterations: 3
# Visualize the fit
plot(x, y, main = "Logistic Regression Fit", xlab = "x", ylab = "y")
points(x, fitted(model), col = "red", pch = 19)
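As a cross-check, the G2 value printed by pR2 above is itself a likelihood ratio statistic (fitted vs. null model, 1 df), and its p-value agrees with the Wald test in the glm summary:

pchisq(0.628190454, df = 1, lower.tail = FALSE)  # about 0.43, matching Pr(>|z|) for x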

Example where pseudo \(R^2\) works (influential points)

# Simulate data with outliers
# Load necessary libraries
library(MASS)
library(pscl)  # For the pseudo R-squared function

# Simulate some data
set.seed(456)
n <- 100
x <- rnorm(n)
x0 <- x
# Create a binary outcome based on a logistic function
y <- rbinom(n, 1, prob = plogis(x))
y0 <- y
# Introduce influential points by adding extreme values to the predictor
r1 <- order(x)
ind <- r1[1:10]       # indices of the 10 smallest x values
x[ind] <- rep(3, 10)  # Make these points influential by shifting x to 3

# Recalculate the binary response based on the new x values
y[ind] <- rbinom(10, 1, prob = plogis(x[ind]))  # Ensure y is binary

# Fit a logistic regression model on the original predictor x0, but with the
# modified y, so the shifted points act as influential mismatches
model_with_influential_points <- glm(y ~ x0, family = binomial)
summary(lm(y~predict(model_with_influential_points)))

Call:
lm(formula = y ~ predict(model_with_influential_points))

Residuals:
    Min      1Q  Median      3Q     Max 
-0.6966 -0.5061  0.3009  0.4337  0.6795 

Coefficients:
                                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)                             0.49994    0.05396   9.265 4.77e-15 ***
predict(model_with_influential_points)  0.24128    0.12965   1.861   0.0657 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4948 on 98 degrees of freedom
Multiple R-squared:  0.03413,   Adjusted R-squared:  0.02428 
F-statistic: 3.463 on 1 and 98 DF,  p-value: 0.06574
model_without_influential_points <- glm(y ~ x, family = binomial)  # modified x, consistent with the modified y
summary(lm(y0~predict(model_without_influential_points)))

Call:
lm(formula = y0 ~ predict(model_without_influential_points))

Residuals:
    Min      1Q  Median      3Q     Max 
-0.6667 -0.4104 -0.3316  0.5131  0.6904 

Coefficients:
                                          Estimate Std. Error t value Pr(>|t|)
(Intercept)                                0.43755    0.05067   8.635  1.1e-13
predict(model_without_influential_points)  0.05975    0.03029   1.973   0.0513
                                             
(Intercept)                               ***
predict(model_without_influential_points) .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4937 on 98 degrees of freedom
Multiple R-squared:  0.03819,   Adjusted R-squared:  0.02838 
F-statistic: 3.892 on 1 and 98 DF,  p-value: 0.05135
# Calculate pseudo R-squared
pseudo_r2_outliers <- pR2(model_with_influential_points)
fitting null model for pseudo-r2
print(pseudo_r2_outliers)
         llh      llhNull           G2     McFadden         r2ML         r2CU 
-67.26284391 -68.99437585   3.46306388   0.02509671   0.03403786   0.04548126 
rbind(pseudo_r2_outliers,pR2(model_without_influential_points))
fitting null model for pseudo-r2
                         llh   llhNull        G2   McFadden       r2ML
pseudo_r2_outliers -67.26284 -68.99438  3.463064 0.02509671 0.03403786
                   -51.49928 -68.99438 34.990197 0.25357282 0.29524283
                         r2CU
pseudo_r2_outliers 0.04548126
                   0.39450231
# Check the model summary
summary(model_with_influential_points)

Call:
glm(formula = y ~ x0, family = binomial)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)   0.1199     0.2051   0.585   0.5588  
x0            0.3830     0.2103   1.821   0.0686 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 137.99  on 99  degrees of freedom
Residual deviance: 134.53  on 98  degrees of freedom
AIC: 138.53

Number of Fisher Scoring iterations: 4
# Predict probabilities
predicted_probabilities <- predict(model_with_influential_points, type = "response")

# Classify predictions (using 0.5 as the cutoff)
predicted_classes <- ifelse(predicted_probabilities > 0.5, 1, 0)

# Calculate classification error
classification_error <- mean(predicted_classes != y)
print(paste("Classification Error:", round(classification_error, 4)))
[1] "Classification Error: 0.42"
# Predict probabilities
predicted_probabilities <- predict(model_without_influential_points, type = "response")

# Classify predictions (using 0.5 as the cutoff)
predicted_classes <- ifelse(predicted_probabilities > 0.5, 1, 0)

# Calculate classification error
classification_error <- mean(predicted_classes != y0)
print(paste("Classification Error:", round(classification_error, 4)))
[1] "Classification Error: 0.3"
# Visualize the fit
plot(x0, y, main = "Logistic Regression Fit with Influential Points", xlab = "x", ylab = "y", ylim = c(-0.1, 1.1))
points(x0, fitted(model_with_influential_points), col = "green", pch = 3)
points(x0[ind], y0[ind]-0.1, col = "red", pch = 3)
points(x0[ind], y[ind]-0.05, col = "darkgreen", pch = 4)
points(x0, fitted(model_without_influential_points), col = "red", pch = 19)

A case where \(R^2\) can be anywhere from 0 to 1 depending on the data

For the very same rule \(y = 1(x^2 > 1)\), the fit quality depends on whether x ranges over (-3, 3), (-3, 6), or (-1, 3); see the sketch after this example.

# Create data that does not fit a logistic model well
set.seed(789)
n <- 100
x <- seq(-3, 6, length.out = n)
y <- ifelse(x^2 > 1, 1, 0)  # Nonlinear relationship
y1 <- ifelse(x > 0, 1, 0)   # a monotone threshold that a logistic model fits almost perfectly (not used below)
# Fit a logistic regression model
model_nonlinear <- glm(y ~ x, family = binomial)

# Calculate pseudo R-squared
pseudo_r2_nonlinear <- pR2(model_nonlinear)
fitting null model for pseudo-r2
print(pseudo_r2_nonlinear)
         llh      llhNull           G2     McFadden         r2ML         r2CU 
-48.80265582 -53.92763415  10.24995667   0.09503436   0.09742146   0.14762782 
summary(lm(y~predict(model_nonlinear)))

Call:
lm(formula = y ~ predict(model_nonlinear))

Residuals:
     Min       1Q   Median       3Q      Max 
-0.74495  0.01252  0.12525  0.23797  0.45545 

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)    
(Intercept)               0.54857    0.07910   6.935 4.38e-10 ***
predict(model_nonlinear)  0.16011    0.04918   3.256  0.00155 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4038 on 98 degrees of freedom
Multiple R-squared:  0.0976,    Adjusted R-squared:  0.08839 
F-statistic:  10.6 on 1 and 98 DF,  p-value: 0.001554
# Check the model summary
summary(model_nonlinear)

Call:
glm(formula = y ~ x, family = binomial)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   0.9136     0.2547   3.588 0.000334 ***
x             0.3129     0.1054   2.969 0.002991 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 107.855  on 99  degrees of freedom
Residual deviance:  97.605  on 98  degrees of freedom
AIC: 101.61

Number of Fisher Scoring iterations: 4
# Visualize the fit
plot(x, y, main = "Logistic Regression Fit for Nonlinear Data", xlab = "x", ylab = "y")
points(x, fitted(model_nonlinear), col = "red", pch = 19)
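As noted above, a small sketch (mcfadden_range is a hypothetical helper of mine) shows how strongly the pseudo \(R^2\) depends on the observed range of x for the very same rule:

# McFadden pseudo R^2 for y = 1(x^2 > 1) observed over different x ranges
mcfadden_range <- function(lo, hi, n = 100) {
  x <- seq(lo, hi, length.out = n)
  y <- ifelse(x^2 > 1, 1, 0)
  m  <- glm(y ~ x, family = binomial)  # on (-1, 3) this warns about separation
  m0 <- glm(y ~ 1, family = binomial)
  as.numeric(1 - logLik(m) / logLik(m0))
}
mcfadden_range(-3, 3)  # near 0: the symmetric design carries no linear signal
mcfadden_range(-3, 6)  # about 0.095, reproducing the fit above
mcfadden_range(-1, 3)  # near 1: y is perfectly separable in x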

Some Additional Notes

  • https://rcompanion.org/rcompanion/b_04.html

G-test = Likelihood ratio test?

observed = c(1752, 1895)    # observed frequencies
expected = c(0.5, 0.5)      # expected proportions

library(DescTools)
GTest(x=observed,
      p=expected,
      correct="none") 

    Log likelihood ratio (G-test) goodness of fit test

data:  observed
G = 5.6085, X-squared df = 1, p-value = 0.01787
# Reproduce the G statistic by hand
observed       = c(1752, 1895)     # observed frequencies
expected.prop  = c(0.5, 0.5)       # expected proportions
expected.count = sum(observed)*expected.prop
degrees        = 1                 # degrees of freedom

G = 2 * sum(observed * log(observed / expected.count))

pchisq(G,
       df=degrees,
       lower.tail=FALSE)
[1] 0.01787343
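So yes: the G-test is a likelihood ratio test. The G statistic is exactly twice the log-likelihood ratio between the saturated multinomial model (probabilities equal to the observed proportions) and the hypothesized one, as a direct computation (my sketch) confirms:

# G as an explicit likelihood ratio: saturated vs. hypothesized multinomial
ll_sat  <- dmultinom(observed, prob = observed / sum(observed), log = TRUE)
ll_null <- dmultinom(observed, prob = expected.prop, log = TRUE)
2 * (ll_sat - ll_null)  # = 5.6085, the G statistic from GTest() above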

\(R^2\) can only increase when adding more variables

An example of an error in reporting \(R^2\): the null (mean) model is different in the two model specifications below, even though they fit the same model (identical fitted values and residual deviance).

Note

r.squared: \(R^2\), the 'fraction of variance explained by the model', \(R^2 = 1 - \sum_i r_i^2 / \sum_i (y_i - y^*)^2\), where \(r_i\) are the residuals and \(y^*\) is the mean of the \(y_i\) if there is an intercept and zero otherwise.

library("sigr")
library("broom")
set.seed(23255)

d <- data.frame(y= runif(100),
                x= ifelse(runif(100)>=0.5, 'a', 'b'),
                stringsAsFactors = FALSE)
m1 <- lm(y~x, data=d)
t(broom::glance(m1))
                       [,1]
r.squared       0.002177326
adj.r.squared  -0.008004538
sigma           0.302851476
statistic       0.213843593
p.value         0.644796456
df              1.000000000
logLik        -21.432440763
AIC            48.864881526
BIC            56.680392084
deviance        8.988463618
df.residual    98.000000000
nobs          100.000000000
d$pred1 <- predict(m1, newdata = d)
cat(render(sigr::wrapFTest(d, 'pred1', 'y'),
           format='markdown'))
**F Test** summary: (<i>R^2^</i>=0.002177, *F*(1,98)=0.2138, *p*=n.s.).
m2 <- lm(y~0+x, data=d)
t(broom::glance(m2))
                       [,1]
r.squared      7.524811e-01
adj.r.squared  7.474297e-01
sigma          3.028515e-01
statistic      1.489647e+02
p.value        1.935559e-30
df             2.000000e+00
logLik        -2.143244e+01
AIC            4.886488e+01
BIC            5.668039e+01
deviance       8.988464e+00
df.residual    9.800000e+01
nobs           1.000000e+02
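The discrepancy comes entirely from the total sum of squares used in the denominator: both fits have the same residual sum of squares (deviance 8.988 in both outputs), but summary.lm centers the response only when there is an intercept. A direct check (my sketch):

rss <- sum(residuals(m2)^2)         # identical for m1 and m2: 8.988
1 - rss / sum((d$y - mean(d$y))^2)  # centered TSS: 0.0022, m1's r.squared
1 - rss / sum(d$y^2)                # uncentered TSS: 0.7525, m2's r.squared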
  • https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0183250