I was discussing with Johannes how to calculate confidence intervals for a growth model, and more generally the differences between confidence, credible, prediction, and tolerance intervals. I need to write that up too. I had written him a formula for deriving the SEs from the Hessian matrix of the optim output, or using the numDeriv::hessian function. However, our beloved myGenAssist did not agree with me. After a clarification with the assistant, we reached an agreement.
If you are maximizing the log-likelihood, then the negative of the Hessian is the observed information (the sample analogue of the Fisher information matrix). If you minimize a deviance, \(-2\log L\), then half of the Hessian is the observed information. The standard error of each parameter is then the square root of the corresponding diagonal element of the inverse of the observed information matrix. In the unlikely event that you are maximizing the likelihood itself, you need to divide the negative of the Hessian by the likelihood to get the observed information.
The negative of the second derivative of the log-likelihood evaluated at the maximum likelihood estimate (MLE) is the observed Fisher information. Most optimization routines in R, such as optim, can return this Hessian. If the negative log-likelihood is minimized, the returned Hessian is the observed information itself, and the standard errors are the square roots of the diagonal of its inverse.
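To make the scaling rules concrete, here is a minimal sketch, assuming a simple normal model with made-up data (none of this is from the examples below). It fits the same model by minimizing the negative log-likelihood and by minimizing the deviance, and recovers the same standard errors from the two differently scaled Hessians:

set.seed(1)
y <- rnorm(50, mean = 10, sd = 2)
nll <- function(p) -sum(dnorm(y, p[1], abs(p[2]), log = TRUE))  # -log L
dev <- function(p) 2 * nll(p)                                   # deviance, -2 log L
p0 <- c(mean(y), sd(y))
f1 <- optim(p0, nll, hessian = TRUE)
f2 <- optim(p0, dev, hessian = TRUE)
sqrt(diag(solve(f1$hessian)))      # minimized -log L: Hessian is the observed information
sqrt(diag(solve(f2$hessian / 2)))  # minimized deviance: half the Hessian is

Up to numerical error, the two lines print the same standard errors.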
Relationship between ML and least squares
Least squares estimates \(\hat{\mu}^\text{LS}\) as the value of \(\mu\) that minimizes \(\sum_i (y_i-\mu)^2\). The MLE \(\hat{\mu}^\text{MLE}\) maximizes the likelihood function \(L(\mu) = \prod_i f(y_i;\mu)\), or equivalently the log-likelihood \(\sum_i \log f(y_i;\mu)\).
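To see the relationship, note that for i.i.d. normal observations \(y_i \sim N(\mu, \sigma^2)\) the log-likelihood is
\[
\sum_i \log f(y_i;\mu) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_i (y_i-\mu)^2,
\]
so for fixed \(\sigma\), maximizing the log-likelihood over \(\mu\) is exactly minimizing \(\sum_i (y_i-\mu)^2\): under normal errors, \(\hat{\mu}^\text{MLE} = \hat{\mu}^\text{LS}\).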
library(minpack.lm)

###### example 1
## values over which to simulate data
x <- seq(0, 5, length = 100)
## model based on a list of parameters
getPred <- function(parS, xx) parS$a * exp(xx * parS$b) + parS$c
## parameter values used to simulate data
pp <- list(a = 9, b = -1, c = 6)
## simulated data, with noise
simDNoisy <- getPred(pp, x) + rnorm(length(x), sd = .1)
## plot data
plot(x, simDNoisy, main = "data")
## residual function
residFun <- function(p, observed, xx) observed - getPred(p, xx)
## starting values for parameters
parStart <- list(a = 3, b = -.001, c = 1)
## perform fit
nls.out <- nls.lm(par = parStart, fn = residFun, observed = simDNoisy,
                  xx = x, control = nls.lm.control(nprint = 1))
It. 0, RSS = 1980.87, Par. = 3 -0.001 1
It. 1, RSS = 330.796, Par. = 5.25147 -0.149225 3.23654
It. 2, RSS = 103.008, Par. = 6.69678 -0.30771 4.27851
It. 3, RSS = 51.6017, Par. = 7.49744 -0.424799 4.70064
It. 4, RSS = 31.3998, Par. = 7.67694 -0.714156 6.11027
It. 5, RSS = 2.73777, Par. = 8.73454 -0.99892 6.16331
It. 6, RSS = 1.08719, Par. = 9.00372 -0.988844 5.98595
It. 7, RSS = 1.08713, Par. = 9.004 -0.989232 5.98589
It. 8, RSS = 1.08713, Par. = 9.004 -0.989231 5.98589
## plot model evaluated at final parameter estimates
lines(x, getPred(as.list(coef(nls.out)), x), col = 2, lwd = 2)

## summary information on parameter estimates
summary(nls.out)
Parameters:
Estimate Std. Error t value Pr(>|t|)
a 9.00400 0.04531 198.71 <2e-16 ***
b -0.98923 0.01097 -90.16 <2e-16 ***
c 5.98589 0.02076 288.32 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1059 on 97 degrees of freedom
Number of iterations to termination: 8
Reason for termination: Relative error in the sum of squares is at most `ftol'.
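These standard errors can be reproduced by hand from the Hessian of the residual sum of squares. The sketch below is a quick check, not part of the original minpack.lm example; it assumes the objects x, simDNoisy, getPred, and nls.out from the code above, and that numDeriv is installed. Near the solution the Hessian of the RSS is approximately \(2 J^\top J\), so half of it is the matrix to invert and scale by the residual variance:

library(numDeriv)
rss <- function(p) sum((simDNoisy - getPred(as.list(p), x))^2)
H <- numDeriv::hessian(rss, coef(nls.out))
resvar <- nls.out$deviance / (length(x) - 3)  # residual variance on 97 df
sqrt(diag(solve(H / 2)) * resvar)             # ~0.045, ~0.011, ~0.021, as reported above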
Anyway, I first did a search on Hessian, likelihood, and Fisher information matrix and came across [this post](https://stats.stackexchange.com/questions/381984/standard-error-from-hessian-matrix-when-likelihood-is-used-rather-than-ln-l). Below I reproduce the results of that post.
Denote \(l(x) = \ln L(x)\) and write \('\) for differentiation. Then
\[
l'(x) = \frac{L'(x)}{L(x)}, \qquad
l''(x) = \frac{L''(x)}{L(x)} - \left(\frac{L'(x)}{L(x)}\right)^2.
\]
At the MLE \(\hat{x}\) we have \(L'(\hat{x}) = 0\), so \(l''(\hat{x}) = L''(\hat{x})/L(\hat{x})\): the Hessian of the likelihood must be divided by the likelihood itself to recover the Hessian of the log-likelihood.
The problem seems to be round-off error, or objective functions being much, much smaller than the default stopping-rule thresholds, as WarrenWeckesser and JimB pointed out.
x <- c(1, 3, 4, 5, 7.5, 10, 12, 23, 39, 40)
# The *exact* MLE for the normal distribution is
# mu = mean(x), sigma = sqrt(var(x) * (length(x) - 1) / length(x))
p0 <- c(mean = mean(x), sd = sd(x))
Llikelihood <- function(par, x) {
  sum(dnorm(x, mean = par["mean"], sd = abs(par["sd"]), log = TRUE))
}
o1 <- optim(par = p0, fn = Llikelihood, x = x, hessian = TRUE, method = "CG",
            control = list(fnscale = -1, reltol = 1e-12, maxit = 10000))
cat("\n*** Using log-likelihood ***\n\n")
*** Using log-likelihood ***
cat("Estimated parameters:\n")
Estimated parameters:
o1$par
mean sd
14.45000 13.83194
cat("\nEstimated Hessian:\n")
Estimated Hessian:
o1$hessian
mean sd
mean -0.05226777 0.0000000
sd 0.00000000 -0.1045355
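Because fnscale = -1 makes optim maximize, the Hessian returned here is that of the log-likelihood itself and is negative definite, so the standard errors come from the inverse of its negative. The values below are recomputed from the Hessian printed above, with the closed-form asymptotic SEs for the normal model, \(\hat\sigma/\sqrt{n}\) and \(\hat\sigma/\sqrt{2n}\), as a sanity check:

sqrt(diag(solve(-o1$hessian)))      # mean ~4.374, sd ~3.093
o1$par["sd"] / sqrt(length(x))      # ~4.374 = sigma-hat / sqrt(n)
o1$par["sd"] / sqrt(2 * length(x))  # ~3.093 = sigma-hat / sqrt(2n)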
Here is the analogous computation from a Python code base, with numdifftools supplying the Hessian of the residual sum of squares (imports added here for completeness):

import logging
import warnings
import numpy as np
import numdifftools as nd

# `residual`, `result`, `resvar`, and `scale_factor` come from the surrounding code
warnings.filterwarnings("ignore", category=RuntimeWarning,
                        message="overflow encountered in square")
# the warning above occurs when the RSS is calculated and a Hessian step is too large
H = nd.Hessian(residual)(np.copy(result.x))
warnings.filterwarnings("default", category=RuntimeWarning)  # turn warnings on again
C = np.linalg.inv(2 * H)  # discuss further why the Hessian should be multiplied by two
V = np.diag(C)
SE = np.sqrt(V * resvar)
SE[0] = SE[0] * scale_factor  # rescale the standard error for M0
logging.info("Hessian matrix calculated successfully.")