GEE and categorical responses

code

analysis

Try using GEE for categorical responses

Author

Zhenglei Gao

Published

January 14, 2025

Use the data from previous posts

Multinomial or ordinal GEE

Data: Rheumatoid self-assessment scores for 302 patients, measured on a five-level ordinal response scale at three follow-up times.

library(multgee)

Warning: package 'multgee' was built under R version 4.4.2

Loading required package: gnm

Warning: package 'gnm' was built under R version 4.4.2

data(arthritis)
intrinsic.pars(y, arthritis, id, time) ## assess the underlying association pattern.

[1] 0.6517843 0.9097341 0.9022272

fitmod <- ordLORgee(formula = y ~ factor(time) + factor(trt) + factor(baseline),
  data = arthritis, id = id, repeated = time, LORstr = "uniform")
summary(fitmod)

GEE FOR ORDINAL MULTINOMIAL RESPONSES 
version 1.6.0 modified 2017-07-10 

Link : Cumulative logit 

Local Odds Ratios:
Structure:         uniform
Model:             3way

call:
ordLORgee(formula = y ~ factor(time) + factor(trt) + factor(baseline), 
    data = arthritis, id = id, repeated = time, LORstr = "uniform")

Summary of residuals:
      Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
-0.5161111 -0.2399227 -0.0749670  0.0000219 -0.0066995  0.9932912 

Number of Iterations: 5 

Coefficients:
                  Estimate   san.se   san.z Pr(>|san.z|)    
beta10            -1.84315  0.38929 -4.7346      < 2e-16 ***
beta20             0.26692  0.35013  0.7624      0.44585    
beta30             2.23132  0.36625  6.0924      < 2e-16 ***
beta40             4.52542  0.42123 10.7434      < 2e-16 ***
factor(time)3      0.00140  0.12183  0.0115      0.99080    
factor(time)5     -0.36172  0.11395 -3.1743      0.00150 ** 
factor(trt)2      -0.51212  0.16799 -3.0486      0.00230 ** 
factor(baseline)2 -0.66963  0.38036 -1.7605      0.07832 .  
factor(baseline)3 -1.26070  0.35252 -3.5763      0.00035 ***
factor(baseline)4 -2.64373  0.41282 -6.4041      < 2e-16 ***
factor(baseline)5 -3.96613  0.53164 -7.4602      < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Local Odds Ratios Estimates:
       [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9] [,10] [,11] [,12]
 [1,] 0.000 0.000 0.000 0.000 2.257 2.257 2.257 2.257 2.257 2.257 2.257 2.257
 [2,] 0.000 0.000 0.000 0.000 2.257 2.257 2.257 2.257 2.257 2.257 2.257 2.257
 [3,] 0.000 0.000 0.000 0.000 2.257 2.257 2.257 2.257 2.257 2.257 2.257 2.257
 [4,] 0.000 0.000 0.000 0.000 2.257 2.257 2.257 2.257 2.257 2.257 2.257 2.257
 [5,] 2.257 2.257 2.257 2.257 0.000 0.000 0.000 0.000 2.257 2.257 2.257 2.257
 [6,] 2.257 2.257 2.257 2.257 0.000 0.000 0.000 0.000 2.257 2.257 2.257 2.257
 [7,] 2.257 2.257 2.257 2.257 0.000 0.000 0.000 0.000 2.257 2.257 2.257 2.257
 [8,] 2.257 2.257 2.257 2.257 0.000 0.000 0.000 0.000 2.257 2.257 2.257 2.257
 [9,] 2.257 2.257 2.257 2.257 2.257 2.257 2.257 2.257 0.000 0.000 0.000 0.000
[10,] 2.257 2.257 2.257 2.257 2.257 2.257 2.257 2.257 0.000 0.000 0.000 0.000
[11,] 2.257 2.257 2.257 2.257 2.257 2.257 2.257 2.257 0.000 0.000 0.000 0.000
[12,] 2.257 2.257 2.257 2.257 2.257 2.257 2.257 2.257 0.000 0.000 0.000 0.000

p-value of Null model: < 0.0001

confint(fitmod)

                       2.5 %      97.5 %
beta10            -2.6061497 -1.08015721
beta20            -0.4193163  0.95314883
beta30             1.5134867  2.94915281
beta40             3.6998267  5.35101277
factor(time)3     -0.2373805  0.24018865
factor(time)5     -0.5850535 -0.13837735
factor(trt)2      -0.8413778 -0.18287050
factor(baseline)2 -1.4151270  0.07587068
factor(baseline)3 -1.9516257 -0.56978096
factor(baseline)4 -3.4528305 -1.83462340
factor(baseline)5 -5.0081187 -2.92413445

library(ggstats)
ggcoef_multinom(fitmod)

ℹ multgee model detected.

✔ `tidy_multgee()` used instead.

ℹ Add `tidy_fun = broom.helpers::tidy_multgee` to quiet these messages.

What is the difference between Ordinal GEE and ordinal logistic regression?

GEE Basics

It is a quasilikelihood approach.
Could include cluster/group correlation structure(e.g., repeated measurements from one subject)
No distributional assumption.
Could handle binary or categorical responses (multinomial).
Computationally simpler for categorical compared to maximum likelihood approach.
No likelihood based post hoc testing, comparison or inference.

URL <- "http://static.lib.virginia.edu/statlab/materials/data/depression.csv"
dat <- read.csv(URL, stringsAsFactors = TRUE)
dat$id <- factor(dat$id)
dat$drug <- relevel(dat$drug, ref = "standard")
head(dat, n = 3)

  diagnose     drug id time depression
1     mild standard  1    0          1
2     mild standard  1    1          1
3     mild standard  1    2          1

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

with(dat, tapply(depression, list(diagnose, drug, time), mean)) %>% 
  ftable() %>% 
  round(2)

                    0    1    2
                               
mild   standard  0.51 0.59 0.68
       new       0.53 0.79 0.97
severe standard  0.21 0.28 0.46
       new       0.18 0.50 0.83

##install.packages("gee")
library(gee) # version 4.13-25

Warning: package 'gee' was built under R version 4.4.2

dep_gee <- gee(depression ~ diagnose + drug*time,
               data = dat, 
               id = id, 
               family = binomial,
               corstr = "independence")

Beginning Cgee S-function, @(#) geeformula.q 4.13 98/01/27

running glm to get initial regression estimate

   (Intercept) diagnosesevere        drugnew           time   drugnew:time 
   -0.02798843    -1.31391092    -0.05960381     0.48241209     1.01744498

summary(dep_gee)


 GEE:  GENERALIZED LINEAR MODELS FOR DEPENDENT DATA
 gee S-function, version 4.13 modified 98/01/27 (1998) 

Model:
 Link:                      Logit 
 Variance to Mean Relation: Binomial 
 Correlation Structure:     Independent 

Call:
gee(formula = depression ~ diagnose + drug * time, id = id, data = dat, 
    family = binomial, corstr = "independence")

Summary of Residuals:
        Min          1Q      Median          3Q         Max 
-0.94844242 -0.40683252  0.05155758  0.38830952  0.80242231 


Coefficients:
                  Estimate Naive S.E.    Naive z Robust S.E.   Robust z
(Intercept)    -0.02798843  0.1627083 -0.1720160   0.1741865 -0.1606808
diagnosesevere -1.31391092  0.1453432 -9.0400569   0.1459845 -9.0003423
drugnew        -0.05960381  0.2205812 -0.2702126   0.2285385 -0.2608042
time            0.48241209  0.1139224  4.2345663   0.1199350  4.0222784
drugnew:time    1.01744498  0.1874132  5.4288855   0.1876938  5.4207709

Estimated Scale Parameter:  0.9854113
Number of Iterations:  1

Working Correlation
     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0    1    0
[3,]    0    0    1

res <- glm(depression ~ diagnose + drug*time,
               data = dat,
               family = quasibinomial)
summary(res)


Call:
glm(formula = depression ~ diagnose + drug * time, family = quasibinomial, 
    data = dat)

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)    -0.02799    0.16271  -0.172    0.863    
diagnosesevere -1.31391    0.14534  -9.040  < 2e-16 ***
drugnew        -0.05960    0.22058  -0.270    0.787    
time            0.48241    0.11392   4.235 2.50e-05 ***
drugnew:time    1.01744    0.18741   5.429 7.09e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for quasibinomial family taken to be 0.9854187)

    Null deviance: 1411.9  on 1019  degrees of freedom
Residual deviance: 1161.9  on 1015  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 4

The “Estimated Scale Parameter” is reported as 0.9854113 = the dispersion parameter = similar to what is reported (Dispersion parameter for quasibinomial family taken to be 0.9854187) when running glm() with family = quasibinomial in order to model over-dispersion.

dep_gee2 <- gee(depression ~ diagnose + drug*time,
               data = dat, 
               id = id, 
               family = binomial,
               corstr = "exchangeable")

Beginning Cgee S-function, @(#) geeformula.q 4.13 98/01/27

running glm to get initial regression estimate

   (Intercept) diagnosesevere        drugnew           time   drugnew:time 
   -0.02798843    -1.31391092    -0.05960381     0.48241209     1.01744498

summary(dep_gee2)


 GEE:  GENERALIZED LINEAR MODELS FOR DEPENDENT DATA
 gee S-function, version 4.13 modified 98/01/27 (1998) 

Model:
 Link:                      Logit 
 Variance to Mean Relation: Binomial 
 Correlation Structure:     Exchangeable 

Call:
gee(formula = depression ~ diagnose + drug * time, id = id, data = dat, 
    family = binomial, corstr = "exchangeable")

Summary of Residuals:
        Min          1Q      Median          3Q         Max 
-0.94843397 -0.40683122  0.05156603  0.38832332  0.80238627 


Coefficients:
                  Estimate Naive S.E.    Naive z Robust S.E.   Robust z
(Intercept)    -0.02809866  0.1625499 -0.1728617   0.1741791 -0.1613205
diagnosesevere -1.31391033  0.1448627 -9.0700418   0.1459630 -9.0016667
drugnew        -0.05926689  0.2205340 -0.2687427   0.2285569 -0.2593091
time            0.48246420  0.1141154  4.2278625   0.1199383  4.0226037
drugnew:time    1.01719312  0.1877051  5.4191018   0.1877014  5.4192084

Estimated Scale Parameter:  0.985392
Number of Iterations:  2

Working Correlation
             [,1]         [,2]         [,3]
[1,]  1.000000000 -0.003432732 -0.003432732
[2,] -0.003432732  1.000000000 -0.003432732
[3,] -0.003432732 -0.003432732  1.000000000

dep_gee3 <- gee(depression ~ diagnose + drug*time,
               data = dat, 
               id = id, 
               family = binomial,
               corstr = "AR-M", Mv = 1)

Beginning Cgee S-function, @(#) geeformula.q 4.13 98/01/27
running glm to get initial regression estimate

   (Intercept) diagnosesevere        drugnew           time   drugnew:time 
   -0.02798843    -1.31391092    -0.05960381     0.48241209     1.01744498

dep_gee3$working.correlation

             [,1]        [,2]         [,3]
[1,] 1.000000e+00 0.008477443 7.186704e-05
[2,] 8.477443e-03 1.000000000 8.477443e-03
[3,] 7.186704e-05 0.008477443 1.000000e+00

library(lme4) # version 1.1-31

Loading required package: Matrix


Attaching package: 'Matrix'

The following objects are masked from 'package:tidyr':

    expand, pack, unpack

dep_glmer <- glmer(depression ~ diagnose + drug*time + (1|id), 
               data = dat, family = binomial)
summary(dep_glmer, corr = FALSE)

Generalized linear mixed model fit by maximum likelihood (Laplace
  Approximation) [glmerMod]
 Family: binomial  ( logit )
Formula: depression ~ diagnose + drug * time + (1 | id)
   Data: dat

     AIC      BIC   logLik deviance df.resid 
  1173.9   1203.5   -581.0   1161.9     1014 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-4.2849 -0.8268  0.2326  0.7964  2.0181 

Random effects:
 Groups Name        Variance Std.Dev.
 id     (Intercept) 0.003229 0.05682 
Number of obs: 1020, groups:  id, 340

Fixed effects:
               Estimate Std. Error z value Pr(>|z|)    
(Intercept)    -0.02796    0.16407  -0.170    0.865    
diagnosesevere -1.31488    0.15261  -8.616  < 2e-16 ***
drugnew        -0.05968    0.22240  -0.268    0.788    
time            0.48273    0.11566   4.174 3.00e-05 ***
drugnew:time    1.01817    0.19150   5.317 1.06e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

library(ggeffects) # version 1.2.0

Warning: package 'ggeffects' was built under R version 4.4.2

plot(ggemmeans(dep_glmer, terms = c("time", "drug"), 
               condition = c(diagnose = "severe"))) + 
  ggplot2::ggtitle("GLMER Effect plot")

You are calculating adjusted predictions on the population-level (i.e.
  `type = "fixed"`) for a *generalized* linear mixed model.
  This may produce biased estimates due to Jensen's inequality. Consider
  setting `bias_correction = TRUE` to correct for this bias.
  See also the documentation of the `bias_correction` argument.

library(emmeans) # version 1.8.4-1

Warning: package 'emmeans' was built under R version 4.4.2

Welcome to emmeans.
Caution: You lose important information if you filter this package's results.
See '? untidy'

emm_out <- emmeans(dep_gee2, specs = c("time", "drug"), 
        at = list(diagnose = "severe"), 
        cov.keep = "time", 
        regrid = "response") %>% 
  as.data.frame()
library(ggplot2) # version 3.4.1
ggplot(emm_out) +
  aes(x = time, y = prob, color = drug, fill = drug) +
  geom_line() +
  geom_ribbon(aes(ymin = asymp.LCL, ymax = asymp.UCL, 
                  color = NULL), alpha = 0.15) +
  scale_color_brewer(palette = "Set1") +
  scale_fill_brewer(palette = "Set1", guide = NULL) +
  scale_y_continuous(labels = scales::percent) +
  theme_ggeffects() +
  labs(title = "GEE Effect plot", y = "depression")

library(geepack) # version 1.3.9

Warning: package 'geepack' was built under R version 4.4.2

dep_gee4 <- geeglm(depression ~ diagnose + drug*time,
                   data = dat,
                   id = id,
                   family = binomial,
                   corstr = "exchangeable")
# create effect plot
plot(ggemmeans(dep_gee4, terms = c("time", "drug"),
               condition = c(diagnose = "severe"))) + 
  ggplot2::ggtitle("GEE Effect plot")

References

Additional references on survey data

survey analysis