-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathchapter4.qmd
551 lines (442 loc) · 17.9 KB
/
chapter4.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
---
title: "Chapter 4 - Cox Proportional Hazards Regression"
---
$$
\def\T{\mathrm{T}}
$$
## Slides
Lecture slides [here](chap4.html){target="_blank"}. (To convert html to pdf, press E $\to$ Print $\to$ Destination: Save to pdf)
## Chapter Summary
The Cox proportional hazards model is a most popular semiparametric regression model for time-to-event data. The partial likelihood provides a key tool for making inferences on parametric covariate effects in the presence of a nonparametric time trend.
### Basic model specification
The Cox proportional hazards model expresses each subject’s hazard rate as a product of a baseline function and an exponential term involving covariates. For a subject with covariates $Z$, the hazard at time $t$ is
$$
\lambda(t \mid Z) = \lambda_0(t)\exp(\beta^{\rm T} Z).
$$
This decomposition leaves $\lambda_0(t)$ fully unspecified, focusing on the log-hazard coefficients $\beta$. Under proportional hazards, any two covariate sets differ by a constant hazard ratio over time, enabling a direct interpretation of $\exp(\beta_k)$ as the risk multiplier per unit increase in the $k$-th covariate.
### The partial-likelihood approach
Estimation of $\beta$ proceeds through a partial-likelihood function that isolates each event’s contribution while conditioning on the subjects at risk. Suppose the observed sample has unique failure times $t_1 < \dots < t_m$, each with a single failing subject denoted by $(j)$. Let $\mathcal{R}_j$ be the risk set at $t_j$, meaning the indices of all subjects still under observation just before $t_j$.
{fig-align="center" width="70%"}
Then the Cox partial likelihood is
$$
PL(\beta)
=
\prod_{j=1}^m
\frac{
\exp(\beta^\T Z_{(j)})
}{
\sum_{i \in \mathcal{R}_j}
\exp(\beta^\T Z_i)
},
$$
where $Z_{(j)}$ is the covariate vector for the subject who fails at $t_j$. This construction eliminates $\lambda_0(t)$ from the likelihood by focusing on how the failures are “allocated” among those at risk.
Taking logarithms leads to the log-partial likelihood,
$$
\log PL(\beta)
=
\sum_{j=1}^m
\left[
\beta^\T Z_{(j)}
\;-\;
\log \sum_{i \in \mathcal{R}_j}
\exp\{\beta^\T Z_i\}
\right].
$$
We then differentiate to obtain the partial-likelihood score function,
$$
U(\beta)
=
\frac{\partial}{\partial \beta}
\log PL(\beta)
=
\sum_{i=1}^n
\delta_i
\Biggl[
Z_i
-
\frac{
\sum_{j=1}^n
I(X_j \ge X_i)\,
Z_j\,
\exp(\beta^\T Z_j)
}{
\sum_{j=1}^n
I(X_j \ge X_i)\,
\exp(\beta^\T Z_j)
}
\Biggr],
$$
which is set to zero to solve for the maximum partial-likelihood estimator $\hat{\beta}$. The resulting $\hat{\beta}$ inherits many large-sample properties from classical likelihood theory, including approximate normality and consistent variance estimation. Standard errors, Wald tests, and confidence intervals for hazard ratios $\exp(\hat{\beta}_k)$ are then derived in a manner analogous to parametric models.
### Residual-based diagnostics
Model adequacy is assessed via residuals that reveal potential violations of proportional hazards or mis-specification of covariate functional forms.
- **Cox–Snell residuals** transform observed times and event indicators so that, under correct modeling, they should behave like censored samples from an exponential distribution with unit mean.
- **Schoenfeld residuals** target the time constancy of each covariate’s hazard ratio. If a covariate’s effect changes over time, its Schoenfeld residuals display systematic trends.
- **Martingale and deviance residuals** detect nonlinearity or outliers in continuous covariates by measuring how observed outcomes differ from fitted values. Plotting these against each covariate often shows whether transformations or alternative model structures are needed.
### Time-varying covariates
Some covariates evolve during follow-up (internal) or follow external factors independent of the subject’s health status (external). A time-varying $Z(t)$ replaces the static $Z$ in the hazard model:
$$
\lambda(t \mid Z(\cdot)) = \lambda_0(t)\exp\{\beta^\T Z(t)\}.
$$
For internal covariates, the hazard depends on the covariate history only through its most recent value. Estimation still uses the same partial-likelihood framework, but data must be organized in “start–stop” segments that reflect changing covariate levels over time.
### Statistical properties via martingale
Large-sample properties of the partial-likelihood estimator follow from writing the score function as a sum of martingale integrals. This perspective simplifies the derivation of asymptotic variance formulas for $\hat{\beta}$ and helps justify robust inferences under independent censoring. The Breslow estimator of the baseline hazard, $\hat{\Lambda}_0(t)$, also arises through martingale arguments, matching how the Nelson–Aalen estimator can be viewed for simpler nonparametric settings.
### Software use
Fitting a Cox model in R can be done with just a few commands from the `survival` package:
```{r}
#| eval: false
library(survival)
# Example data frame 'mydata' with columns:
# time: Event or censoring time
# status: Event indicator (1 if event observed, 0 if censored)
# x1, x2: Covariates
# group: A grouping variable (e.g., for stratification)
# 1. Fit a basic Cox model
obj <- coxph(Surv(time, status) ~ x1 + x2, data = mydata)
# 2. Summarize the fitted model
summary(obj)
# This provides coefficient estimates (log-hazard), hazard ratios (exp(coef)),
# standard errors, confidence intervals, and p-values.
# 3. Obtain Breslow estimates of the baseline cumulative hazard
base_haz <- basehaz(obj, centered = FALSE)
head(base_haz)
# 4. Extract residuals for diagnostics
mart_res <- residuals(obj, type = "martingale") # Martingale residuals
sch <- cox.zph(obj) # Schoenfeld residuals and tests
# 5. Fit a stratified Cox model
obj_str <- coxph(Surv(time, status) ~ x1 + x2 + strata(group), data = mydata)
# 6. Fit a Cox model with time-varying covariates
# (Requires data in long "start-stop" format)
obj_tv <- coxph(Surv(start, stop, event) ~ x1 + tv_cov, data = tv_data)
```
Beyond these base functions, other packages offer enhanced output and plots:
- `gtsummary` can produce formatted tables of estimated coefficients, hazard ratios, and confidence intervals in a single step using `tbl_regression(obj, exponentiate = TRUE)`.
- `survminer` provides visualization tools: `ggcoxzph(cox.zph(obj))` to diagnose proportional hazards via Schoenfeld residuals, and `ggforest(obj)` to show a forest plot of hazard ratios.
- `ggsurvfit` simplifies the creation of Kaplan–Meier curves alongside at-risk tables, p-values, and confidence intervals.
### Conclusion
The Cox proportional hazards model accommodates flexible, nonparametric baseline hazards while preserving a straightforward regression interpretation of covariate effects. Partial-likelihood estimation, together with various model-based residuals, allows rigorous inferences of
covariate effects and time trends. Time-varying covariates further expand its applicability to complex longitudinal settings. By combining robust inference with interpretability, the Cox model remains a central tool in regression-based survival analysis.
## R Code
```{r}
#| code-fold: true
#| code-summary: "Show the code"
#| eval: false
###############################################################################
# Chapter 4 R Code
#
# This script reproduces all major numerical results in Chapter 4, including:
# 1. A Cox proportional hazards model on the German Breast Cancer (GBC) study
# (Figure 4.2, Table 4.1)
# 2. Residual analysis and model diagnostics (Figures 4.3–4.6, Tables 4.2–4.4)
# 3. Cox model with time-varying covariates for Stanford Heart study (Table 4.6)
###############################################################################
#==============================================================================
# (A) Cox Proportional Hazards Model on GBC Data
# (Section 4.2.6, Figure 4.2)
#==============================================================================
library(survival)
library(tidyverse) # for data manipulation and visualization
#------------------------------------------------------------------------------
# 1. Read GBC data, subset to first events, set up for Cox PH
#------------------------------------------------------------------------------
gbc <- read.table("Data//German Breast Cancer Study//gbc.txt")
# Sort the data by time within each id
o <- order(gbc$id, gbc$time)
gbc <- gbc[o,]
# Keep the first row for each id (first event)
df <- gbc[!duplicated(gbc$id), ]
# Set status = 1 if status == 2 or 1
df$status <- (df$status > 0) + 0
#------------------------------------------------------------------------------
# 2. Fit the Cox proportional hazards model
#------------------------------------------------------------------------------
# Convert categorical variables to factors
df$hormone <- factor(df$hormone)
df$meno <- factor(df$meno)
df$grade <- factor(df$grade)
# Reduce the scales of prog and estrg by 100
df$prog <- df$prog / 100
df$estrg <- df$estrg / 100
obj <- coxph(
Surv(time, status) ~ hormone + meno + age + size + grade + prog + estrg,
data = df
)
# beta estimates
obj$coefficients
# variance matrix
obj$var
# Summarize results
summary(obj)
# Wald test on tumor grade (H_0: beta_grade2 = beta_grade3 = 0)
beta_q <- obj$coefficients[5:6]
Sigma_q <- obj$var[5:6, 5:6]
chisq2 <- t(beta_q) %*% solve(Sigma_q) %*% beta_q
pval <- 1 - pchisq(chisq2, 2)
pval
#------------------------------------------------------------------------------
# 3. Breslow estimates of the baseline cumulative hazard
#------------------------------------------------------------------------------
Lambda0 <- basehaz(obj, centered = FALSE)
# Plot the baseline hazard function
plot(
stepfun(Lambda0$time, c(0, Lambda0$hazard)),
do.points = FALSE,
cex.axis = 0.9,
lwd = 2,
frame.plot = FALSE,
xlim = c(0, 100),
ylim = c(0, 0.8),
xlab = "Time (months)",
ylab = "Baseline cumulative hazard",
main = ""
)
#------------------------------------------------------------------------------
# 4. Enhanced tabulation via gtsummary
#------------------------------------------------------------------------------
library(gtsummary)
obj <- coxph(
Surv(time, status) ~ hormone + meno + age + size + grade + prog + estrg,
data = df
)
tbl <- tbl_regression(
obj,
exponentiate = TRUE,
label = list(
hormone = "Hormone",
meno = "Menopause",
age = "Age (years)",
size = "Tumor size (mm)",
grade = "Tumor grade (1-3)",
prog = "Progesterone (fmol/mg)",
estrg = "Estrogen (fmol/mg)"
)
)
tbl
#------------------------------------------------------------------------------
# 5. Forest plot via survminer
#------------------------------------------------------------------------------
library(survminer)
ggforest(obj, data = df)
#==============================================================================
# (B) Residual Analysis (Figures 4.3–4.6)
#==============================================================================
############################
# Cox-Snell residuals, etc.
############################
#------------------------------------------------------------------------------
# 1. Cox-Snell residuals
#------------------------------------------------------------------------------
coxsnellres <- df$status - resid(obj, type = "martingale")
# Then use the N–A method to estimate the cumulative hazard of residuals
fit <- survfit(Surv(coxsnellres, df$status) ~ 1)
Htilde <- cumsum(fit$n.event / fit$n.risk)
par(mfrow = c(1, 1))
plot(
log(fit$time), log(Htilde),
xlab = "log(t)",
ylab = "log-cumulative hazard",
frame.plot = FALSE,
xlim = c(-8, 2),
ylim = c(-8, 2)
)
abline(0, 1, lty = 3, lwd = 1.5)
# ggplot version of Cox-Snell
tibble(
time = fit$time,
Htilde = Htilde
) %>%
ggplot(aes(x = log(time), y = log(Htilde))) +
geom_point(size = 2) +
geom_abline(intercept = 0, slope = 1, linetype = "dashed", linewidth = 1) +
xlab("log(t)") +
ylab("Log-cumulative hazard") +
theme_minimal() +
coord_fixed()
ggsave("images/cox_cox_snell.png", width = 4, height = 4)
ggsave("images/cox_cox_snell.eps", width = 4, height = 4)
#------------------------------------------------------------------------------
# 2. Schoenfeld residuals (test proportional hazards)
#------------------------------------------------------------------------------
sch <- cox.zph(obj)
sch$table # p-values for each covariate and the global test
# Base R plots
par(mar = c(4, 4, 2, 2), mfrow = c(4, 2))
plot(
sch,
xlab = "Time (months)",
lwd = 2,
cex.lab = 1.2,
cex.axis= 1.2,
ylab = c("Hormone", "Menopause", "Age", "Tumor size",
"Tumor grade", "Progesterone", "Estrogen")
)
# survminer version
ggcoxzph(sch)
#------------------------------------------------------------------------------
# 3. Re-fit with stratification by tumor grade
#------------------------------------------------------------------------------
obj_stra <- coxph(
Surv(time, status) ~ hormone + meno + age + size + prog + estrg + strata(grade),
data = df
)
# Check PH with stratification
sch_stra <- cox.zph(obj_stra)
round(sch_stra$table, 4)
ggcoxzph(sch_stra)
# Residual plots for stratified model
par(mfrow = c(3, 2))
plot(
sch_stra,
xlab = "Time (months)",
lwd = 2,
cex.lab = 1.2,
cex.axis = 1.2,
ylab = c("Hormone", "Menopause", "Age", "Tumor size",
"Progesterone", "Estrogen")
)
#------------------------------------------------------------------------------
# 4. Martingale/deviance residuals
#------------------------------------------------------------------------------
mart_resid <- resid(obj_stra, type = "martingale")
dev_resid <- resid(obj_stra, type = "deviance")
# Plot martingale residuals vs. quantitative covariates
par(mfrow = c(2, 2))
# Age
plot(
df$age, mart_resid,
xlab = "Age (years)",
ylab = "Martingale residuals",
main = "Age",
cex.lab = 1.2,
cex.axis= 1.2
)
lines(lowess(df$age, mart_resid), lwd = 2)
abline(0, 0, lty = 3, lwd = 2)
# Tumor size
plot(
df$size, mart_resid,
xlab = "Tumor size (mm)",
ylab = "Martingale residuals",
main = "Tumor size",
cex.lab = 1.2,
cex.axis= 1.2
)
lines(lowess(df$size, mart_resid), lwd = 2)
abline(0, 0, lty = 3, lwd = 2)
# Progesterone
plot(
df$prog, mart_resid,
xlab = "Progesterone receptor (100 fmol/mg)",
ylab = "Martingale Residuals",
main = "Progesterone",
cex.lab = 1.2,
cex.axis= 1.2
)
lines(lowess(df$prog, mart_resid), lwd = 2)
abline(0, 0, lty = 3, lwd = 2)
# Estrogen
plot(
df$estrg, mart_resid,
xlab = "Estrogen receptor (100 fmol/mg)",
ylab = "Martingale Residuals",
main = "Estrogen",
cex.lab = 1.2,
cex.axis= 1.2
)
lines(lowess(df$estrg, mart_resid), lwd = 2)
abline(0, 0, lty = 3, lwd = 2)
#------------------------------------------------------------------------------
# 5. Addressing non-linear age effect
#------------------------------------------------------------------------------
df$agec <- (df$age <= 40) + 2 * (df$age > 40 & df$age <= 60) + 3 * (df$age > 60)
df$agec <- factor(df$agec)
obj_stra_final <- coxph(
Surv(time, status) ~ hormone + meno + agec + size + prog + estrg + strata(grade),
data = df
)
final_sum <- summary(obj_stra_final)
final_sum
# Tabulate final model results
tbl_final <- tbl_regression(
obj_stra_final,
exponentiate = TRUE,
label = list(
hormone = "Hormone",
meno = "Menopause",
agec = "Age group",
size = "Tumor size (mm)",
prog = "Progesterone (100 fmol/mg)",
estrg = "Estrogen (100 fmol/mg)"
)
)
# Print table
tbl_final
# Plot the age-group-specific HR and confidence intervals (Figure 4.6)
ci.table <- final_sum$conf.int
hr <- ci.table[3:4, 1]
hr.low <- ci.table[3:4, 3]
hr.up <- ci.table[3:4, 4]
par(mfrow = c(1, 1))
plot(
1:3, c(1, hr),
ylim = c(0, 1.2),
frame = FALSE,
xaxt = 'n',
xlab = "Age (years)",
ylab = "Hazard ratio",
pch = 19,
cex = 1.5,
cex.lab = 1.2,
cex.axis = 1.2
)
axis(1, at = c(1, 2, 3), labels = c("(20, 40]", "(40, 60]", "(60, 80]"), cex.axis = 1.2)
arrows(2:3, hr.low, 2:3, hr.up, length = 0.05, angle = 90, code = 3, lwd = 2)
lines(1:3, c(1, hr), lty = 3, lwd = 2)
# ggplot version
tibble(
age_group = c("(20, 40]", "(40, 60]", "(60, 80]"),
HR = c(1, hr),
HR_low = c(1, hr.low),
HR_up = c(1, hr.up)
) %>%
ggplot(aes(x = age_group, y = HR)) +
geom_point(size = 3) +
geom_errorbar(aes(ymin = HR_low, ymax = HR_up), width = 0.2, linewidth = 0.8) +
geom_line(group = 1, linetype = "dashed", linewidth = 0.8) +
xlab("Age group (years)") +
ylab("Hazard ratio") +
theme_minimal() +
scale_y_log10(
limits = c(0.125, 1.05),
breaks = c(0.125, 0.25, 0.5, 0.75, 1),
labels = scales::label_percent(accuracy = 1)
) +
theme(axis.text = element_text(size = 10))
ggsave("images/cox_nonlinear_age.png", width = 7.5, height = 4)
ggsave("images/cox_nonlinear_age.eps", width = 7.5, height = 4)
#==============================================================================
# (C) Illustration with Time-Varying Covariates
# (Stanford Heart Study)
#==============================================================================
head(heart)
# Rename variable "year" -> "accpt"
colnames(heart)[5] <- "accpt"
# Sample size
n <- length(unique(heart$id))
#------------------------------------------------------------------------------
# 1. Fit a Cox model with a time-dependent "transplant" variable
#------------------------------------------------------------------------------
obj <- coxph(Surv(start, stop, event) ~ age + accpt + surgery + transplant, data = heart)
summary(obj)
# Tabulate using gtsummary
heart_tbl <- tbl_regression(
obj,
exponentiate = TRUE,
label = list(
age = "Age (years)",
accpt = "Acceptance",
surgery = "Surgery",
transplant= "Transplant"
)
)
heart_tbl
```