Skip to content

Commit f47b2f6

Browse files
athowesseabbs
andauthored
Issue 57: Polish get started vignette (#59)
* Move package load, and put data.table comment into ^[] * Remove mention of compare section (merged into model) * Reduce number of code lines a little * Use ref for the table as hoped! * Add primary and secondary sentence * Add text about Figure 2.2 * Add clarification on Figure 1.4 * Improvements to Figure 2.1 * Using gt and dplyr here * Update to use the retrospective data * Improve writing about histograms, and fix colour typo * Downplay censoring less * Rewrite ref:obs-est caption * Add dplyr to Suggests * Update vignettes/epidist.Rmd Co-authored-by: Sam Abbott <s.e.abbott12@gmail.com> --------- Co-authored-by: Sam Abbott <s.e.abbott12@gmail.com> Former-commit-id: 84d65de3087d6cf1d7a82f7b58af2a259e013a0e [formerly 02dd4f2ff5270cb5ec11364382499ee6be1a1e8b] Former-commit-id: e487a3dde30d8c0c28b920687aef5cfbdfd4715c
1 parent 3abb8d8 commit f47b2f6

File tree

2 files changed

+42
-39
lines changed

2 files changed

+42
-39
lines changed

DESCRIPTION

+2-1
Original file line numberDiff line numberDiff line change
@@ -35,7 +35,8 @@ Suggests:
3535
testthat (>= 3.0.0),
3636
readxl,
3737
janitor,
38-
gt
38+
gt,
39+
dplyr
3940
Remotes:
4041
stan-dev/cmdstanr,
4142
Rdatatable/data.table,

vignettes/epidist.Rmd

+40-38
Original file line numberDiff line numberDiff line change
@@ -30,35 +30,36 @@ knitr::opts_chunk$set(
3030
)
3131
```
3232

33-
```{r load-requirements}
34-
library(epidist)
35-
library(data.table)
36-
library(purrr)
37-
library(ggplot2)
38-
```
39-
40-
Many quantities in epidemiology can be thought of as the time between two events, or "delays".
33+
Many quantities in epidemiology can be thought of as the time between two events or "delays".
4134
Important examples include:
4235

4336
* the incubation period (time between infection and symptom onset),
4437
* serial interval (time between symptom onset of infectee and symptom onset of infected), and
4538
* generation interval (time between infection of infectee and infection of infected).
4639

40+
We encompass all of these delays as the time between a "primary event" and "secondary event".
4741
Unfortunately, estimating delays accurately from observational data is usually difficult.
4842
In our experience, the two main issues are:
4943

5044
1. interval censoring, and
5145
2. right truncation.
5246

5347
Don't worry if you've not come across these terms before.
54-
In Section \@ref(data), we will explain what they mean by simulating data like that we might observe during an ongoing infectious disease outbreak.
55-
Next, in Section \@ref(fit), we show how `epidist` can be used to estimate delays using a statistical model which properly accounts for these two issues.
56-
Finally, in Section \@ref(compare), we demonstrate that the fitted delay distribution accurately recovers the underlying truth.
48+
First, in Section \@ref(data), we explain interval censoring and right truncation by simulating data like that we might observe during an ongoing infectious disease outbreak.
49+
Then, in Section \@ref(fit), we show how `epidist` can be used to accurately estimate delay distributions by using a statistical model which properly accounts for these two issues.
5750

5851
If you would like more technical details, the `epidist` package implements models following best practices as described in @park2024estimating and @charniga2024best.
5952

60-
Finally, to run this vignette yourself, you will need the `data.table`, `purrr` and `ggplot2` packages installed.
61-
Note that to work with outputs from `epidist` you do not need to use `data.table`: any tool of your preference is suitable.
53+
To run this vignette yourself, as well as the `epidist` package, you will need the `data.table`^[Note that to work with outputs from `epidist` you do not need to use `data.table`: any tool of your preference is suitable!], `purrr`, `ggplot2`, `gt`, and `dplyr` packages installed.
54+
55+
```{r load-requirements}
56+
library(epidist)
57+
library(data.table)
58+
library(purrr)
59+
library(ggplot2)
60+
library(gt)
61+
library(dplyr)
62+
```
6263

6364
# Example data {#data}
6465

@@ -128,8 +129,7 @@ obs <- outbreak |>
128129
obs[case %% 50 == 0, ] |>
129130
ggplot(aes(y = case)) +
130131
geom_segment(
131-
aes(x = ptime, xend = stime, y = case, yend = case),
132-
col = "grey"
132+
aes(x = ptime, xend = stime, y = case, yend = case), col = "grey"
133133
) +
134134
geom_point(aes(x = ptime), col = "#56B4E9") +
135135
geom_point(aes(x = stime), col = "#009E73") +
@@ -156,7 +156,7 @@ Here we suppose that the interval is daily, meaning that only the date of the pr
156156
obs_cens <- obs |> observe_process()
157157
```
158158

159-
(ref:cens) Interval censoring of the primary and secondary event times obscures the delay times. While daily censoring is most common, `epidist` supports the primary and secondary events having other delay intervals.
159+
(ref:cens) Interval censoring of the primary and secondary event times obscures the delay times. A common example of this is when events are reported as daily aggregates. While daily censoring is most common, `epidist` supports the primary and secondary events having other delay intervals.
160160

161161
```{r cens, fig.cap="(ref:cens)"}
162162
ggplot(obs_cens, aes(x = delay, y = delay_daily)) +
@@ -201,29 +201,31 @@ With our censored, truncated, and sampled data, we are now ready to try to recov
201201

202202
# Fit the model and compare estimates {#fit}
203203

204-
If we had access to the data `obs`, then it would be simple to estimate the delay distribution.
204+
If we had access to the complete and unaltered `obs`, it would be simple to estimate the delay distribution.
205205
However, with only access to `obs_cens_trunc_samp`, the delay distribution we observe is biased (Figure \@ref(fig:obs-est)) and we must use a statistical model.
206206

207-
(ref:obs-est) The histogram of delays from `obs` matches closely the underlying distribution, whereas those from `obs_cens_trunc_samp` are systematically biased.
207+
(ref:obs-est) The histogram of delays from the complete, retrospective data `obs_cens` match quite closely with the underlying distribution, whereas those from `obs_cens_trunc_samp` show more significant systematic bias. In this instance the extent of the bias caused by censoring is less than that caused by right truncation. Nonetheless, we always recommend [@charniga2024best; Table 2] adjusting for censoring when it is present.
208208

209209
```{r obs-est, fig.cap="(ref:obs-est)"}
210210
dplyr::bind_rows(
211-
obs |>
212-
dplyr::mutate(type = "Full data") |>
213-
dplyr::select(delay, type),
214-
obs_cens_trunc |>
211+
obs_cens |>
212+
dplyr::mutate(type = "Complete, retrospective data") |>
213+
dplyr::select(delay = delay_daily, type),
214+
obs_cens_trunc_samp |>
215215
dplyr::mutate(type = "Censored, truncated,\nsampled data") |>
216-
dplyr::select(delay, type)
216+
dplyr::select(delay = delay_daily, type)
217217
) |>
218+
dplyr::group_by(type, delay, .drop = FALSE) |>
219+
dplyr::summarise(n = dplyr::n()) |>
220+
dplyr::mutate(p = n / sum(n)) |>
218221
ggplot() +
219-
geom_histogram(
220-
aes(x = delay, y = ..density.., fill = type, group = type),
221-
position = "dodge"
222+
geom_col(
223+
aes(x = delay, y = p, fill = type, group = type),
224+
position = position_dodge2(preserve = "single")
222225
) +
223-
scale_fill_manual(values = c("#0072B2", "#D55E00")) +
226+
scale_fill_manual(values = c("#CC79A7", "#0072B2")) +
224227
geom_function(
225-
data = data.frame(x = c(0, 30)),
226-
aes(x = x),
228+
data = data.frame(x = c(0, 30)), aes(x = x),
227229
fun = dlnorm,
228230
args = list(
229231
meanlog = secondary_dist[["meanlog"]],
@@ -254,15 +256,15 @@ The `fit` object is a `brmsfit` object containing MCMC samples from each of the
254256
Users familiar with Stan and `brms`, can work with `fit` directly.
255257
Any tool that supports `brms` fitted model objects will be compatible with `fit`.
256258

259+
(ref:pars) All of the parameters that are included in the model. Many of these parameters (e.g. `swindow` and `pwindow`) the so called latent variables in the model, and have lengths corresponding to the `sample_size`.
260+
257261
```{r pars}
258262
pars <- fit$fit@par_dims |>
259263
map(.f = function(x) if (identical(x, integer(0))) return(1) else return(x))
260264
261265
data.frame("Parameter" = names(pars), "Length" = unlist(pars)) |>
262-
gt::gt(caption =
263-
"All of the parameters that are included in the model.
264-
Many of these parameters are the so called latent variables in the model."
265-
)
266+
gt::gt() |>
267+
gt::tab_caption("(ref:pars)")
266268
```
267269

268270
The `epidist` package also provides functions to make common post-processing tasks easy.
@@ -272,9 +274,10 @@ For example, samples of the fitted lognormal `meanlog` and `sdlog` parameter can
272274
draws <- extract_lognormal_draws(fit)
273275
```
274276

275-
Figure \@ref(fig:fitted-lognormal) shows...
277+
Figure \@ref(fig:fitted-lognormal) shows the lognormal delay distribution obtained using the average of the `meanlog` and `sdlog` draws.
278+
Whereas in Figure \@ref(fig:obs-est) the histogram of censored, truncated, sampled data was substantially different to the underlying delay distribution, using `epidist` we have obtained a much closer match to the truth.
276279

277-
(ref:fitted-lognormal) Figure caption.
280+
(ref:fitted-lognormal) A fitted delay distribution (in pink) as compared with the true delay distribution (in black).
278281

279282
```{r fitted-lognormal, fig.cap="(ref:fitted-lognormal)"}
280283
ggplot() +
@@ -289,13 +292,12 @@ ggplot() +
289292
) +
290293
geom_function(
291294
data = data.frame(x = c(0, 30)),
292-
aes(x = x),
293-
fun = dlnorm,
295+
aes(x = x), fun = dlnorm,
294296
args = list(
295297
meanlog = mean(draws$meanlog),
296298
sdlog = mean(draws$sdlog)
297299
),
298-
col = "#D55E00"
300+
col = "#CC79A7"
299301
) +
300302
labs(
301303
x = "Delay between primary and secondary event (days)",

0 commit comments

Comments
 (0)