Skip to content

Commit

Permalink
Issue 57: Polish get started vignette (#59)
Browse files Browse the repository at this point in the history
* Move package load, and put data.table comment into ^[]

* Remove mention of compare section (merged into model)

* Reduce number of code lines a little

* Use ref for the table as hoped!

* Add primary and secondary sentence

* Add text about Figure 2.2

* Add clarification on Figure 1.4

* Improvements to Figure 2.1

* Using gt and dplyr here

* Update to use the retrospective data

* Improve writing about histograms, and fix colour typo

* Downplay censoring less

* Rewrite ref:obs-est caption

* Add dplyr to Suggests

* Update vignettes/epidist.Rmd

Co-authored-by: Sam Abbott <s.e.abbott12@gmail.com>

---------

Co-authored-by: Sam Abbott <s.e.abbott12@gmail.com>
Former-commit-id: 980991b
Former-commit-id: def0c7a7d77cc490c9dacd694702162e5f2e27ce
Former-commit-id: ada96fedcc73b5f5e78259dbd55b3d16434c604b [formerly 92ea6a2]
Former-commit-id: 116b162375f492fcd69c3cda9a187defee3a4eac
  • Loading branch information
athowes and seabbs authored May 30, 2024
1 parent ea86d50 commit 2ebe008
Show file tree
Hide file tree
Showing 2 changed files with 42 additions and 39 deletions.
3 changes: 2 additions & 1 deletion DESCRIPTION
Original file line number Diff line number Diff line change
Expand Up @@ -35,7 +35,8 @@ Suggests:
testthat (>= 3.0.0),
readxl,
janitor,
gt
gt,
dplyr
Remotes:
stan-dev/cmdstanr,
Rdatatable/data.table,
Expand Down
78 changes: 40 additions & 38 deletions vignettes/epidist.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -30,35 +30,36 @@ knitr::opts_chunk$set(
)
```

```{r load-requirements}
library(epidist)
library(data.table)
library(purrr)
library(ggplot2)
```

Many quantities in epidemiology can be thought of as the time between two events, or "delays".
Many quantities in epidemiology can be thought of as the time between two events or "delays".
Important examples include:

* the incubation period (time between infection and symptom onset),
* serial interval (time between symptom onset of infectee and symptom onset of infected), and
* generation interval (time between infection of infectee and infection of infected).

We encompass all of these delays as the time between a "primary event" and "secondary event".
Unfortunately, estimating delays accurately from observational data is usually difficult.
In our experience, the two main issues are:

1. interval censoring, and
2. right truncation.

Don't worry if you've not come across these terms before.
In Section \@ref(data), we will explain what they mean by simulating data like that we might observe during an ongoing infectious disease outbreak.
Next, in Section \@ref(fit), we show how `epidist` can be used to estimate delays using a statistical model which properly accounts for these two issues.
Finally, in Section \@ref(compare), we demonstrate that the fitted delay distribution accurately recovers the underlying truth.
First, in Section \@ref(data), we explain interval censoring and right truncation by simulating data like that we might observe during an ongoing infectious disease outbreak.
Then, in Section \@ref(fit), we show how `epidist` can be used to accurately estimate delay distributions by using a statistical model which properly accounts for these two issues.

If you would like more technical details, the `epidist` package implements models following best practices as described in @park2024estimating and @charniga2024best.

Finally, to run this vignette yourself, you will need the `data.table`, `purrr` and `ggplot2` packages installed.
Note that to work with outputs from `epidist` you do not need to use `data.table`: any tool of your preference is suitable.
To run this vignette yourself, as well as the `epidist` package, you will need the `data.table`^[Note that to work with outputs from `epidist` you do not need to use `data.table`: any tool of your preference is suitable!], `purrr`, `ggplot2`, `gt`, and `dplyr` packages installed.

```{r load-requirements}
library(epidist)
library(data.table)
library(purrr)
library(ggplot2)
library(gt)
library(dplyr)
```

# Example data {#data}

Expand Down Expand Up @@ -128,8 +129,7 @@ obs <- outbreak |>
obs[case %% 50 == 0, ] |>
ggplot(aes(y = case)) +
geom_segment(
aes(x = ptime, xend = stime, y = case, yend = case),
col = "grey"
aes(x = ptime, xend = stime, y = case, yend = case), col = "grey"
) +
geom_point(aes(x = ptime), col = "#56B4E9") +
geom_point(aes(x = stime), col = "#009E73") +
Expand All @@ -156,7 +156,7 @@ Here we suppose that the interval is daily, meaning that only the date of the pr
obs_cens <- obs |> observe_process()
```

(ref:cens) Interval censoring of the primary and secondary event times obscures the delay times. While daily censoring is most common, `epidist` supports the primary and secondary events having other delay intervals.
(ref:cens) Interval censoring of the primary and secondary event times obscures the delay times. A common example of this is when events are reported as daily aggregates. While daily censoring is most common, `epidist` supports the primary and secondary events having other delay intervals.

```{r cens, fig.cap="(ref:cens)"}
ggplot(obs_cens, aes(x = delay, y = delay_daily)) +
Expand Down Expand Up @@ -201,29 +201,31 @@ With our censored, truncated, and sampled data, we are now ready to try to recov

# Fit the model and compare estimates {#fit}

If we had access to the data `obs`, then it would be simple to estimate the delay distribution.
If we had access to the complete and unaltered `obs`, it would be simple to estimate the delay distribution.
However, with only access to `obs_cens_trunc_samp`, the delay distribution we observe is biased (Figure \@ref(fig:obs-est)) and we must use a statistical model.

(ref:obs-est) The histogram of delays from `obs` matches closely the underlying distribution, whereas those from `obs_cens_trunc_samp` are systematically biased.
(ref:obs-est) The histogram of delays from the complete, retrospective data `obs_cens` match quite closely with the underlying distribution, whereas those from `obs_cens_trunc_samp` show more significant systematic bias. In this instance the extent of the bias caused by censoring is less than that caused by right truncation. Nonetheless, we always recommend [@charniga2024best; Table 2] adjusting for censoring when it is present.

```{r obs-est, fig.cap="(ref:obs-est)"}
dplyr::bind_rows(
obs |>
dplyr::mutate(type = "Full data") |>
dplyr::select(delay, type),
obs_cens_trunc |>
obs_cens |>
dplyr::mutate(type = "Complete, retrospective data") |>
dplyr::select(delay = delay_daily, type),
obs_cens_trunc_samp |>
dplyr::mutate(type = "Censored, truncated,\nsampled data") |>
dplyr::select(delay, type)
dplyr::select(delay = delay_daily, type)
) |>
dplyr::group_by(type, delay, .drop = FALSE) |>
dplyr::summarise(n = dplyr::n()) |>
dplyr::mutate(p = n / sum(n)) |>
ggplot() +
geom_histogram(
aes(x = delay, y = ..density.., fill = type, group = type),
position = "dodge"
geom_col(
aes(x = delay, y = p, fill = type, group = type),
position = position_dodge2(preserve = "single")
) +
scale_fill_manual(values = c("#0072B2", "#D55E00")) +
scale_fill_manual(values = c("#CC79A7", "#0072B2")) +
geom_function(
data = data.frame(x = c(0, 30)),
aes(x = x),
data = data.frame(x = c(0, 30)), aes(x = x),
fun = dlnorm,
args = list(
meanlog = secondary_dist[["meanlog"]],
Expand Down Expand Up @@ -254,15 +256,15 @@ The `fit` object is a `brmsfit` object containing MCMC samples from each of the
Users familiar with Stan and `brms`, can work with `fit` directly.
Any tool that supports `brms` fitted model objects will be compatible with `fit`.

(ref:pars) All of the parameters that are included in the model. Many of these parameters (e.g. `swindow` and `pwindow`) the so called latent variables in the model, and have lengths corresponding to the `sample_size`.

```{r pars}
pars <- fit$fit@par_dims |>
map(.f = function(x) if (identical(x, integer(0))) return(1) else return(x))
data.frame("Parameter" = names(pars), "Length" = unlist(pars)) |>
gt::gt(caption =
"All of the parameters that are included in the model.
Many of these parameters are the so called latent variables in the model."
)
gt::gt() |>
gt::tab_caption("(ref:pars)")
```

The `epidist` package also provides functions to make common post-processing tasks easy.
Expand All @@ -272,9 +274,10 @@ For example, samples of the fitted lognormal `meanlog` and `sdlog` parameter can
draws <- extract_lognormal_draws(fit)
```

Figure \@ref(fig:fitted-lognormal) shows...
Figure \@ref(fig:fitted-lognormal) shows the lognormal delay distribution obtained using the average of the `meanlog` and `sdlog` draws.
Whereas in Figure \@ref(fig:obs-est) the histogram of censored, truncated, sampled data was substantially different to the underlying delay distribution, using `epidist` we have obtained a much closer match to the truth.

(ref:fitted-lognormal) Figure caption.
(ref:fitted-lognormal) A fitted delay distribution (in pink) as compared with the true delay distribution (in black).

```{r fitted-lognormal, fig.cap="(ref:fitted-lognormal)"}
ggplot() +
Expand All @@ -289,13 +292,12 @@ ggplot() +
) +
geom_function(
data = data.frame(x = c(0, 30)),
aes(x = x),
fun = dlnorm,
aes(x = x), fun = dlnorm,
args = list(
meanlog = mean(draws$meanlog),
sdlog = mean(draws$sdlog)
),
col = "#D55E00"
col = "#CC79A7"
) +
labs(
x = "Delay between primary and secondary event (days)",
Expand Down

0 comments on commit 2ebe008

Please sign in to comment.