
Issue 35: Draft of methods #37

Draft: wants to merge 11 commits into main

Conversation

kaitejohnson
Collaborator

@kaitejohnson kaitejohnson commented Feb 12, 2025

Description

This PR closes #35.

This is a write-up of the methods being implemented to:

  • estimate a delay distribution
    • from a complete reporting matrix
    • using imputation of the point nowcast
  • estimate a point nowcast from an incomplete reporting triangle and delay distribution
  • estimate observation error in a nowcast from a complete or partially observed reporting triangle
    • iteratively and retrospectively re-estimate the delay distribution and re-compute nowcasts, in order to simulate the out-of-sample predictive error that would have been made in the past

Note @seabbs: this could benefit from #34 to preview the new article, but I am stuck on that 404 error (which I think comes from the website name differing from what is expected, though I'm not sure, since the website is working fine on main).

Checklist

  • My PR is based on a package issue and I have explicitly linked it.
  • I have included the target issue or issues in the PR title in the form Issue(s) issue-numbers: PR title
  • I have read the contribution guidelines.
  • I have tested my changes locally.
  • I have added or updated unit tests where necessary.
  • I have updated the documentation if required.
  • My code follows the established coding standards.
  • I have added a news item linked to this PR.
  • I have reviewed CI checks for this PR and addressed them as far as I am able.


codecov bot commented Feb 17, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 99.44%. Comparing base (5d23bd5) to head (1ae52c0).
Report is 37 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #37      +/-   ##
==========================================
+ Coverage   91.48%   99.44%   +7.95%     
==========================================
  Files           5        6       +1     
  Lines          94      180      +86     
==========================================
+ Hits           86      179      +93     
+ Misses          8        1       -7     


@kaitejohnson kaitejohnson marked this pull request as ready for review February 17, 2025 11:55
@kaitejohnson kaitejohnson requested a review from seabbs February 17, 2025 11:56
@seabbs
Collaborator

seabbs commented Feb 20, 2025

method 2: iteratively and retrospectively re-compute nowcasts using the same single point estimate of the delay distribution

Why is this permutation in here? Is it because we plan to support it or otherwise?

Collaborator

@seabbs seabbs left a comment

Partial review, more coming shortly

%\VignetteEncoding{UTF-8}
---

# `baselinenowcast` mathematical model
Collaborator

Pedantic point, but we don't really need the package name in the docs of the package?


The following describes the estimate of the delay distribution, the generation of the point nowcast, and the estimate of the observation error, for a partially observed or complete reporting triangle. The method assumes that the units of the delays, $d$, and the units of reference time $t$ are the same, e.g. weekly data and weekly releases or daily data with daily releases. This method is based on the method described by ([Wolffram et al. 2023](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1011394)) developed by the Karlsruhe Institute of Technology.
Collaborator

As we spoke about at the conference, it more assumes that the reporting triangle is contiguous, doesn't it? So you could make it support daily and weekly etc. without changing the method, just by adding in preprocessing of the reporting triangle?

Collaborator Author

As discussed f2f, I will reword this to explain that the method is time-unit agnostic, and so it is flexible enough to handle any combination of reference and reporting date, but the data will need to be pre-processed such that the units of both are the same...


Collaborator

Suggested change
The following describes the estimate of the delay distribution, the generation of the point nowcast, and the estimate of the observation error, for a partially observed or complete reporting triangle. The method assumes that the units of the delays, $d$, and the units of reference time $t$ are the same, e.g. weekly data and weekly releases or daily data with daily releases. This method is based on the method described by ([Wolffram et al. 2023](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1011394)) developed by the Karlsruhe Institute of Technology.
The following describes the estimate of the delay distribution, the generation of the point nowcast, and the estimate of the observation error, for a partially observed or complete reporting triangle. The method assumes that the units of the delays, $d$, and the units of reference time $t$ are the same, e.g. weekly data and weekly releases or daily data with daily releases. This method is based on the method described by ([Wolffram et al. 2023](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1011394)).



### Notation
Collaborator

Before getting going with it, I really like the idea of a high-level overview of the general approach in a single paragraph. Perhaps with some refs to similar approaches and pointing out what it is not similar to. Perhaps this needs to go elsewhere though, vs here?

Collaborator Author

I think we could intro this with a schematic using the reporting triangle and highlighting the "blocks" being computed in the matrix? I feel like there might have been a nice slide on this in one of the talks, or I can make one myself.

### Point estimate of the delay distribution

We use the entire reporting triangle to compute an empirical estimate of the delay distribution, $\pi(d)$, or the probability that a case at reference time $t$ appears in the dataset at time $t + d$. We will refer to the realized empirical estimate of the delay distribution from a reporting triangle as $\pi(d)$.
The delay distribution, $\pi(d)$, can be estimated directly from the completed reporting matrix $X$:
Collaborator

I am a big fan of always describing it in English as well as equations.

$$
\hat{\pi}_d = \frac{\sum_{t=1}^{t^*} X_{t,d}}{\sum_{d'=0}^{D} \sum_{t=1}^{t^*} X_{t,d'}}
$$

In the special case when the time the estimate is made, $t'$, is beyond the data release time $t^{*}$, such that $t' \ge t^* + D$, $\hat{\pi}_d$ can be computed directly by summing over all reference time points $t$ at each delay $d$.
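For this fully reported case, a minimal base R sketch of the calculation (illustrative only; the matrix `X` is a made-up example, not the package's interface):

```r
# Hypothetical complete reporting matrix: rows are reference times 1..t*,
# columns are delays 0..D, and every cell is fully observed.
X <- matrix(
  c(10, 5, 2,
     8, 4, 1,
    12, 6, 3),
  nrow = 3, byrow = TRUE
)

# Empirical delay distribution: sum each delay column over all reference
# times, then divide by the grand total (the double sum in the denominator).
pi_hat <- colSums(X) / sum(X)
pi_hat  # one probability per delay d = 0, ..., D; sums to 1
```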
Collaborator

Can you reword? I think you just mean that t prime is greater than the release time + the maximum delay, as the maths says, i.e. it is fully reported.

In the case where there are partial observations, in order to properly weight the denominator with the missing delays, we have to first impute the cases $\hat{x}_{t,d}$ for all instances where $t+d > t^*$. This amounts to computing the point nowcast from the partial reporting triangle.
Collaborator

In my head this justification should be in the next section as that is what it is about?

To do so, we start by defining $\theta_d$, which is the factor by which the cases on delay $d$ compare to the total cases through delay $d-1$, obtained from the $N$ preceding rows of the triangle. In practice, $N \ge D$, with any $N > D$ representing the number of completed observations used to inform the estimate:

$$
\hat{\theta}_d(t^*) = \frac{\sum_{i=1}^{N} x_{t^*-i+d, d}}{\sum_{d'=0}^{d-1} \sum_{i=1}^{N} x_{t^*-i+d, d'}}
$$
Collaborator

Notationally, all the indexes here feel a bit complex for what it is. I don't have any great thoughts on simplification though.

Collaborator Author

Yep, this was where I was struggling a bit last week; it's way more complex in math notation than in just looking at the code and the blocks of the matrix being summed over...

Collaborator

I was wondering if you could maybe redefine the sum?


*Note:* this amounts to taking the sum of the elements in column $d$ up until time $t^*-d$ and dividing by the sum over all the elements to the left of column $d$ up until time $t^*-d$, referred to as `block_top` and `block_top_left`, respectively, in the code.
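To make the block language concrete, here is a small base R sketch of those two sums (the triangle `rt` and the indexing are illustrative assumptions, not the package's implementation; the object names just mirror `block_top` and `block_top_left` from the text):

```r
# Hypothetical reporting triangle: rows are reference times, columns are
# delays 0..D; cells with reference time + delay > t* are unobserved (NA).
rt <- matrix(
  c(10,  5,  2,
     8,  4,  1,
    12,  6, NA,
     9, NA, NA),
  nrow = 4, byrow = TRUE
)

d     <- 2              # delay (column) for which theta_d is computed
col   <- d + 1          # matrix column index (delay 0 lives in column 1)
n_obs <- nrow(rt) - d   # rows observed through delay d, i.e. up to time t* - d

block_top      <- sum(rt[seq_len(n_obs), col])               # column d, rows up to t* - d
block_top_left <- sum(rt[seq_len(n_obs), seq_len(col - 1)])  # columns to the left, same rows
theta_hat_d    <- block_top / block_top_left
theta_hat_d
```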
Collaborator

Suggested change
*Note:* this amounts to taking the sum of the elements in column $d$ up until time $t^*-d$ and dividing by the sum over all the elements to the left of column $d$ up until time $t^*-d$, referred to as `block_top` and `block_top_left`, respectively, in the code.
This amounts to taking the sum of the elements in column $d$ up until time $t^*-d$ and dividing by the sum over all the elements to the left of column $d$ up until time $t^*-d$, referred to as `block_top` and `block_top_left`, respectively, in the code.

Then we can just copy and paste into the paper

Collaborator

I think this is a nice explanation but perhaps there is a slightly more human explanation?

Co-authored-by: Sam Abbott <contact@samabbott.co.uk>
@kaitejohnson
Collaborator Author

method 2: iteratively and retrospectively re-compute nowcasts using the same single point estimate of the delay distribution

Why is this permutation in here? Is it because we plan to support it or otherwise?

As discussed f2f, I think we want to nix this so I will edit accordingly.


github-actions bot commented Feb 20, 2025

Thank you for your contribution kaitejohnson 🚀! Your website is ready for download 👉 here 👈!
(The artifact expires on 2025-02-25T15:12:11Z. You can re-generate it by re-running the workflow here.)

Co-authored-by: Sam Abbott <contact@samabbott.co.uk>

We add a small number to the mean to avoid an ill-defined negative binomial. We note that to perform all these computations, data snapshots from at least $N + M$ past observations, or rows of the reporting triangle, are needed. This estimate of the uncertainty accounts for the empirical uncertainty in the point estimate of the delay distribution over time.
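As a sketch of the guard mentioned above (the offset, means, and dispersion value here are placeholders, not the package's defaults):

```r
# Hypothetical point-nowcast means and an estimated dispersion parameter.
mu_hat   <- c(0, 3.2, 7.5)  # a mean of exactly zero is the problematic case
disp_hat <- 5

# Add a small number to the mean so every negative binomial is well defined,
# then draw observation noise around the point nowcast.
eps   <- 0.1
draws <- rnbinom(length(mu_hat), size = disp_hat, mu = mu_hat + eps)
draws
```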

#### Uncertainty estimate via computing retrospective nowcasts from a single delay distribution, $\pi(d)$
Collaborator

Discussed face to face: we are dropping this.


#### Uncertainty estimate via iteratively re-estimating the delay distribution and computing retrospective nowcasts

The first method uses the retrospective incomplete reporting triangle to re-estimate a delay distribution from the $N$ preceding rows of the reporting triangle before $s^*$, and then uses it to recompute a retrospective nowcast, for $M$ realizations of the retrospective reporting triangle (so $M$ different $s^*$ values).
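A rough base R sketch of just the snapshot-construction part of this loop (the matrix and the helper are hypothetical, not the package's functions; re-estimating the delay and recomputing the nowcast for each snapshot would then follow as described above):

```r
# Hypothetical fully observed reporting matrix: rows are reference times,
# columns are delays 0..2.
X <- matrix(
  c(10, 5, 2,
     8, 4, 1,
    12, 6, 3,
     9, 5, 2),
  nrow = 4, byrow = TRUE
)

# Rebuild the triangle as it would have looked at an earlier snapshot time s*:
# keep only rows up to s* and blank out cells not yet reported (t + d > s*).
make_retro_triangle <- function(mat, s_star) {
  rt <- mat[seq_len(s_star), , drop = FALSE]
  for (t in seq_len(s_star)) {
    rt[t, (t + seq_len(ncol(rt)) - 1) > s_star] <- NA
  }
  rt
}

# The M = 2 most recent retrospective snapshots, with t* = nrow(X); each one
# would get its own re-estimated delay distribution and retrospective nowcast.
retro_triangles <- lapply(nrow(X) - 1:2, function(s) make_retro_triangle(X, s))
```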
Collaborator

As discussed face to face, we are going to generalise this so it can use any set of historic or retrospective point nowcasts, and then provide wrapper tools for more general cases.

@kaitejohnson kaitejohnson marked this pull request as draft February 21, 2025 11:21
Development

Successfully merging this pull request may close these issues.

Methods write-up
2 participants