This page is intended to provide teams with all the information they need to submit forecasts. All forecasts should be submitted directly to the data-processed/ folder. Data in this directory should be added to the repository through a pull request so that automatic data validation checks are run.
These instructions provide details about the data format as well as validation checks that you can run prior to submitting a pull request. In addition, we describe the metadata that each model should provide.
There are several different sources for death data. Currently, all forecasts will be compared to the daily reports containing death data from the JHU CSSE group as the gold standard reference data for deaths in the US. Note that there are significant differences (especially in daily incident death data) between the JHU data and another commonly used source, from the New York Times. The team at UTexas-Austin is tracking this issue on a separate GitHub repository.
Weekly incident data are the sum of daily incident data from Sunday through Saturday. Weekly cumulative data are the cumulative counts up to and including Saturday.
We may add additional sources of ground-truth data at a future time.
The automatic check validates both the filename and file contents to ensure the file can be used in the visualization and ensemble forecasting.
Each subdirectory within the data-processed/ directory has the format
team-model
where team is the team name and model is the name of your model. Both team and model should be less than 15 characters and should not include hyphens.
Within each subdirectory, there should be a metadata file, a license file (optional), and a set of forecasts.
The metadata file should have the following format
metadata-team-model.txt
The structure of the metadata file is described separately.
If you would like to include a license file, please use the following format
LICENSE.txt
If you are not using one of the standard licenses, then you must include a license file.
Each forecast file within the subdirectory should have the following format
YYYY-MM-DD-team-model.csv
where YYYY is the 4-digit year, MM is the 2-digit month, DD is the 2-digit day, team is the team name, and model is the name of your model.
The date YYYY-MM-DD is the forecast_date.
The team and model in this file must match the team and model of the directory this file is in. Both team and model should be less than 15 characters, alphanumeric characters and underscores only, with no spaces or hyphens.
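As a quick self-check, a minimal sketch like the following (using a hypothetical team and model name) verifies that a filename matches this pattern:
# Sketch of a filename check (not the official validation):
# YYYY-MM-DD-team-model.csv, with team and model limited to
# alphanumeric characters and underscores, fewer than 15 characters each
filename <- "2020-06-22-exampleteam-examplemodel.csv"   # hypothetical
pattern <- "^\\d{4}-\\d{2}-\\d{2}-[A-Za-z0-9_]{1,14}-[A-Za-z0-9_]{1,14}\\.csv$"
grepl(pattern, filename)
## [1] TRUE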
The file must be a comma-separated value (csv) file with the following columns (in any order):
- forecast_date
- target
- target_end_date
- location
- type
- quantile
- value
No additional columns are allowed.
Each row in the file is either a point or quantile forecast for a location on a particular date for a particular target.
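For illustration, here is a minimal sketch in R of two such rows, a point forecast and the corresponding median quantile, for a single hypothetical location and target (the numbers are placeholders):
example_rows <- data.frame(
  forecast_date   = "2020-06-22",
  target          = "1 wk ahead inc death",
  target_end_date = "2020-06-27",
  location        = "US",
  type            = c("point", "quantile"),
  quantile        = c(NA, 0.500),
  value           = c(4000, 4000)          # placeholder values
)
write.csv(example_rows, "2020-06-22-exampleteam-examplemodel.csv", row.names = FALSE)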
Values in the forecast_date column must be a date in the format YYYY-MM-DD. This is the date on which the submitted forecast was available. This will typically be the date on which the computation finishes running and produces the standard formatted file.
The forecast_date should correspond to, and be redundant with, the date in the filename, but is included here by request from some analysts.
We will enforce that the forecast_date for a file must be either the date on which the file was submitted to the repository or the previous day. Exceptions will be made for legitimate extenuating circumstances.
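As a rough sketch of this rule (not the official check), the forecast_date in a file can be compared against the submission date:
forecast_date   <- as.Date("2020-06-22")     # hypothetical value from the file
submission_date <- Sys.Date()                # date the pull request is opened
# must be TRUE: the forecast_date is the submission date or the previous day
forecast_date %in% c(submission_date, submission_date - 1)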
Values in the target column must be a character (string) and be one of the following specific targets:
- "N wk ahead cum death" where N is a number between 1 and 20
- "N wk ahead inc death" where N is a number between 1 and 20
- "N wk ahead inc case" where N is a number between 1 and 8
- "N day ahead inc hosp" where N is a number between 0 and 130
For county locations, the only target should be "N wk ahead inc case".
For week-ahead forecasts, we will use the specification of epidemiological weeks (EWs) defined by the US CDC, which run Sunday through Saturday. There are standard software packages to convert from dates to epidemic weeks and vice versa, e.g., MMWRweek for R, and pymmwr and epiweeks for Python.
We have created a csv file describing forecast collection dates and the dates to which forecasts refer.
For week-ahead forecasts with forecast_date of Sunday or Monday of EW12, a 1 week ahead forecast corresponds to EW12 and should have target_end_date of the Saturday of EW12. For week-ahead forecasts with forecast_date of Tuesday through Saturday of EW12, a 1 week ahead forecast corresponds to EW13 and should have target_end_date of the Saturday of EW13.
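A minimal sketch of this rule using the MMWRweek package (the forecast_date is a hypothetical example, and the year-end rollover of epiweeks is not handled):
library(MMWRweek)
forecast_date <- as.Date("2020-06-22")        # a Monday (hypothetical)
ew <- MMWRweek(forecast_date)                 # gives MMWRyear, MMWRweek, MMWRday
# Sunday (MMWRday 1) or Monday (MMWRday 2): 1 wk ahead is the current epiweek;
# Tuesday through Saturday: 1 wk ahead is the following epiweek
shift <- if (ew$MMWRday <= 2) 0 else 1
# target_end_date for a 1 wk ahead target is the Saturday (MMWRday 7) of that epiweek
MMWRweek2Date(ew$MMWRyear, ew$MMWRweek + shift, MMWRday = 7)
## [1] "2020-06-27"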
The "N wk ahead cum death" target is the cumulative number of deaths predicted by the model up to and including N weeks after forecast_date.
A week-ahead forecast should represent the cumulative number of deaths reported on the Saturday of a given epiweek.
Predictions for this target will be evaluated compared to the cumulative number of reported deaths, as recorded by JHU CSSE.
The "N wk ahead inc death" target is the incident (weekly) number of deaths predicted by the model during the week that is N weeks after forecast_date.
A week-ahead forecast should represent the total number of new deaths reported during a given epiweek (from Sunday through Saturday, inclusive).
Predictions for this target will be evaluated compared to the number of new reported deaths, as recorded by JHU CSSE.
The "N wk ahead inc case" target is the incident (weekly) number of cases predicted by the model during the week that is N weeks after forecast_date.
A week-ahead forecast should represent the total number of new cases reported during a given epiweek (from Sunday through Saturday, inclusive).
Predictions for this target will be evaluated compared to the number of new reported cases, as recorded by JHU CSSE.
The "N day ahead inc hosp" target is the number of new daily hospitalizations predicted by the model on day N after forecast_date.
As an example, for day-ahead forecasts with a forecast_date of a Monday, a 1 day ahead inc hosp forecast corresponds to the number of incident hospitalizations on Tuesday, a 2 day ahead forecast corresponds to Wednesday, and so on.
Currently there is no "gold standard" for hospitalization data.
On 2020-06-06, these targets were removed:
- N day ahead inc death
- N day ahead cum death
Values in the target_end_date column must be a date in the format YYYY-MM-DD. This is the date for the forecast target.
For "# day" targets, target_end_date will be # days after forecast_date. For "# wk" targets, target_end_date will be the Saturday at the end of the week time period.
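For example (a sketch with a hypothetical date), the target_end_date for a "2 day ahead inc hosp" target is simply the forecast_date plus two days:
forecast_date <- as.Date("2020-06-22")   # hypothetical
forecast_date + 2                        # target_end_date for "2 day ahead inc hosp"
## [1] "2020-06-24"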
Values in the location column must be one of the "locations" in this FIPS numeric code file, which includes numeric FIPS codes for U.S. states, counties, territories, and districts, as well as "US" for national forecasts.
Please note that FIPS codes should be written as character strings to preserve any leading zeroes.
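For example, reading a forecast file with the location column forced to character (the filename below is a hypothetical placeholder) keeps the leading zeroes intact:
# FIPS codes such as "06" (California) lose their leading zero if read as numeric
forecast <- read.csv("2020-06-22-exampleteam-examplemodel.csv",
                     colClasses = c(location = "character"))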
Values in the type column are either
- "point" or
- "quantile".
This value indicates whether that row corresponds to a point forecast or a quantile forecast. Point forecasts are used in visualization while quantile forecasts are used in visualization and in ensemble construction.
Forecasts must include exactly 1 "point" forecast for every location-target pair.
Values in the quantile column are either "NA" (if type is "point") or a quantile in the format 0.###.
For quantile forecasts, this value indicates the quantile for the value in this row.
Teams should provide the following 23 quantiles:
c(0.01, 0.025, seq(0.05, 0.95, by = 0.05), 0.975, 0.99)
## [1] 0.010 0.025 0.050 0.100 0.150 0.200 0.250 0.300 0.350 0.400 0.450 0.500 0.550 0.600 0.650 0.700 0.750
## [18] 0.800 0.850 0.900 0.950 0.975 0.990
for all targets except the "N wk ahead inc case" target.
For the "N wk ahead inc case" target, teams should provide the following 7 quantiles:
c(0.025, 0.100, 0.250, 0.500, 0.750, 0.900, 0.975)
## [1] 0.025 0.100 0.250 0.500 0.750 0.900 0.975
Values in the value column are non-negative numbers indicating the "point" or "quantile" prediction for this row.
For a "point" prediction, value is simply the value of that point prediction for the target and location associated with that row.
For a "quantile" prediction, value is the inverse of the cumulative distribution function (CDF) for the target, location, and quantile associated with that row.
An example inverse CDF is below.
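In the sketch that follows, the predictive distribution is a hypothetical log-normal, so the value reported at each quantile level is qlnorm(), its inverse CDF, evaluated at that level:
quantiles <- c(0.01, 0.025, seq(0.05, 0.95, by = 0.05), 0.975, 0.99)
# Hypothetical predictive distribution for one location-target pair;
# qlnorm() is the quantile function (inverse CDF) of a log-normal distribution
values <- qlnorm(quantiles, meanlog = log(4000), sdlog = 0.3)
round(qlnorm(0.5, meanlog = log(4000), sdlog = 0.3))  # the 0.5 quantile is the median
## [1] 4000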
To ensure proper data formatting, automatic checks are run on all pull requests that add new data to data-processed/.
When a pull request is submitted, the data are validated through Travis CI, which runs the tests in test_formatting.py. The intent of these tests is to validate the requirements above, which are specifically enumerated on the wiki. Please let us know if the wiki is inaccurate.
If the pull request fails, please follow these instructions for details on how to troubleshoot.
To run these checks locally rather than waiting for the results from a pull request, follow these instructions.
If you cannot get the python checks to run, you can use these instructions to run some checks in R. These checks are no longer maintained, but may still be of use to teams working with R.
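As an additional, unofficial sanity check in R (a sketch only, with a hypothetical filename; it does not replace the automated validation), teams can confirm that a file has the required columns, non-negative values, and the required quantile levels before opening a pull request:
required_cols <- c("forecast_date", "target", "target_end_date",
                   "location", "type", "quantile", "value")
death_quantiles <- c(0.01, 0.025, seq(0.05, 0.95, by = 0.05), 0.975, 0.99)
forecast <- read.csv("2020-06-22-exampleteam-examplemodel.csv",
                     colClasses = c(location = "character"))
stopifnot(all(required_cols %in% names(forecast)))
stopifnot(all(forecast$value >= 0, na.rm = TRUE))
# every quantile row for a death target should use one of the 23 required levels
death_rows <- forecast$type == "quantile" & grepl("death", forecast$target)
stopifnot(all(round(forecast$quantile[death_rows], 3) %in% round(death_quantiles, 3)))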
If you want to visualize your forecasts, you can use our R shiny app by running
source("explore_processed_data.R")
shinyApp(ui = ui, server = server)
from within the data-processed/ folder. This is mainly an internal tool we use to help us know what forecasts are in the repository. Thus, it is provided as-is with no warranty.