-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Stability issues in the numerical integral solver #34
Comments
This may have some useful guidance: https://mc-stan.org/docs/stan-users-guide/one-dimensional-integrals.html From skimming it looks like the support for |
I tried implementing xc as suggested as went from having quadrature failures due to potential singularities to having the integral not be numerically solvable : https://github.com/epinowcast/primarycensoreddist/blob/2be97fe494da5306e079772d372dcf96d5da19a7/inst/stan/primary_censored_dist.stan#L129 I also moved d from being a parameter to being data (this had limited impact) Giving up on that angle and instead investigating reparameterising the integral in terms of the delay vs the censoring window |
I have now also explored solving the integral between min delay and max delay instead of over the primary event window. With this approach I find that without using xc I have tsinh based failures in the integral solver and with xc I have error estimate of the integral exceeds the tolerance for ever regardless of the tolerance. |
It looks like one of the main causes of failure s very large values of mu or very small values of sigma when initialising. Ideally integrate_1d would reject on error but it looks like it instead just errors and causes things to crash. If we had a try catch or similar here that would be perfect but as far as I am aware we don't |
Adding safety catches and dealing with some weird behaviour where bounds can be negative (despite being set not to be) I have got this to reliably initialise. The remaining challenge is that during sampling chains are very likely to family. See below for an example.
|
Current progress on this from a large run of 256 chains (243 failed all part way through sampling (i.e post warmup and initialisation).
Current work on this is on: |
Dropping the integration relative tolerance to 1e-2 (from 1e-6) I see a marked improvement in chain survival 47 failures out of 256 chains) with little impact on estimates (in this example)
Dropping it still further to 1e-1 all chains near instantly failed (i.e. in warmup). |
Bumping up to 600 chains I see:
Adding if (d_adj <= 1e-3) {
return(0);
} and repeating with 600 chains I see:
i.e. 25% failure rate roughly for both No apparent difference between estimates from both runs. |
Increasing coverage that uses the xc as follows (from 0.2):
I see (again over 600 chains):
which drops it down to a 22.5% failure rate. I'm a bit worried about this approach to detecting boundaries as the integral is not always going to be over pwindow length (as the lower bound is max(d - pwindow, 1e-8)) unfortunately it isn't trivial to pass in the actual length of the interval without annoying requirements on the user (as must be "data"). |
Repeating (again 600 chains) but dropping the boundary to 0.1*pwindow I see the following:
which is a 26% failure rate so tuning this is making some difference but not a huge amount. |
Removing truncation adjustment (so obviously incorrect estimates):
which is a 5.5% failure rate. This massive drop does suggest the interaction between the cdf and the obs time cdf is a major problem (which does of course make sense). You still see the same thing when changing the return to For interest I tried changing the return expression to not be on the log scale and that just causes everything to fail |
Things are now stable enough at initialisation that we can drop truncation of Something to note here is that failures are coming from both the upper (2) and lower bound (1) calls. I've repeated this a few times and it just seems random. Follow us tasks are to start permuting the dataset to see what impact that has and to try and understand what causes singularities (i.e infs) in the integral.
|
Now varying up the amount of truncation (D is the obs time) with 16 chains so results have high variance. We see as expected that we have fewer issues with less truncation but when we allow for very long delays the number of failures spikes. My guess is that this is because we have lots of very low probabilities in the tail and these are also unstable (but I would imagine unstable in a different way to the truncation (need to repeat the test without truncation in the model to see). Surprisingly to me based on investigation so far when you have very heavily truncated delays you get increasingly fewer delays with the failure hump being 8-12 days in this example. This breaks down at 2 though. This few chains means that this is all ver random D = 2: 87.5% |
I'm now experimenting with a mid point approximation when the cdf is very small or very big. This could have some weird issues with different primary event distributions. I think if it works an improved version would to choose the mid point based on the mode of the primary event distribution? |
Code currently on main (e800dfd) with 600 chains gives.
So a 15% failure rate. Failures are all post-initialisation It looks like most of the failures we are now seeing are of the following form (this was at 250 warm up steps):
Given I assume the lack of support for |
Not passing the now redundant Warning: 70 chain(s) finished unexpectedly!
The remaining chains had a mean execution time of 736.8 seconds.
Warning: The returned fit object will only read in results of successful chains. Please use read_cmdstan_csv() to read the results of the failed chains separately.Use the $output(chain_id) method for more output of the failed chains.
variable mean median sd mad q5 q95 rhat ess_bulk
lp__ -2134.46 -2134.15 1.02 0.73 -2136.49 -2133.50 1.00 196110
mu 1.53 1.53 0.05 0.05 1.46 1.62 1.00 164346
sigma 0.77 0.77 0.04 0.04 0.71 0.83 1.00 162250 A 12% failure rate |
Potentially we can use an alternative is trying to get a reject statement to work inside the integrand which so far hasn't worked out for me. If we had a try catch around the integral that also seems like it would prevent edge cases but as far as I am aware that is not something we can do in stan. I wonder if the ODE solver does have this though and therefore is worth trying out as a drop in replacement. |
variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
lp__ -2134.46 -2134.15 1.01 0.73 -2136.47 -2133.50 1.00 223425 267773
mu 1.53 1.53 0.05 0.05 1.46 1.62 1.00 191254 199843
sigma 0.77 0.77 0.04 0.04 0.71 0.83 1.00 189038 221605 with no apparent impact on the estimated values. |
In the fitting delay distributions with stan I am seeing numerical instability and with more complex models (i.e. varying windows etc) I am seeing complete failures.
None of these issues are apparent outside of stan which indicates it is a auto-diff issues. Some potential solutions to investigate:
The text was updated successfully, but these errors were encountered: