
Improved how anomalous floating-point calculations are handled in the particle filter resamplers. #238

Merged
Merged 1 commit into main on Feb 27, 2025

Conversation

JeffreyCovington
Contributor

Fixed the handling of two floating-point issues that can occur in the particle filter during likelihood calculations. The behavior in these cases now matches the mathematical formulation of the particle filter. These issues are particularly relevant to multi-node simulations.

  1. Not a number (NaN) floating-point values in the data (sometimes used by ADRIOs to indicate missing or invalid data) are now omitted from the joint likelihood calculation when some, but not all, nodes have missing data. This also prevents NaN log-weight values from being passed directly into the resampler.
  2. Additional fail-safes were added to the ResamplingByNode resampler to catch and properly handle underflow-related issues. This brings the underflow handling of the ResamplingByNode resampler in line with the default resampler.
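The two behaviors above can be sketched as follows. This is a hypothetical illustration, not epymorph's actual code: the function names `joint_log_likelihood` and `normalize_weights` are made up for the example, which assumes per-node likelihoods are combined in log space.

```python
import numpy as np

def joint_log_likelihood(node_log_liks):
    """Combine per-node log-likelihoods, omitting nodes with NaN
    (missing/invalid) data. If every node is missing, return 0.0
    (log 1), i.e. a neutral weight, so no NaN reaches the resampler."""
    node_log_liks = np.asarray(node_log_liks, dtype=np.float64)
    valid = ~np.isnan(node_log_liks)
    if not valid.any():
        return 0.0  # no observed data anywhere: neutral weight
    return float(node_log_liks[valid].sum())

def normalize_weights(log_weights):
    """Turn log-weights into normalized weights with an underflow
    fail-safe: if all weights underflow to zero (or are non-finite),
    fall back to uniform weights rather than dividing by zero."""
    log_weights = np.asarray(log_weights, dtype=np.float64)
    # subtract the max log-weight for numerical stability
    shifted = log_weights - np.max(log_weights)
    w = np.exp(shifted)
    total = w.sum()
    if total == 0.0 or not np.isfinite(total):
        return np.full_like(w, 1.0 / len(w))  # underflow fail-safe
    return w / total
```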

… filter resamplers. Added additional fail-safes to the `ResamplingByNode` resampler to avoid underflow issues.
@JavadocMD
Contributor

No complaints with the changes as written. I'm considering the implications for the ADRIO interface. Relying on case counts to be np.nan when missing implies the observational data is typed np.float64, since np.int64 doesn't support nan. This isn't hugely problematic, but feels messy -- it seems tidier for ADRIOs to return the data type that best matches their data -- e.g., integers for integer data, as many of the CDC sources are in truth.

This is why I'm leaning towards ADRIOs returning masked arrays for missing/invalid values -- masks work the same way across all types.

That said, I think we can actually meet in the middle if working with nans makes better sense for the PF computation. This might involve doing some early processing on the observation data input, say circa particle_filter.py#L173:

```python
data = np.ma.asarray(cases, dtype=np.float64).filled(np.nan)
```

In some quick testing: cases can be a scalar, a Python list, a regular numpy array, or a masked numpy array; in each case, data comes out as a regular numpy array of floats, possibly containing nans.
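A quick sketch of that coercion, wrapped in a helper for illustration (the name `coerce_cases` is made up; the one-liner itself is the suggestion above):

```python
import numpy as np

def coerce_cases(cases):
    """Coerce any of the accepted input forms (scalar, list, ndarray,
    masked array) to a plain float64 ndarray, with masked/missing
    entries represented as NaN."""
    return np.ma.asarray(cases, dtype=np.float64).filled(np.nan)

# A masked array's masked entries become NaN:
masked = np.ma.masked_array([1, 2, 3], mask=[0, 1, 0])
print(coerce_cases(masked))  # [1. nan 3.]

# A plain list (or scalar, or regular ndarray) passes through as floats:
print(coerce_cases([4, 5]))  # [4. 5.]
```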

What do you think of that approach?

@JeffreyCovington
Contributor Author

I consider this PR to be a bug fix: the particle filter should have had this behavior (handling NaN and multi-node underflow) to begin with, given that the examples using the PF were aimed at the CDC ADRIOs.

In general, working with np.nan values is not ideal for the reasons you mentioned. I envision that the PF will eventually be refactored to use a masked-array approach internally (whether np.ma or something custom) and will support both np.int64 and np.float64 data.

Once the ADRIO interface is refactored, your suggested fix will work well until the PF refactor can be implemented.

@JeffreyCovington JeffreyCovington merged commit 9b147cc into main Feb 27, 2025
1 check passed
@JeffreyCovington JeffreyCovington deleted the pf-floating-point-errors branch February 27, 2025 01:07

Successfully merging this pull request may close these issues.

Improve how the particle filter handles observations in multi-node simulations