
[reconfigurator] Fix assert_planning_makes_no_changes() #7512

Open · wants to merge 1 commit into main
Conversation

jgallagher (Contributor)
Despite its name, this test function didn't actually invoke the planner. Now it does, and this fixes up some tests that now fail (and changes the behavior of the planner in one case where the test failure arguably indicated some planner misbehavior).

The specific planner change: Previously, if we added an NTP zone to a sled, we considered that sled ineligible for any other zones (consistent with the sled-agent behavior of refusing to start zones if time isn't sync'd yet). This PR relaxes that slightly: if we're adding an NTP zone to a sled and that sled already has other zones that needed time to be sync'd, we do consider it eligible. This avoids having to make multiple planner iterations to repopulate a sled that had a subset of disks/zones expunged, while maintaining the existing behavior for new sleds being added.
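For illustration, here is a rough sketch of the relaxed eligibility rule described above. The type and names below are made up for this example; they are not the actual planner code.

```rust
/// Illustrative stand-in for the planner's per-sled view; not a real type.
struct SledPlan {
    /// The sled already has an in-service NTP zone.
    has_ntp_zone: bool,
    /// The sled already hosts other zones that required timesync to be placed.
    has_other_zones: bool,
    /// This planning pass is adding a (new or replacement) NTP zone.
    adding_ntp_zone: bool,
}

fn eligible_for_more_zones(sled: &SledPlan) -> bool {
    if sled.has_ntp_zone {
        // An NTP zone is already in place; the sled can take more zones.
        return true;
    }
    if sled.adding_ntp_zone && sled.has_other_zones {
        // The relaxation from this PR: we're adding an NTP zone to a sled that
        // already ran timesync-dependent zones (e.g. its NTP zone was
        // expunged), so allow other zones in the same planning iteration.
        return true;
    }
    // A brand-new sled: place only the NTP zone this iteration and add the
    // rest on a later pass (existing behavior, unchanged).
    false
}
```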

I'm not totally convinced that the existing behavior for new sleds is necessary; since we're only waiting for the presence of an NTP zone, not an inventory report that time is actually sync'd, I'm not sure there's a meaningful difference between requiring two planning iterations and requiring one to place the extra zones. But I don't have a reason to propose changing or testing it other than "it seems a little over-cautious", so it seems fine to leave it.

jgallagher (Contributor, Author)
This change is a follow-up to #7286 (comment). With these changes, that PR now passes that test with the planner invoked as intended.

andrewjstone (Contributor) left a comment:
Nice fix. Thanks again @jgallagher.

davepacheco (Collaborator)
> I'm not totally convinced that the existing behavior for new sleds is necessary; since we're only waiting for the presence of an NTP zone, not an inventory report that time is actually sync'd,

At least when this was written (and I don't know that it's changed), the behavior was:

  • if time is not sync'd, PUT /omicron-zones refuses any request to PUT anything other than an NTP zone
  • (I believe) the intent was to guarantee that all non-NTP zones could assume time was sync'd before they started. Although CockroachDB is the only zone we explicitly talk about as requiring timesync (I think), I generally assume just about all software should be allowed to assume the clock is sync'd. (At least, we don't routinely test other components with old clocks or clocks that jump around by huge intervals.)

Thus, there needs to be a generation of OmicronZonesConfig (presumably generation 2) that only has the NTP zone and nothing else; otherwise, if generation 2 had other zones in it, we couldn't deploy any of them (including the NTP zone).
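A minimal sketch of the sled-agent check being described, using illustrative types (this is not the real omicron-sled-agent code):

```rust
/// Illustrative only: kinds of zones a config might contain.
enum ZoneKind {
    BoundaryNtp,
    InternalNtp,
    CockroachDb,
    Other,
}

/// Sketch of the gate on PUT /omicron-zones described above: while time is not
/// yet sync'd, only a config consisting solely of NTP zones is accepted.
fn check_put_omicron_zones(
    time_is_synced: bool,
    requested: &[ZoneKind],
) -> Result<(), String> {
    let only_ntp = requested
        .iter()
        .all(|k| matches!(k, ZoneKind::BoundaryNtp | ZoneKind::InternalNtp));
    if !time_is_synced && !only_ntp {
        // The blueprint executor simply retries this request later, once the
        // NTP zone is up and the clock has synchronized.
        return Err("time not yet synchronized; refusing non-NTP zones".to_string());
    }
    Ok(())
}
```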

But the planner does not need to wait for time to be sync'd before creating an OmicronZonesConfig that includes other zones because if time is not sync'd at execution-time, then Sled Agent will refuse the request. That's fine because it'll be retried on the next execution attempt. At runtime, what happens is exactly what we'd want: we provision the NTP zone right away and provision the rest when time is sync'd.


Why not have the planner check for time sync? There's a time-of-check-to-time-of-use race here: if the sled bounces between planning and execution, then even though the planner thought the sled was ready for more zones, it really wasn't. Hence, we need the check in PUT /omicron-zones anyway. And given that it's there, there's not much point in the planner also checking this. (You can have a TOCTOU the other way, too: thinking time is not sync'd, but it is sync'd by the time you would have gone and executed it. This wouldn't matter much, but this means the planner can be wrong in both directions, which makes it feel especially pointless to bother checking it there.)

Another approach here, which might be more feasible if/when we do the work to make PUT /omicron-zones async, is to have sled agent accept requests with non-NTP zones even when time is not sync'd, but ensure that it provisions NTP and waits for time sync before doing anything else. I think that'd be fine, but I gather it's a pretty big change to how that API works (and it's not obviously better in this regard).

> The specific planner change: Previously, if we added an NTP zone to a sled, we considered that sled ineligible for any other zones (consistent with the sled-agent behavior of refusing to start zones if time isn't sync'd yet). This PR relaxes that slightly: if we're adding an NTP zone to a sled and that sled already has other zones that needed time to be sync'd, we do consider it eligible.

I thought this was correct for a more subtle reason than it sounds, but as I write it out, I think it might not quite be right. Suppose:

  • sled is alive and well with NTP and, say, CockroachDB
  • it's okay if the sled reboots and loses timesync because Sled Agent knows enough to start NTP and wait for time sync before starting CockroachDB
  • NTP zone gets expunged, maybe due to a disk failure
  • sled reboots
  • on startup, I think we're toast:
    • sled agent cannot start any of its zones because it doesn't have an NTP zone with which to sync time
    • sled agent also cannot accept any requests that would add an NTP zone due to the check mentioned above

I can think of a few ways to mitigate this (I suspect there are others):

  1. If we expunge an NTP zone, we always place its replacement in the same blueprint step. This helps the case above where the zone gets expunged in the same operation that a new one gets placed. However, if the expungement doesn't happen at all, and if the disk really is dead, and the sled reboots, it won't be able to start NTP or anything else. Will the rest of the system be able to repair it? It depends on the exact conditions under which Sled Agent refuses to accept a PUT /omicron-zones request.
  2. If we expunge an NTP zone, we also expunge all the zones on that sled (or even the whole sled?!). This is obviously a big hammer and kind of sucks, but wouldn't matter that much if it weren't for all the Crucible data that we'd have to rewrite.
  3. Make the request async as mentioned above.
  4. Allow requests to PUT /omicron-zones with both NTP and other zones (maybe only if generation > 2?), but start NTP first and wait for timesync. I can't remember why we didn't do this before. One reason it could be bad: timesync can take a while, so the request itself could take a while. (A rough sketch of what this ordering might look like follows this list.)
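For option 4, here is a rough sketch of the implied ordering. Everything below is a stand-in for this example; it is not how sled-agent currently behaves.

```rust
/// Stand-in zone description for this sketch.
struct Zone {
    name: String,
    is_ntp: bool,
}

fn start_zone(z: &Zone) -> Result<(), String> {
    // Placeholder for actually configuring and booting the zone.
    println!("starting zone {}", z.name);
    Ok(())
}

fn wait_for_timesync() -> Result<(), String> {
    // Placeholder: poll the local NTP daemon until the clock is synchronized.
    // This is exactly the step that could make a synchronous PUT take a while.
    Ok(())
}

/// Option 4: accept NTP and non-NTP zones in one request, but start NTP first
/// and block on timesync before starting anything else.
fn apply_zone_config(zones: &[Zone]) -> Result<(), String> {
    for z in zones.iter().filter(|z| z.is_ntp) {
        start_zone(z)?;
    }
    wait_for_timesync()?;
    for z in zones.iter().filter(|z| !z.is_ntp) {
        start_zone(z)?;
    }
    Ok(())
}
```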

@@ -2388,14 +2361,6 @@ pub mod test {
Some(SledState::Decommissioned),
);

// Test a no-op planning iteration.
Collaborator:
Why did we stop testing this here?

Contributor (Author):
This test is checking some of the backwards-compatibility stuff in BlueprintBuilder specifically, and doesn't use the planner at all. The final blueprint it spits out would need changes if run through the planner (specifically, to replace a bunch of services on a sled that the test decommissioned).

Comment on lines +170 to +173
/// Identifies whether this is a CockroachDB zone
pub fn is_cockroach(&self) -> bool {
matches!(self, BlueprintZoneType::CockroachDb(_))
}
Collaborator:
This seems fine but I don't see it used anywhere here?

Contributor (Author):
It's used in a test:

(sorry for the dubious link; this line is in the diff, but I'm not sure how to convince GitHub to show that, so this just points to the line on a raw commit)
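For context, the usage looks roughly like the snippet below; only `is_cockroach` and `BlueprintZoneType` come from this diff, and the surrounding test plumbing is a stand-in, not the exact test from the PR.

```rust
// Illustrative only: counting CockroachDB zones with the new helper.
fn count_cockroach_zones(zones: &[BlueprintZoneType]) -> usize {
    zones.iter().filter(|z| z.is_cockroach()).count()
}

// A test might then assert the planner placed the expected number, e.g.:
// assert_eq!(count_cockroach_zones(&zones), COCKROACHDB_REDUNDANCY);
```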

davepacheco (Collaborator)
To be clear, I don't think this PR makes the problem that I mentioned any worse. Going back to this:

> The specific planner change: Previously, if we added an NTP zone to a sled, we considered that sled ineligible for any other zones (consistent with the sled-agent behavior of refusing to start zones if time isn't sync'd yet). This PR relaxes that slightly: if we're adding an NTP zone to a sled and that sled already has other zones that needed time to be sync'd, we do consider it eligible.

Finding non-NTP zones on a sled means we know there was an NTP zone there at one point and time was sync'd at some point. Then one of a few things could happen:

  1. The NTP zone is still there (whether or not time is sync'd, either now or when we execute this blueprint): we can ignore this case because this code path is explicitly for the case where there is no NTP zone there.
  2. The sled does not reboot before this blueprint step is successfully executed so time is still sync'd when we process the PUT /omicron-zones request: in that case, this code change is fine and has the intended effect of reducing planner laps.
  3. The sled does reboot (or otherwise lose time sync) before we execute this blueprint step: this is the case I mentioned in the previous comment. I believe the assumption underlying this change is broken and the code was broken before anyway.

jgallagher (Contributor, Author)

> The sled does reboot (or otherwise lose time sync) before we execute this blueprint step: this is the case I mentioned in the previous comment. I believe the assumption underlying this change is broken and the code was broken before anyway.

We chatted about this offline - I think based on the changes in #5012, we're actually okay here - we'll start up the NTP zone first, fail to start the other zones, but keep trying (either because load_services in a cold boot has a "retry until success" loop, or because of periodic PUT /omicron-zones requests from the blueprint executor).

That said:

  • Keeping explicit steps in the planner when setting up a new sled seems fine and clearer, even if not strictly required
  • The sled-agent side is still kind of confusing, and "make it async with a reconciler loop" does seem like a nicer place to be (eventually)

Given all of that, I think the changes on this PR are okay?
