
[reconfigurator] Fix assert_planning_makes_no_changes() #7512

Open · wants to merge 1 commit into main
Conversation

jgallagher (Contributor)
Despite its name, this test function didn't actually invoke the planner. Now it does, and this fixes up some tests that now fail (and changes the behavior of the planner in one case where the test failure arguably indicated some planner misbehavior).

The specific planner change: Previously, if we added an NTP zone to a sled, we considered that sled ineligible for any other zones (consistent with the sled-agent behavior of refusing to start zones if time isn't sync'd yet). This PR relaxes that slightly: if we're adding an NTP zone to a sled and that sled already has other zones that needed time to be sync'd, we do consider it eligible. This avoids having to make multiple planner iterations to repopulate a sled that had a subset of disks/zones expunged, while maintaining the existing behavior for new sleds being added.
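For illustration, here is a rough sketch of the relaxed eligibility rule described above. The type and names below are made up for this example; they are not the actual planner code.

```rust
/// Illustrative stand-in for the planner's per-sled view; not a real type.
struct SledPlan {
    /// The sled already has an in-service NTP zone.
    has_ntp_zone: bool,
    /// The sled already hosts other zones that required timesync to be placed.
    has_other_zones: bool,
    /// This planning pass is adding a (new or replacement) NTP zone.
    adding_ntp_zone: bool,
}

fn eligible_for_more_zones(sled: &SledPlan) -> bool {
    if sled.has_ntp_zone {
        // An NTP zone is already in place; the sled can take more zones.
        return true;
    }
    if sled.adding_ntp_zone && sled.has_other_zones {
        // The relaxation from this PR: we're adding an NTP zone to a sled that
        // already ran timesync-dependent zones (e.g. its NTP zone was
        // expunged), so allow other zones in the same planning iteration.
        return true;
    }
    // A brand-new sled: place only the NTP zone this iteration and add the
    // rest on a later pass (existing behavior, unchanged).
    false
}
```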

I'm not totally convinced that the existing behavior for new sleds is necessary; since we're only waiting for the presence of an NTP zone, not an inventory report that time is actually sync'd, I'm not sure there's a meaningful difference between requiring two planning iterations and requiring one to place the extra zones. But I don't have a reason to propose changing or testing it other than "it seems a little over-cautious", so it seems fine to leave it.

jgallagher (Contributor, Author)
This change is a follow-up to #7286 (comment). With these changes, that PR now passes that test with the planner invoked as intended.

andrewjstone (Contributor) left a comment:
Nice fix. Thanks again @jgallagher.

davepacheco (Collaborator)
> I'm not totally convinced that the existing behavior for new sleds is necessary; since we're only waiting for the presence of an NTP zone, not an inventory report that time is actually sync'd,

At least when this was written (and I don't know that it's changed), the behavior was:

  • if time is not sync'd, PUT /omicron-zones refuses any request to PUT anything other than an NTP zone
  • (I believe) the intent was to guarantee that all non-NTP zones could assume time was sync'd before they started. Although CockroachDB is the only zone we explicitly talk about as requiring timesync (I think), I generally assume just about all software should be allowed to assume the clock is sync'd. (At least, we don't routinely test other components with old clocks or clocks that jump around by huge intervals.)

Thus, there needs to be a generation of OmicronZonesConfig (presumably generation 2) that only has the NTP zone and nothing else; otherwise, if generation 2 had other zones in it, we couldn't deploy any of them (including the NTP zone).
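A minimal sketch of the sled-agent check being described, using illustrative types (this is not the real omicron-sled-agent code):

```rust
/// Illustrative only: kinds of zones a config might contain.
enum ZoneKind {
    BoundaryNtp,
    InternalNtp,
    CockroachDb,
    Other,
}

/// Sketch of the gate on PUT /omicron-zones described above: while time is not
/// yet sync'd, only a config consisting solely of NTP zones is accepted.
fn check_put_omicron_zones(
    time_is_synced: bool,
    requested: &[ZoneKind],
) -> Result<(), String> {
    let only_ntp = requested
        .iter()
        .all(|k| matches!(k, ZoneKind::BoundaryNtp | ZoneKind::InternalNtp));
    if !time_is_synced && !only_ntp {
        // The blueprint executor simply retries this request later, once the
        // NTP zone is up and the clock has synchronized.
        return Err("time not yet synchronized; refusing non-NTP zones".to_string());
    }
    Ok(())
}
```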

But the planner does not need to wait for time to be sync'd before creating an OmicronZonesConfig that includes other zones because if time is not sync'd at execution-time, then Sled Agent will refuse the request. That's fine because it'll be retried on the next execution attempt. At runtime, what happens is exactly what we'd want: we provision the NTP zone right away and provision the rest when time is sync'd.


Why not have the planner check for time sync? There's a time-of-check-to-time-of-use race here: if the sled bounces between planning and execution, then even though the planner thought the sled was ready for more zones, it really wasn't. Hence, we need the check in PUT /omicron-zones anyway. And given that it's there, there's not much point in the planner also checking this. (You can have a TOCTOU the other way, too: thinking time is not sync'd, but it is sync'd by the time you would have gone and executed it. This wouldn't matter much, but this means the planner can be wrong in both directions, which makes it feel especially pointless to bother checking it there.)

Another approach here, which might be more feasible if/when we do the work to make PUT /omicron-zones async, is to have sled agent accept requests with non-NTP zones even when time is not sync'd, but ensure that it provisions NTP and waits for time sync before doing anything else. I think that'd be fine, but I gather it's a pretty big change to how that API works (and it's not obviously better in this regard).

> The specific planner change: Previously, if we added an NTP zone to a sled, we considered that sled ineligible for any other zones (consistent with the sled-agent behavior of refusing to start zones if time isn't sync'd yet). This PR relaxes that slightly: if we're adding an NTP zone to a sled and that sled already has other zones that needed time to be sync'd, we do consider it eligible.

I thought this was correct for a more subtle reason than it sounds, but as I write it out, I think it might not quite be right. Suppose:

  • sled is alive and well with NTP and, say, CockroachDB
  • it's okay if the sled reboots and loses timesync because Sled Agent knows enough to start NTP and wait for time sync before starting CockroachDB
  • NTP zone gets expunged, maybe due to a disk failure
  • sled reboots
  • on startup, I think we're toast:
    • sled agent cannot start any of its zones because it doesn't have an NTP zone with which to sync time
    • sled agent also cannot accept any requests that would add an NTP zone due to the check mentioned above

I can think of a few ways to mitigate this (I suspect there are others):

  1. If we expunge an NTP zone, we always place its replacement in the same blueprint step. This helps the case above where the zone gets expunged in the same operation that a new one gets placed. However, if the expungement doesn't happen at all, and if the disk really is dead, and the sled reboots, it won't be able to start NTP or anything else. Will the rest of the system be able to repair it? It depends on the exact conditions under which Sled Agent refuses to accept a PUT /omicron-zones request.
  2. If we expunge an NTP zone, we also expunge all the zones on that sled (or even the whole sled?!). This is obviously a big hammer and kind of sucks, but wouldn't matter that much if it weren't for all the Crucible data that we'd have to rewrite.
  3. Make the request async as mentioned above.
  4. Allow requests to PUT /omicron-zones with both NTP and other zones (maybe only if generation > 2?), but start NTP first and wait for timesync. I can't remember why we didn't do this before. One reason it could be bad: timesync can take a while, so the request itself could take a while. (A rough sketch of what this ordering might look like follows this list.)
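For option 4, here is a rough sketch of the implied ordering. Everything below is a stand-in for this example; it is not how sled-agent currently behaves.

```rust
/// Stand-in zone description for this sketch.
struct Zone {
    name: String,
    is_ntp: bool,
}

fn start_zone(z: &Zone) -> Result<(), String> {
    // Placeholder for actually configuring and booting the zone.
    println!("starting zone {}", z.name);
    Ok(())
}

fn wait_for_timesync() -> Result<(), String> {
    // Placeholder: poll the local NTP daemon until the clock is synchronized.
    // This is exactly the step that could make a synchronous PUT take a while.
    Ok(())
}

/// Option 4: accept NTP and non-NTP zones in one request, but start NTP first
/// and block on timesync before starting anything else.
fn apply_zone_config(zones: &[Zone]) -> Result<(), String> {
    for z in zones.iter().filter(|z| z.is_ntp) {
        start_zone(z)?;
    }
    wait_for_timesync()?;
    for z in zones.iter().filter(|z| !z.is_ntp) {
        start_zone(z)?;
    }
    Ok(())
}
```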

@@ -2388,14 +2361,6 @@ pub mod test {
Some(SledState::Decommissioned),
);

// Test a no-op planning iteration.
Collaborator:
Why did we stop testing this here?

Contributor (Author):
This test is checking some of the backwards-compatibility stuff in BlueprintBuilder specifically, and doesn't use the planner at all. The final blueprint it spits out would need changes if run through the planner (specifically, to replace a bunch of services on a sled that the test decommissioned).

Comment on lines +170 to +173
/// Identifies whether this is a CockroachDB zone
pub fn is_cockroach(&self) -> bool {
matches!(self, BlueprintZoneType::CockroachDb(_))
}
Collaborator:
This seems fine but I don't see it used anywhere here?

Contributor (Author):
It's used in a test:

(sorry for the dubious link; this line is in the diff, but I'm not sure how to convince GitHub to show that, so this just points to the line on a raw commit)
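For context, the usage looks roughly like the snippet below; only `is_cockroach` and `BlueprintZoneType` come from this diff, and the surrounding test plumbing is a stand-in, not the exact test from the PR.

```rust
// Illustrative only: counting CockroachDB zones with the new helper.
fn count_cockroach_zones(zones: &[BlueprintZoneType]) -> usize {
    zones.iter().filter(|z| z.is_cockroach()).count()
}

// A test might then assert the planner placed the expected number, e.g.:
// assert_eq!(count_cockroach_zones(&zones), COCKROACHDB_REDUNDANCY);
```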

davepacheco (Collaborator)
To be clear, I don't think this PR makes the problem that I mentioned any worse. Going back to this:

> The specific planner change: Previously, if we added an NTP zone to a sled, we considered that sled ineligible for any other zones (consistent with the sled-agent behavior of refusing to start zones if time isn't sync'd yet). This PR relaxes that slightly: if we're adding an NTP zone to a sled and that sled already has other zones that needed time to be sync'd, we do consider it eligible.

Finding non-NTP zones on a sled means we know there was an NTP zone there at one point and time was sync'd at some point. Then one of a few things could happen:

  1. The NTP zone is still there (whether or not time is sync'd, either now or when we execute this blueprint): we can ignore this case because this code path is explicitly for the case where there is no NTP zone there.
  2. The sled does not reboot before this blueprint step is successfully executed so time is still sync'd when we process the PUT /omicron-zones request: in that case, this code change is fine and has the intended effect of reducing planner laps.
  3. The sled does reboot (or otherwise lose time sync) before we execute this blueprint step: this is the case I mentioned in the previous comment. I believe the assumption underlying this change is broken and the code was broken before anyway.

jgallagher (Contributor, Author)

> The sled does reboot (or otherwise lose time sync) before we execute this blueprint step: this is the case I mentioned in the previous comment. I believe the assumption underlying this change is broken and the code was broken before anyway.

We chatted about this offline - I think based on the changes in #5012, we're actually okay here - we'll start up the NTP zone first, fail to start the other zones, but keep trying (either because load_services in a cold boot has a "retry until success" loop, or because of periodic PUT /omicron-zones requests from the blueprint executor).

That said:

  • Keeping explicit steps in the planner when setting up a new sled seems fine and clearer, even if not strictly required
  • The sled-agent side is still kind of confusing, and "make it async with a reconciler loop" does seem like a nicer place to be (eventually)

Given all of that, I think the changes on this PR are okay?
