-
Notifications
You must be signed in to change notification settings - Fork 42
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[reconfigurator] Fix assert_planning_makes_no_changes() #7512
base: main
Are you sure you want to change the base?
Conversation
Despite its name, this test function didn't actually invoke the planner. Now it does, and this fixes up some tests that now fail (and changes the behavior of the planner in one case where the test failure arguably indicated some planner misbehavior).
This change is a followup from #7286 (comment). With these changes, that PR now passes that test with the planner invoked as intended. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice fix. Thanks again @jgallagher.
At least when this was written (and I don't know that it's changed), the behavior was:
Thus, there needs to be a generation of But the planner does not need to wait for time to be sync'd before creating an Why not have the planner check for time sync? There's a time-of-check-to-time-of-use race here: if the sled bounces between planning and execution, then even though the planner thought the sled was ready for more zones, it really wasn't. Hence, we need the check in Another approach here, which might be more feasible if/when we do the work to make
I thought this was correct for a more subtle reason than it sounds, but as I write it out, I think it might not quite be right. Suppose:
I can think of a few ways to mitigate this (I suspect there are others):
|
@@ -2388,14 +2361,6 @@ pub mod test { | |||
Some(SledState::Decommissioned), | |||
); | |||
|
|||
// Test a no-op planning iteration. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did we stop testing this here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test is checking some of the backwards-compatibility stuff in BlueprintBuilder
specifically, and doesn't use the planner at all. The final blueprint it spits out would need changes if run through the planner (specifically, to replace a bunch of services on a sled that the test decommissioned).
/// Identifies whether this a CockroachDB zone | ||
pub fn is_cockroach(&self) -> bool { | ||
matches!(self, BlueprintZoneType::CockroachDb(_)) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems fine but I don't see it used anywhere here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's used in a test:
!z.zone_type.is_cockroach(), |
To be clear, I don't think this PR makes the problem that I mentioned any worse. Going back to this:
Finding non-NTP zones on a sled means we know there was an NTP zone there at one point and time was sync'd at some point. Then one of a few things could happen:
|
We chatted about this offline - I think based on the changes in #5012, we're actually okay here - we'll start up the NTP zone first, fail to start the other zones, but keep trying (either because That said:
Given all of that, I think the changes on this PR are okay? |
Despite its name, this test function didn't actually invoke the planner. Now it does, and this fixes up some tests that now fail (and changes the behavior of the planner in one case where the test failure arguably indicated some planner misbehavior).
The specific planner change: Previously, if we added an NTP zone to a sled, we considered that sled ineligible for any other zones (consistent with the sled-agent behavior of refusing to start zones if time isn't sync'd yet). This PR relaxes that slightly: if we're adding an NTP zone to a sled and that sled already has other zones that needed time to be sync'd, we do consider it eligible. This avoids having to make multiple planner iterations to repopulate a sled that had a subset of disks/zones expunged, while maintaining the existing behavior for new sleds being added.
I'm not totally convinced that the existing behavior for new sleds is necessary; since we're only waiting for the presence of an NTP zone, not an inventory report that time is actually sync'd, I'm not sure there's a meaningful difference to requiring two planning iterations instead of one to place extra zones. But I don't have a reason to propose changing it or testing it other than "seems a little over-cautious", so it seems fine to leave it.