Windows: Add an extra restart onto the end of our buildkite-worker provisioner #102
Conversation
This is another attempt at fixing #42. I noticed that _all_ of our buildkite workers are now rebooting a few seconds after their first (pristine) boot, which makes me think that a reboot is still pending after the provisioning process completes. That pending reboot causes a race condition: the buildkite agent process starts, but then the residual reboot occurs and immediately kills the buildkite process.
We don't ever want our downstream images downloading Windows updates; we want the base image to be updated and then the downstream images rebuilt.
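The change here amounts to forcing one more reboot at the very end of provisioning, so that any restart still pending is consumed before the image is sealed and the agent ever starts. A minimal sketch of what that final step could look like (the script name and the exact point at which it runs are assumptions for illustration, not the actual provisioner code):

```powershell
# final-restart.ps1 (hypothetical name) -- assumed last step of the Windows provisioner.
# Force one more reboot so that any restart still pending from earlier provisioning
# steps happens now, rather than seconds after the worker's first "pristine" boot
# while the buildkite agent is already picking up jobs.
Write-Host "Provisioning complete; performing final restart to flush any pending reboot..."
Restart-Computer -Force
```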
So talking with @fredrikekre, I wonder if the shutdown that …
So this branch is actually already live (for testing purposes), so we can tell that it is at the very least insufficient, and at worst not helpful at all. What I have been able to discover is that something causes the workers to reboot immediately after bootup when taking a job. Because we reset the workers to a pristine state after every job, this reboot happens every time a worker gets reset. Note that a reboot does not power down the VM, and therefore does not exit the systemd service, and therefore does not trip our "destroy and reset to pristine state" handler; otherwise we'd get stuck in an infinite boot loop.

My attempts here were to get rid of whatever is causing the reboot to occur. I actually don't know why the VM is rebooting; I thought it might be some kind of residual Windows update, or an MS hotfix, or something similar, but so far I have been unable to determine the cause. I tried to catch the graphical console over remote X11 via …

The root of our issue is that, very rarely, the reboot kicks in slowly enough that buildkite is able to grab a job and start processing it, then gets killed by the reboot. Even worse, sometimes buildkite runs far enough to delete the secret key after decrypting a few things, which means that after the reboot, the next job doesn't have a secret key available to decrypt anything. I think if someone can help figure out why we're rebooting, that would allow us to solve this once and for all.
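One way to narrow down whether a residual pending reboot is the culprit would be to dump the usual pending-reboot indicators right before the agent starts taking jobs. A hedged sketch of such a diagnostic (the set of registry locations checked here is an assumption about what is relevant, not something confirmed on these workers):

```powershell
# check-pending-reboot.ps1 -- hypothetical diagnostic, not part of the provisioner.
# Reports the common registry indicators Windows uses to signal a pending reboot.
$indicators = @{
    "Component Based Servicing" = "HKLM:\SOFTWARE\Microsoft\Windows\CurrentVersion\Component Based Servicing\RebootPending"
    "Windows Update"            = "HKLM:\SOFTWARE\Microsoft\Windows\CurrentVersion\WindowsUpdate\Auto Update\RebootRequired"
}

foreach ($name in $indicators.Keys) {
    $pending = Test-Path $indicators[$name]
    Write-Host ("{0,-28} pending reboot: {1}" -f $name, $pending)
}

# PendingFileRenameOperations is a value rather than a key, so check it separately.
$sessionManager = Get-ItemProperty "HKLM:\SYSTEM\CurrentControlSet\Control\Session Manager" -ErrorAction SilentlyContinue
$renames = $null -ne $sessionManager.PendingFileRenameOperations
Write-Host ("{0,-28} pending reboot: {1}" -f "PendingFileRenameOperations", $renames)
```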
I'm copying the relevant log lines from JuliaCI/cryptic-buildkite-plugin#36 (comment) here as well:
So it looks like the culprit might still be Windows Update, and my attempts here to disable the auto-updates are still insufficient?
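For context, fully disabling automatic updates on Windows usually takes more than one setting, which may be why a single tweak is insufficient. Something along the following lines is typically involved; this is a sketch under the assumption that both the service and the group-policy value need to be handled, not a description of what the provisioner currently does:

```powershell
# disable-windows-update.ps1 -- hypothetical hardening step, not the provisioner's actual code.
# Stop and disable the Windows Update service so it cannot download updates or schedule reboots...
Stop-Service -Name wuauserv -Force -ErrorAction SilentlyContinue
Set-Service  -Name wuauserv -StartupType Disabled

# ...and set the group-policy registry value that turns off automatic updates.
$au = "HKLM:\SOFTWARE\Policies\Microsoft\Windows\WindowsUpdate\AU"
New-Item -Path $au -Force | Out-Null
Set-ItemProperty -Path $au -Name NoAutoUpdate -Value 1 -Type DWord
```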
Maybe use one of those patches or both: |
Possibly we could try recording the graphical session with https://github.com/JonathonReinhart/spice-record |