
Windows: Add an extra restart onto the end of our buildkite-worker provisioner #102


Open
wants to merge 5 commits into
base: main

Conversation

staticfloat
Member

This is another attempt at fixing #42.

I noticed that all of our buildkite workers are now rebooting a few
seconds after their first (pristine) boot, which makes me think that
there's a pending reboot after the provisioning process is complete,
which is causing a race condition where the buildkite agent process
starts, but then this residual reboot occurs and immediately kills the
buildkite process.

We don't ever want our downstream images to download Windows updates
themselves; we want the base image to be updated, and then the
downstream images rebuilt from it.
@gbaraldi
Member

So, talking with @fredrikekre, I wonder if the shutdown that nssm calls does not return a success exit code, in which case nssm doesn't exit. I wonder, then, if we should instead shut down a couple of seconds later, or spawn the shutdown in a separate process so that our CMD doesn't get killed and nssm doesn't think it needs to restart it. We could also increase the restart timer. What do you think?
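
To make the "shut down a couple of seconds later, from a separate process" idea concrete, here is a minimal Python sketch. It assumes the restart is issued through the stock Windows `shutdown` command with its `/r` (restart) and `/t` (delay in seconds) switches; `build_restart_command` and `spawn_detached_restart` are hypothetical helper names, not anything that exists in our provisioner, and whether this actually changes how nssm reacts is untested.

```python
import subprocess
import sys


def build_restart_command(delay_seconds: int) -> list:
    """Return the argv for a delayed Windows restart via shutdown.exe."""
    if delay_seconds < 0:
        raise ValueError("delay_seconds must be non-negative")
    # /r requests a restart, /t N delays it by N seconds.
    return ["shutdown", "/r", "/t", str(delay_seconds)]


def spawn_detached_restart(delay_seconds: int = 5) -> subprocess.Popen:
    """Launch the delayed restart in a process detached from the caller.

    Assumption: detaching is enough to let the calling script exit cleanly
    before the restart begins. This has not been verified against nssm.
    """
    if sys.platform != "win32":
        raise RuntimeError("this sketch only applies to Windows hosts")
    # These flags exist only in the Windows build of the subprocess module.
    flags = subprocess.DETACHED_PROCESS | subprocess.CREATE_NEW_PROCESS_GROUP
    return subprocess.Popen(
        build_restart_command(delay_seconds),
        creationflags=flags,
        close_fds=True,
    )
```

The command construction is kept separate from the spawn so that the delay can be tuned, and inspected, without actually restarting anything.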

@staticfloat
Member Author

So this branch is actually already live (for testing purposes) so we can tell that it's at the very least insufficient, at worst not helpful at all.

What I have been able to discover is that something causes the workers to reboot immediately after bootup, right as they are taking a job. Because we reset the workers to a pristine state after every job, this reboot happens every time a worker gets reset. Note that a reboot does not power down the VM, so it does not exit the systemd service and does not trip our "destroy and reset to pristine state" handler; if it did, we'd get stuck in an infinite boot loop.

My attempts here were to try and get rid of whatever is causing the reboot to occur. I actually don't know why the VM is rebooting; I thought it might be some kind of residual Windows Update, or an MS hotfix or something, but so far I have been unable to determine the cause. I tried to catch the graphical console over remote X11 via virt-manager to see if there's a window that pops up or anything, but it reboots so fast the graphical console doesn't connect by the time it reboots.

The root of our issue is that, very rarely, the reboot is slow enough to kick in that buildkite is able to grab a job and start processing it before being killed by the reboot. Even worse, sometimes buildkite runs far enough to delete the secret key after decrypting a few things, which means that after the reboot, the next job doesn't have a secret key available to decrypt anything.

I think if someone can help to figure out why we're rebooting, that would allow us to solve this once and for all.
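
If the residual reboot turns out to be a pending Windows servicing reboot (an assumption; nothing above confirms it), one way to narrow the race would be for the agent launcher to check for a pending reboot before accepting a job. Below is a minimal Python sketch that looks for two registry keys commonly used to signal a pending reboot. `hklm_key_exists` and `reboot_pending` are hypothetical helper names, and the key list is illustrative rather than exhaustive.

```python
import sys

# Registry keys under HKEY_LOCAL_MACHINE whose mere existence is commonly
# taken to mean "a reboot is pending". Assumption: this list is not exhaustive.
PENDING_REBOOT_KEYS = (
    r"SOFTWARE\Microsoft\Windows\CurrentVersion\WindowsUpdate\Auto Update\RebootRequired",
    r"SOFTWARE\Microsoft\Windows\CurrentVersion\Component Based Servicing\RebootPending",
)


def hklm_key_exists(subkey: str) -> bool:
    """Return True if the given HKLM subkey exists (always False off Windows)."""
    if sys.platform != "win32":
        return False
    import winreg  # Windows-only standard library module

    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, subkey):
            return True
    except OSError:
        # Covers "key not found" as well as access errors.
        return False


def reboot_pending(key_exists=hklm_key_exists, keys=PENDING_REBOOT_KEYS) -> bool:
    """Return True if any of the known pending-reboot markers is present."""
    return any(key_exists(key) for key in keys)
```

The `key_exists` parameter is injectable so the logic can be exercised on a non-Windows machine. A launcher could poll `reboot_pending()` and hold off starting buildkite while it returns `True`, though that only helps if the reboot is actually announced through one of these keys.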

@staticfloat
Member Author

I'm copying the relevant log lines from JuliaCI/cryptic-buildkite-plugin#36 (comment) here as well:

5/22/2025 5:22:24 AM 1074 Information      The process C:\Windows\system32\winlogon.exe (WIN2K22-AMDCI6-) has initiated the restart of computer WIN2K22-AMDCI6- on behalf of user
                                           NT AUTHORITY\SYSTEM for the following reason: Operating System: Upgrade (Planned)
                                            Reason Code: 0x80020003
                                            Shutdown Type: restart
                                            Comment:

So it looks like the culprit might still be Windows Update, and my attempts to disable the auto-updates here are still insufficient?
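
For anyone collecting more of these entries, here is a small Python sketch that pulls the interesting fields out of the event 1074 message text above and decodes the reason code. It assumes the usual Windows shutdown reason layout (top bit as the "planned" flag, major reason in the upper 16 bits, minor reason in the lower 16 bits), which matches the "Operating System: Upgrade (Planned)" text in the log. `parse_event_1074` is a hypothetical helper name.

```python
import re

# Message text copied from the log entry above, with the wrapped line rejoined.
EVENT_TEXT = r"""The process C:\Windows\system32\winlogon.exe (WIN2K22-AMDCI6-) has initiated the restart of computer WIN2K22-AMDCI6- on behalf of user NT AUTHORITY\SYSTEM for the following reason: Operating System: Upgrade (Planned)
 Reason Code: 0x80020003
 Shutdown Type: restart
 Comment:"""


def parse_event_1074(message: str) -> dict:
    """Extract process, reason, reason code and shutdown type from the text."""
    fields = {}
    m = re.search(r"The process (.+?) \([^)]*\) has initiated", message)
    if m:
        fields["process"] = m.group(1)
    m = re.search(r"for the following reason:\s*(.+)", message)
    if m:
        fields["reason"] = m.group(1).strip()
    m = re.search(r"Reason Code:\s*(0x[0-9A-Fa-f]+)", message)
    if m:
        code = int(m.group(1), 16)
        fields["reason_code"] = code
        # Assumed layout: flag bit, then major and minor reason fields.
        fields["planned"] = bool(code & 0x80000000)
        fields["major"] = (code >> 16) & 0xFF
        fields["minor"] = code & 0xFFFF
    m = re.search(r"Shutdown Type:\s*(\w+)", message)
    if m:
        fields["shutdown_type"] = m.group(1)
    return fields


info = parse_event_1074(EVENT_TEXT)
print(info["process"], hex(info["reason_code"]), info["shutdown_type"])
```

Running the same parser over every 1074 entry on a worker would show whether the reboots always come from `winlogon.exe` with the same reason code, or whether there is more than one source.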

@inkydragon
Contributor

> my attempts to disable the auto-updates here are still insufficient?

Maybe use one of those patches or both:

@Keno
Member

Keno commented May 26, 2025

> I tried to catch the graphical console over remote X11 via virt-manager to see if there's a window that pops up or anything, but it reboots so fast the graphical console doesn't connect by the time it reboots.

Possibly we could try recording the graphical session with https://github.com/JonathonReinhart/spice-record
