[Fleet] Add retry logic to automatic agent upgrades #212744
base: main
Conversation
@@ -362,7 +434,8 @@ export class AutomaticAgentUpgradeTask {
    return (
      isAgentUpgradeable(agent) &&
      (agent.status !== 'updating' ||
        (AgentStatusKueryHelper.isStuckInUpdating(agent) &&
          (!agent.upgrade_attempts &&
One important piece of this logic is that we can't just pick up agents stuck in updating now, since we're explicitly handling retries. I'm allowing agents stuck in updating with no `upgrade_attempts` (i.e. that were not upgraded through this task) to be selected.
One thing that came to mind while implementing this is that we could probably make this task more performant by modifying the kuery for selecting candidate agents for upgrade: instead of allowing all `status:updating` agents, we could restrict it to only allow `status:updating AND NOT upgrade_attempts:* AND upgrade_details.target_version:{version} AND upgrade_details.state:UPG_FAILED`. The only difference would be that it wouldn't pick up agents on older versions without `upgrade_details` that are stuck in updating.
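For illustration, the narrower candidate kuery could look roughly like this; the function name is made up and the real fetcher may compose it differently:

```ts
// Sketch only: a narrower kuery for failed/stuck upgrades, per the comment above.
// Field names come from the discussion; this is not the task's actual code.
function buildFailedUpgradeRetryKuery(targetVersion: string): string {
  return [
    'status:updating',
    'NOT upgrade_attempts:*',
    `upgrade_details.target_version:${targetVersion}`,
    'upgrade_details.state:UPG_FAILED',
  ].join(' AND ');
}
```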
@juliaElastic Pinging you on this comment for your thoughts on it.
We could include those stuck agents without `upgrade_details` in a date range query, something like `upgrade_started_at < now-2h`.
One caveat with these agents is we don't have `target_version` on them, so we don't necessarily auto upgrade them to the same version as initially, but I think it's still better to try to upgrade them rather than keep them stuck.
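As a rough sketch, the extra clause for agents without upgrade details could look like this; the 2h cutoff mirrors the example above and the exact field syntax is an assumption, not the final query:

```ts
// Sketch only: also pick up long-stuck agents that report no upgrade_details at all.
const stuckWithoutDetailsKuery =
  'status:updating AND NOT upgrade_details:* AND upgrade_started_at < now-2h';
```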
Thanks for the suggestion, pushed a change in this direction.
@@ -42,7 +44,7 @@ export const TYPE = 'fleet:automatic-agent-upgrade-task';
 export const VERSION = '1.0.0';
 const TITLE = 'Fleet Automatic agent upgrades';
 const SCOPE = ['fleet'];
-const INTERVAL = '1h';
+const INTERVAL = '30m';
Changing this to the default min retry delay.
@elasticmachine merge upstream
Tested locally with horde agents, and getting this error, it looks like missing
I could get past it by adding the
}

private isAgentReadyForRetry(agent: Agent, agentPolicy: AgentPolicy) {
  if (!agent.upgrade_attempts) {
When testing this, it seems the `upgrade_attempts` array keeps being extended with more items even after max attempts are exceeded, and way faster than the retry delays. Not sure where the bug is; I'm testing with an agent doc that's manually updated to be in a failed upgrade state.
It seems if the agent is not picked up for retry, it gets upgraded in `findAndUpgradeCandidateAgents`. Probably we shouldn't upgrade there if `upgrade_attempts.length > 0`.
Thanks for reporting this. I'm currently troubleshooting with a real agent, and I think with the change in elastic/fleet-server#4528, `upgrade_attempts` gets cleared after a failed upgrade, which is not the intent. Let me troubleshoot this further and see if I can reproduce the issue you are seeing.
I've done more testing with real agents and a new approach in elastic/fleet-server#4528. Retrying seems to work for agents with upgrade details (it's hitting the `max retry attempts exceeded for agent` log as expected). Not sure if that's due to my last fixes or if I'm missing some flow.
There is however currently a gap for agents with no upgrade details: as you pointed out in #212744 (comment), the ES queries don't work as expected. Furthermore, the current fleet-server change relies on upgrade details, so retries don't work for these agents. I'm still looking at how to improve the queries, but I'd welcome some feedback on the approach.
> It seems if the agent is not picked up for retry, it gets upgraded in findAndUpgradeCandidateAgents. Probably we shouldn't upgrade there if upgrade_attempts.length > 0
Currently the logic does this (sketched below):
- count how many agents need to be on the target version
- subtract the number of agents already on it or updating to it
- subtract the number of agents marked for retry (with `upgrade_attempts` set)
- if this is still a positive number, fetch more agents with `findAndUpgradeCandidateAgents`
`findAndUpgradeCandidateAgents` should definitely not attempt to upgrade an agent that was marked for retry. In theory the `AND (NOT upgrade_attempts:*)` part of the agents fetcher kuery should prevent that (I fixed that). Do you have any concerns about this?
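A minimal sketch of that counting flow, with made-up names for illustration (the real logic lives in automatic_agent_upgrade_task.ts):

```ts
// Sketch only: decide how many extra candidates to fetch, per the steps above.
interface UpgradeCounts {
  requiredOnTargetVersion: number; // agents that should end up on the target version
  onOrUpdatingToTarget: number; // agents already on it, or currently updating to it
  markedForRetry: number; // agents with upgrade_attempts set
}

function remainingAgentsToUpgrade(counts: UpgradeCounts): number {
  const remaining =
    counts.requiredOnTargetVersion - counts.onOrUpdatingToTarget - counts.markedForRetry;
  // Only if this is still positive does the task fetch more agents with
  // findAndUpgradeCandidateAgents, whose kuery excludes retried agents
  // via `AND (NOT upgrade_attempts:*)`.
  return Math.max(remaining, 0);
}
```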
I think it's okay if we don't retry agents without upgrade details, just count them as updating.
The logic sounds good, we should be able to test that it works as expected.
Should we avoid setting `upgrade_attempts` on agents below 8.12.0 then? If we do that, they would naturally get retried by the task (with no limit to the number of attempts).
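For illustration, the guard could be as simple as the following; the semver-based check and the function name are assumptions, not the actual implementation:

```ts
import semver from 'semver';

// Sketch only: skip recording upgrade_attempts for agents that predate upgrade
// details (below 8.12.0), so the task keeps picking them up without an attempt limit.
const MIN_VERSION_WITH_UPGRADE_DETAILS = '8.12.0';

function shouldTrackUpgradeAttempts(agentVersion: string | undefined): boolean {
  return !!agentVersion && semver.gte(agentVersion, MIN_VERSION_WITH_UPGRADE_DETAILS);
}
```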
I think that sounds reasonable.
Pushed the change, along with a fix (the task was erroring if no agents were returned by the fetcher).
I've tested again with real agents below and above 8.12.0, as well as horde agents. I'm not seeing any issues, please let me know if you do.
@elasticmachine merge upstream
Pinging @elastic/fleet (Team:Fleet)
LGTM
Summary
Relates https://github.com/elastic/ingest-dev/issues/4720
This PR adds retry logic to the task that handles automatic agent upgrades originally implemented in #211019.
Complementary fleet-server change which sets the agent's `upgrade_attempts` to `null` once the upgrade is complete: elastic/fleet-server#4528
Approach
- An `upgrade_attempts` property is added to agents and stored in the agent doc (ES mapping update in [Fleet] Add `upgrade_attempts` to `.fleet-agents` index, elasticsearch#123256).
- When the task triggers an upgrade, the attempt is recorded in the agent's `upgrade_attempts`.
- Retry delays default to `['30m', '1h', '2h', '4h', '8h', '16h', '24h']` and can be overridden with the new `xpack.fleet.autoUpgrades.retryDelays` setting (sketched below).
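As a rough illustration of how the retry delays gate attempts; the helper names and the attempt ordering are assumptions, not the task's actual implementation:

```ts
// Sketch only: an agent is ready for retry once the delay for its next attempt
// has elapsed; after running out of delays, it is no longer retried.
const DEFAULT_RETRY_DELAYS = ['30m', '1h', '2h', '4h', '8h', '16h', '24h'];

function delayToMs(delay: string): number {
  // Defaults only use minutes and hours, so only those units are handled here.
  const value = parseInt(delay, 10);
  return delay.endsWith('h') ? value * 60 * 60 * 1000 : value * 60 * 1000;
}

function isReadyForRetry(
  upgradeAttempts: string[], // ISO timestamps, assumed most recent first
  retryDelays: string[] = DEFAULT_RETRY_DELAYS
): boolean {
  if (upgradeAttempts.length === 0) return true; // never attempted by the task
  if (upgradeAttempts.length > retryDelays.length) return false; // max attempts exceeded
  const delayMs = delayToMs(retryDelays[upgradeAttempts.length - 1]);
  const lastAttemptMs = new Date(upgradeAttempts[0]).getTime();
  return Date.now() - lastAttemptMs >= delayMs;
}
```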
Testing
The ES query for fetching agents with existing `upgrade_attempts` needs the updated mappings, so it might be necessary to pull the latest `main` in the `elasticsearch` repo and run `yarn es source` instead of `yarn es snapshot` (requires an up-to-date Java environment, currently 23).
In order to test that `upgrade_attempts` is set to `null` when the upgrade is complete, fleet-server should be run in dev using the change in elastic/fleet-server#4528.
Checklist
- `release_note:*` label is applied per the guidelines
Identify risks
Low probability risk of incorrectly triggering agent upgrades. This feature is currently behind the `enableAutomaticAgentUpgrades` feature flag.