
[Fleet] Add retry logic to automatic agent upgrades #212744

Open
wants to merge 13 commits into main
Conversation

@jillguyonnet jillguyonnet commented Feb 28, 2025

Summary

Relates https://github.com/elastic/ingest-dev/issues/4720

This PR adds retry logic to the task that handles automatic agent upgrades originally implemented in #211019.

Complementary fleet-server change, which sets the agent's upgrade_attempts to null once the upgrade is complete: elastic/fleet-server#4528

Approach

  • A new upgrade_attempts property is added to agents and stored in the agent doc (ES mapping update in elasticsearch#123256, "[Fleet] Add upgrade_attempts to .fleet-agents index").
  • When a bulk upgrade action is sent from the automatic upgrade task, it pushes the timestamp of the upgrade to the affected agents' upgrade_attempts.
  • The default retry delays are ['30m', '1h', '2h', '4h', '8h', '16h', '24h'] and can be overridden with the new xpack.fleet.autoUpgrades.retryDelays setting.
  • On every run, the automatic upgrade task will first process retries and then query more agents if necessary (cf. https://github.com/elastic/ingest-dev/issues/4720#issuecomment-2671660795).
  • Once an agent has exhausted the maximum number of retries defined by the retry delays array, it is no longer retried.
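As a rough sketch of the retry schedule described above (function and parameter names here are illustrative, not the task's actual implementation):

```typescript
// Default retry delays from this PR; each entry is the wait before the
// corresponding retry attempt.
const DEFAULT_RETRY_DELAYS = ['30m', '1h', '2h', '4h', '8h', '16h', '24h'];

// Parse a delay like '30m' or '2h' into milliseconds.
function delayToMs(delay: string): number {
  const value = parseInt(delay.slice(0, -1), 10);
  const unitMs: Record<string, number> = { m: 60_000, h: 3_600_000, d: 86_400_000 };
  return value * unitMs[delay.slice(-1)];
}

// An agent is ready for retry if it has not exhausted the delays array and
// the delay for its current attempt has elapsed since the last attempt.
// upgradeAttempts holds ISO timestamps, newest first.
function isReadyForRetry(
  upgradeAttempts: string[],
  retryDelays: string[] = DEFAULT_RETRY_DELAYS,
  now: number = Date.now()
): boolean {
  if (upgradeAttempts.length === 0) return false; // not upgraded through this task
  if (upgradeAttempts.length > retryDelays.length) return false; // max retries exceeded
  const lastAttempt = new Date(upgradeAttempts[0]).getTime();
  const requiredDelay = delayToMs(retryDelays[upgradeAttempts.length - 1]);
  return now - lastAttempt >= requiredDelay;
}
```

With the default delays, an agent gets up to seven retries spaced increasingly far apart; once upgrade_attempts outgrows the delays array, the agent is dropped from retry processing.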

Testing

The ES query for fetching agents with existing upgrade_attempts needs the updated mappings, so it might be necessary to pull the latest main in the elasticsearch repo and run `yarn es source` instead of `yarn es snapshot` (this requires an up-to-date Java environment, currently Java 23).

To test that upgrade_attempts is set to null when the upgrade is complete, fleet-server should be run in dev mode with the change in elastic/fleet-server#4528.

Checklist

  • Unit or functional tests were updated or added to match the most common scenarios
  • The PR description includes the appropriate Release Notes section, and the correct release_note:* label is applied per the guidelines

Identify risks

Low probability risk of incorrectly triggering agent upgrades. This feature is currently behind the enableAutomaticAgentUpgrades feature flag.

@jillguyonnet self-assigned this on Feb 28, 2025


```ts
@@ -362,7 +434,8 @@ export class AutomaticAgentUpgradeTask {
    return (
      isAgentUpgradeable(agent) &&
      (agent.status !== 'updating' ||
        (AgentStatusKueryHelper.isStuckInUpdating(agent) &&
          (!agent.upgrade_attempts &&
```
jillguyonnet (Contributor Author):

One important piece of this logic is that we can't just pick up agents stuck in updating now since we're explicitly handling retries. I'm allowing agents stuck in updating with no upgrade_attempts (i.e. that were not upgraded through this task) to be selected.

One thing that came to mind while implementing this is that we could probably make this task more performant by modifying the kuery for selecting candidate agents for upgrade: instead of allowing all status:updating agents, we could restrict it to only allow status:updating AND NOT upgrade_attempts:* AND upgrade_details.target_version:{version} AND upgrade_details.state:UPG_FAILED. The only difference would be that it wouldn't pick up agents on older versions without upgrade_details that are stuck in updating.
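For illustration, the restricted kuery suggested here could be built like this (hypothetical helper; the field names are taken from this comment, not from the task code):

```typescript
// Hypothetical builder for the restricted candidate-agent kuery suggested
// above: only failed upgrades to the target version, with no retry history.
function buildRetryAwareKuery(targetVersion: string): string {
  return [
    'status:updating',
    'NOT upgrade_attempts:*',
    `upgrade_details.target_version:${targetVersion}`,
    'upgrade_details.state:UPG_FAILED',
  ].join(' AND ');
}
```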

jillguyonnet (Contributor Author):

@juliaElastic Pinging you on this comment for your thoughts on it.

Contributor:

We could include those stuck agents without upgrade_details in a date range query, something like upgrade_started_at < now-2h.
One caveat with these agents is we don't have target_version on them, so we don't necessarily auto upgrade them to the same version as initially, but I think it's still better to try to upgrade them rather than keep them stuck.
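As a sketch, such a fallback clause might look like the following (hypothetical helper; the now-2h cutoff is just the example from this comment):

```typescript
// Hypothetical kuery fragment for picking up stuck agents that lack
// upgrade_details, using upgrade_started_at as a fallback signal.
function stuckWithoutDetailsKuery(cutoff: string = 'now-2h'): string {
  return `status:updating AND NOT upgrade_details:* AND upgrade_started_at < ${cutoff}`;
}
```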

jillguyonnet (Contributor Author) commented Feb 28, 2025:

Thanks for the suggestion, pushed a change in this direction.

@jillguyonnet force-pushed the fleet/4720-automatic-upgrades-retry branch from 57a093b to f066052 on February 28, 2025 10:44

@jillguyonnet force-pushed the fleet/4720-automatic-upgrades-retry branch from f066052 to 835b23d on February 28, 2025 10:45

@jillguyonnet force-pushed the fleet/4720-automatic-upgrades-retry branch from 835b23d to 5b2c811 on February 28, 2025 10:46

@jillguyonnet force-pushed the fleet/4720-automatic-upgrades-retry branch from 5b2c811 to 28582f7 on February 28, 2025 10:48

```diff
@@ -42,7 +44,7 @@ export const TYPE = 'fleet:automatic-agent-upgrade-task';
 export const VERSION = '1.0.0';
 const TITLE = 'Fleet Automatic agent upgrades';
 const SCOPE = ['fleet'];
-const INTERVAL = '1h';
+const INTERVAL = '30m';
```
jillguyonnet (Contributor Author):

Changing this to the default min retry delay.

@jillguyonnet added the Team:Fleet, release_note:skip, backport:skip and v9.1.0 labels on Feb 28, 2025
jillguyonnet (Contributor Author):

@elasticmachine merge upstream


juliaElastic commented Mar 4, 2025

Tested locally with horde agents and got this error; it looks like missing upgrade_details doesn't work well with the query.

```
[2025-03-04T13:09:05.797+01:00][ERROR][plugins.fleet.fleet:automatic-agent-upgrade-task:1.0.0] [AutomaticAgentUpgradeTask] Error: ResponseError: search_phase_execution_exception
        Root causes:
                query_shard_exception: failed to create query: [nested] failed to find nested object under path [upgrade_details.target_version]
```

I could get past it by adding the nestedIgnoreUnmapped parameter here:
https://github.com/elastic/kibana/blob/main/x-pack/platform/plugins/shared/fleet/server/services/agents/crud.ts#L414

```ts
const query = kueryNode
  ? { query: toElasticsearchQuery(kueryNode, undefined, { nestedIgnoreUnmapped: true }) }
  : {};
```

```ts
  }

  private isAgentReadyForRetry(agent: Agent, agentPolicy: AgentPolicy) {
    if (!agent.upgrade_attempts) {
```
juliaElastic (Contributor) commented Mar 4, 2025:

When testing this, it seems the upgrade_attempts array keeps being extended with more items even after max attempts are exceeded, and much faster than the retry delays allow.
Not sure where the bug is; I'm testing with an agent doc that's manually updated to be in a failed upgrade state.

It seems if the agent is not picked up for retry, it gets upgraded in findAndUpgradeCandidateAgents. Probably we shouldn't upgrade there if upgrade_attempts.length > 0

jillguyonnet (Contributor Author):

Thanks for reporting this. I'm currently troubleshooting with a real agent and I think with the change in elastic/fleet-server#4528 upgrade_attempts gets cleared after a failed upgrade, which is not the intent. Let me troubleshoot this further and see if I can reproduce the issue you are seeing.

jillguyonnet (Contributor Author) commented Mar 6, 2025:

I've done more testing with real agents and a new approach in elastic/fleet-server#4528. Retrying seems to work for agents with upgrade details (it's hitting the `max retry attempts exceeded for agent` log as expected). I'm not sure if that's due to my latest fixes or if I'm missing some flow.

There is however currently a gap for agents with no upgrade details: as you pointed out in #212744 (comment), the ES queries don't work as expected. Furthermore, the current fleet-server change relies on upgrade details, so retries don't work for these agents. I'm still looking at how to improve queries, but I'd welcome some feedback on the approach.

> It seems if the agent is not picked up for retry, it gets upgraded in findAndUpgradeCandidateAgents. Probably we shouldn't upgrade there if upgrade_attempts.length > 0

Currently the logic does this:

  • count how many agents need to be on the target version
  • subtract number of agents already on it or updating to it
  • subtract number of agents marked for retry (with upgrade_attempts set)
  • if this is still a positive number, fetch more agents with findAndUpgradeCandidateAgents

findAndUpgradeCandidateAgents should definitely not attempt to upgrade an agent that was marked for retry. In theory the AND (NOT upgrade_attempts:*) part of the agents fetcher kuery should prevent that (I fixed that). Do you have any concerns about this?
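The counting steps above can be sketched as follows (illustrative types and names; the real task derives these counts from ES queries):

```typescript
// Rough sketch of the capacity calculation described above.
interface VersionCounts {
  target: number;                // how many agents should be on the target version
  alreadyOnOrUpdatingTo: number; // agents already on the target version or updating to it
  markedForRetry: number;        // agents with upgrade_attempts set
}

// Number of additional agents findAndUpgradeCandidateAgents should fetch;
// clamped at zero so retries never push the count negative.
function remainingUpgradeCapacity(c: VersionCounts): number {
  return Math.max(0, c.target - c.alreadyOnOrUpdatingTo - c.markedForRetry);
}
```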

Contributor:

I think it's okay if we don't retry agents without upgrade details, just count them as updating.
The logic sounds good, we should be able to test that it works as expected.

jillguyonnet (Contributor Author):

Should we avoid setting upgrade_attempts on agents below 8.12.0 then? If we do that, they would naturally get retried by the task (with no limit to the number of attempts).

Contributor:

I think that sounds reasonable.

jillguyonnet (Contributor Author):

Pushed the change, along with a fix (the task was erroring if no agents were returned by the fetcher).

I've tested again with real agents below and above 8.12.0, as well as horde agents. I'm not seeing any issues, please let me know if you do.

jillguyonnet (Contributor Author):

@elasticmachine merge upstream

@jillguyonnet marked this pull request as ready for review on March 6, 2025 15:29 and requested a review from a team as a code owner.
elasticmachine (Contributor):

Pinging @elastic/fleet (Team:Fleet)

elasticmachine (Contributor) commented Mar 6, 2025:

💔 Build Failed


cc @jillguyonnet

juliaElastic (Contributor) left a comment:

LGTM
