
[Fleet] Add retry logic to automatic agent upgrades #212744

Open
wants to merge 13 commits into main
Conversation

@jillguyonnet jillguyonnet commented Feb 28, 2025

Summary

Relates https://github.com/elastic/ingest-dev/issues/4720

This PR adds retry logic to the task that handles automatic agent upgrades originally implemented in #211019.

Complementary fleet-server change, which sets the agent's upgrade_attempts to null once the upgrade is complete: elastic/fleet-server#4528

Approach

  • A new upgrade_attempts property is added to agents and stored in the agent doc (ES mapping update in elasticsearch#123256, "[Fleet] Add upgrade_attempts to .fleet-agents index").
  • When a bulk upgrade action is sent from the automatic upgrade task, it pushes the timestamp of the upgrade to the affected agents' upgrade_attempts.
  • The default retry delays are ['30m', '1h', '2h', '4h', '8h', '16h', '24h'] and can be overridden with the new xpack.fleet.autoUpgrades.retryDelays setting.
  • On every run, the automatic upgrade task will first process retries and then query more agents if necessary (cf. https://github.com/elastic/ingest-dev/issues/4720#issuecomment-2671660795).
  • Once an agent has exhausted the maximum number of retries defined by the retry delays array, it is no longer retried.
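As a rough sketch of the retry schedule described above (function and parameter names here are illustrative, not the task's actual implementation):

```typescript
// Default retry delays from this PR; each entry is the wait before the
// corresponding retry attempt.
const DEFAULT_RETRY_DELAYS = ['30m', '1h', '2h', '4h', '8h', '16h', '24h'];

// Parse a delay like '30m' or '2h' into milliseconds.
function delayToMs(delay: string): number {
  const value = parseInt(delay.slice(0, -1), 10);
  const unitMs: Record<string, number> = { m: 60_000, h: 3_600_000, d: 86_400_000 };
  return value * unitMs[delay.slice(-1)];
}

// An agent is ready for retry if it has not exhausted the delays array and
// the delay for its current attempt has elapsed since the last attempt.
// upgradeAttempts holds ISO timestamps, newest first.
function isReadyForRetry(
  upgradeAttempts: string[],
  retryDelays: string[] = DEFAULT_RETRY_DELAYS,
  now: number = Date.now()
): boolean {
  if (upgradeAttempts.length === 0) return false; // not upgraded through this task
  if (upgradeAttempts.length > retryDelays.length) return false; // max retries exceeded
  const lastAttempt = new Date(upgradeAttempts[0]).getTime();
  const requiredDelay = delayToMs(retryDelays[upgradeAttempts.length - 1]);
  return now - lastAttempt >= requiredDelay;
}
```

With the default delays, an agent gets up to seven retries spaced increasingly far apart; once upgrade_attempts outgrows the delays array, the agent is dropped from retry processing.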

Testing

The ES query for fetching agents with existing upgrade_attempts needs the updated mappings, so it might be necessary to pull the latest main in the elasticsearch repo and run `yarn es source` instead of `yarn es snapshot` (this requires an up-to-date Java environment, currently Java 23).

To test that upgrade_attempts is set to null when the upgrade is complete, fleet-server should be run in dev mode with the change in elastic/fleet-server#4528.

Checklist

  • Unit or functional tests were updated or added to match the most common scenarios
  • The PR description includes the appropriate Release Notes section, and the correct release_note:* label is applied per the guidelines

Identify risks

Low probability risk of incorrectly triggering agent upgrades. This feature is currently behind the enableAutomaticAgentUpgrades feature flag.

@jillguyonnet self-assigned this on Feb 28, 2025


```ts
@@ -362,7 +434,8 @@ export class AutomaticAgentUpgradeTask {
    return (
      isAgentUpgradeable(agent) &&
      (agent.status !== 'updating' ||
        (AgentStatusKueryHelper.isStuckInUpdating(agent) &&
          (!agent.upgrade_attempts &&
```
jillguyonnet (Contributor Author):

One important piece of this logic is that we can't just pick up agents stuck in updating now since we're explicitly handling retries. I'm allowing agents stuck in updating with no upgrade_attempts (i.e. that were not upgraded through this task) to be selected.

One thing that came to mind while implementing this is that we could probably make this task more performant by modifying the kuery for selecting candidate agents for upgrade: instead of allowing all status:updating agents, we could restrict it to only allow status:updating AND NOT upgrade_attempts:* AND upgrade_details.target_version:{version} AND upgrade_details.state:UPG_FAILED. The only difference would be that it wouldn't pick up agents on older versions without upgrade_details that are stuck in updating.
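For illustration, the restricted kuery suggested here could be built like this (hypothetical helper; the field names are taken from this comment, not from the task code):

```typescript
// Hypothetical builder for the restricted candidate-agent kuery suggested
// above: only failed upgrades to the target version, with no retry history.
function buildRetryAwareKuery(targetVersion: string): string {
  return [
    'status:updating',
    'NOT upgrade_attempts:*',
    `upgrade_details.target_version:${targetVersion}`,
    'upgrade_details.state:UPG_FAILED',
  ].join(' AND ');
}
```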

jillguyonnet (Contributor Author):

@juliaElastic Pinging you on this comment for your thoughts on it.

Contributor:

We could include those stuck agents without upgrade_details in a date range query, something like upgrade_started_at < now-2h.
One caveat with these agents is we don't have target_version on them, so we don't necessarily auto upgrade them to the same version as initially, but I think it's still better to try to upgrade them rather than keep them stuck.
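As a sketch, such a fallback clause might look like the following (hypothetical helper; the now-2h cutoff is just the example from this comment):

```typescript
// Hypothetical kuery fragment for picking up stuck agents that lack
// upgrade_details, using upgrade_started_at as a fallback signal.
function stuckWithoutDetailsKuery(cutoff: string = 'now-2h'): string {
  return `status:updating AND NOT upgrade_details:* AND upgrade_started_at < ${cutoff}`;
}
```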

jillguyonnet (Contributor Author) commented Feb 28, 2025:

Thanks for the suggestion, pushed a change in this direction.

@jillguyonnet force-pushed the fleet/4720-automatic-upgrades-retry branch from 57a093b to f066052 on February 28, 2025 10:44

@jillguyonnet force-pushed the fleet/4720-automatic-upgrades-retry branch from f066052 to 835b23d on February 28, 2025 10:45

@jillguyonnet force-pushed the fleet/4720-automatic-upgrades-retry branch from 835b23d to 5b2c811 on February 28, 2025 10:46

@jillguyonnet force-pushed the fleet/4720-automatic-upgrades-retry branch from 5b2c811 to 28582f7 on February 28, 2025 10:48

```diff
@@ -42,7 +44,7 @@ export const TYPE = 'fleet:automatic-agent-upgrade-task';
 export const VERSION = '1.0.0';
 const TITLE = 'Fleet Automatic agent upgrades';
 const SCOPE = ['fleet'];
-const INTERVAL = '1h';
+const INTERVAL = '30m';
```
jillguyonnet (Contributor Author):

Changing this to the default min retry delay.

@jillguyonnet added the Team:Fleet, release_note:skip, backport:skip and v9.1.0 labels on Feb 28, 2025
jillguyonnet (Contributor Author):

@elasticmachine merge upstream


juliaElastic commented Mar 4, 2025

Tested locally with horde agents and got this error; it looks like missing upgrade_details doesn't work well with the query.

```
[2025-03-04T13:09:05.797+01:00][ERROR][plugins.fleet.fleet:automatic-agent-upgrade-task:1.0.0] [AutomaticAgentUpgradeTask] Error: ResponseError: search_phase_execution_exception
        Root causes:
                query_shard_exception: failed to create query: [nested] failed to find nested object under path [upgrade_details.target_version]
```

I could get past it by adding the nestedIgnoreUnmapped parameter here:
https://github.com/elastic/kibana/blob/main/x-pack/platform/plugins/shared/fleet/server/services/agents/crud.ts#L414

```ts
const query = kueryNode
  ? { query: toElasticsearchQuery(kueryNode, undefined, { nestedIgnoreUnmapped: true }) }
  : {};
```

```ts
  }

  private isAgentReadyForRetry(agent: Agent, agentPolicy: AgentPolicy) {
    if (!agent.upgrade_attempts) {
```
juliaElastic (Contributor) commented Mar 4, 2025:

When testing this, it seems the upgrade_attempts array keeps being extended with more items even after max attempts are exceeded, and much faster than the retry delays allow.
Not sure where the bug is; I'm testing with an agent doc that's manually updated to be in a failed upgrade state.

It seems if the agent is not picked up for retry, it gets upgraded in findAndUpgradeCandidateAgents. Probably we shouldn't upgrade there if upgrade_attempts.length > 0

jillguyonnet (Contributor Author):

Thanks for reporting this. I'm currently troubleshooting with a real agent and I think with the change in elastic/fleet-server#4528 upgrade_attempts gets cleared after a failed upgrade, which is not the intent. Let me troubleshoot this further and see if I can reproduce the issue you are seeing.

jillguyonnet (Contributor Author) commented Mar 6, 2025:

I've done more testing with real agents and a new approach in elastic/fleet-server#4528. Retrying seems to work for agents with upgrade details (it's hitting the `max retry attempts exceeded for agent` log as expected). I'm not sure if that's due to my latest fixes or if I'm missing some flow.

There is however currently a gap for agents with no upgrade details: as you pointed out in #212744 (comment), the ES queries don't work as expected. Furthermore, the current fleet-server change relies on upgrade details, so retries don't work for these agents. I'm still looking at how to improve queries, but I'd welcome some feedback on the approach.

> It seems if the agent is not picked up for retry, it gets upgraded in findAndUpgradeCandidateAgents. Probably we shouldn't upgrade there if upgrade_attempts.length > 0

Currently the logic does this:

  • count how many agents need to be on the target version
  • subtract number of agents already on it or updating to it
  • subtract number of agents marked for retry (with upgrade_attempts set)
  • if this is still a positive number, fetch more agents with findAndUpgradeCandidateAgents

findAndUpgradeCandidateAgents should definitely not attempt to upgrade an agent that was marked for retry. In theory the AND (NOT upgrade_attempts:*) part of the agents fetcher kuery should prevent that (I fixed that). Do you have any concerns about this?
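The counting steps above can be sketched as follows (illustrative types and names; the real task derives these counts from ES queries):

```typescript
// Rough sketch of the capacity calculation described above.
interface VersionCounts {
  target: number;                // how many agents should be on the target version
  alreadyOnOrUpdatingTo: number; // agents already on the target version or updating to it
  markedForRetry: number;        // agents with upgrade_attempts set
}

// Number of additional agents findAndUpgradeCandidateAgents should fetch;
// clamped at zero so retries never push the count negative.
function remainingUpgradeCapacity(c: VersionCounts): number {
  return Math.max(0, c.target - c.alreadyOnOrUpdatingTo - c.markedForRetry);
}
```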

Contributor:

I think it's okay if we don't retry agents without upgrade details, just count them as updating.
The logic sounds good, we should be able to test that it works as expected.

jillguyonnet (Contributor Author):

Should we avoid setting upgrade_attempts on agents below 8.12.0 then? If we do that, they would naturally get retried by the task (with no limit to the number of attempts).

Contributor:

I think that sounds reasonable.

jillguyonnet (Contributor Author):

Pushed the change, along with a fix (the task was erroring if no agents were returned by the fetcher).

I've tested again with real agents below and above 8.12.0, as well as horde agents. I'm not seeing any issues, please let me know if you do.

jillguyonnet (Contributor Author):

@elasticmachine merge upstream

@jillguyonnet marked this pull request as ready for review on March 6, 2025 15:29 and requested a review from a team as a code owner.
elasticmachine (Contributor):

Pinging @elastic/fleet (Team:Fleet)

elasticmachine (Contributor) commented Mar 6, 2025:

💔 Build Failed


cc @jillguyonnet

juliaElastic (Contributor) left a comment:

LGTM
