Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add additional_parameters to diagnostics action #3333

Merged
merged 5 commits into from
Mar 12, 2024

Conversation

michel-laterman
Copy link
Contributor

What is the problem this PR solves?

Allow fleet-server to specify if a diagnostics bundle should contain additional CPU metrics.

How does this PR solve the problem?

Add an optional parameter to the RequestDiagnostics action. This parameter is a list of strings (enum) that has a single value "CPU". If this parameter is specified, the agent should collect a CPU trace.

Design Checklist

  • I have ensured my design is stateless and will work when multiple fleet-server instances are behind a load balancer.
  • I have or intend to scale test my changes, ensuring it will work reliably with 100K+ agents connected.
  • I have included fail safe mechanisms to limit the load on fleet-server: rate limiting, circuit breakers, caching, load shedding, etc.

Checklist

  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool

Related issues

@michel-laterman michel-laterman added enhancement New feature or request Team:Fleet Label for the Fleet team labels Mar 11, 2024
@michel-laterman michel-laterman requested a review from a team as a code owner March 11, 2024 09:32
Copy link
Contributor

@juliaElastic juliaElastic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM
Is there an elastic-agent change needed to make use of the CPU option?

@michel-laterman
Copy link
Contributor Author

The agent change is elastic/elastic-agent#4394

@michel-laterman
Copy link
Contributor Author

I'm going to hold off on merging this until I can get an e2e test working

@michel-laterman
Copy link
Contributor Author

michel-laterman commented Mar 11, 2024

I managed to do an e2e test.

I started 8.14.0-SNAPSHOT builds locally then enrolled a new fleet-server running under the elastic agent with changes from elastic/elastic-agent#4394

First I validated that the request diagnostics button on the UI works as intended (changes in the eefb64c commit were needed for that).

Then I created a user with permissions to index documents in .fleet-actions and created a new document with:

POST .fleet-actions-7/_doc
{
  "@timestamp": "2024-03-11T16:36:08.791Z",
  "expiration": "2024-03-11T16:39:08.791Z",
  "agents": [
    "717e2602-3901-4ca2-b310-7d488a7631b0"
  ],
  "action_id": "cpu-tester",
  "type": "REQUEST_DIAGNOSTICS",
  "data": {
    "additional_metrics": [
      "CPU"
    ]
  }
}

The resulting diagnostics bundle was available through the UI, and it contains cpu.pprof files

 elastic-agent-diagnostics-2024-03-11T16-36-29Z-00|⇒ ls -la
total 2120
drwxr-xr-x@ 21 laterman  staff     672 11 Mar 17:38 .
drwx------@ 47 laterman  staff    1504 11 Mar 17:37 ..
-rw-r--r--@  1 laterman  staff    6148 11 Mar 17:38 .DS_Store
-rwxr-xr-x@  1 laterman  staff   32297 11 Mar 17:36 allocs.pprof.gz
-rwxr-xr-x@  1 laterman  staff     591 11 Mar 17:36 block.pprof.gz
drwxr-xr-x@  8 laterman  staff     256 11 Mar 17:37 components
-rwxr-xr-x@  1 laterman  staff  455938 11 Mar 17:36 components-actual.yaml
-rwxr-xr-x@  1 laterman  staff  455938 11 Mar 17:36 components-expected.yaml
-rwxr-xr-x@  1 laterman  staff   16510 11 Mar 17:36 computed-config.yaml
-rwxr-xr-x@  1 laterman  staff    2718 11 Mar 17:36 cpu.pprof
-rwxr-xr-x@  1 laterman  staff    7199 11 Mar 17:36 goroutine.pprof.gz
-rwxr-xr-x@  1 laterman  staff   32224 11 Mar 17:36 heap.pprof.gz
-rwxr-xr-x@  1 laterman  staff    6029 11 Mar 17:36 local-config.yaml
drwxr-xr-x@  3 laterman  staff      96 11 Mar 17:37 logs
-rwxr-xr-x@  1 laterman  staff     592 11 Mar 17:36 mutex.pprof.gz
-rwxr-xr-x@  1 laterman  staff       7 11 Mar 17:36 package.version
-rwxr-xr-x@  1 laterman  staff   17677 11 Mar 17:36 pre-config.yaml
-rwxr-xr-x@  1 laterman  staff    5872 11 Mar 17:36 state.yaml
-rwxr-xr-x@  1 laterman  staff     748 11 Mar 17:36 threadcreate.pprof.gz
-rwxr-xr-x@  1 laterman  staff    1226 11 Mar 17:36 variables.yaml
-rwxr-xr-x@  1 laterman  staff     113 11 Mar 17:36 version.txt

elastic-agent-diagnostics-2024-03-11T16-36-29Z-00|⇒ ls -la components/fleet-server-default
total 144
drwxr-xr-x@ 12 laterman  staff    384 11 Mar 17:37 .
drwxr-xr-x@  8 laterman  staff    256 11 Mar 17:37 ..
-rwxr-xr-x@  1 laterman  staff  19046 11 Mar 17:37 allocs.pprof.gz
-rwxr-xr-x@  1 laterman  staff    395 11 Mar 17:37 block.pprof.gz
-rwxr-xr-x@  1 laterman  staff   3139 11 Mar 17:37 cpu.pprof
drwxr-xr-x@  2 laterman  staff     64 11 Mar 17:37 fleet-server-default
drwxr-xr-x@  2 laterman  staff     64 11 Mar 17:37 fleet-server-fleet_server-d314b1f8-223b-4212-8bd4-1b6183179d20
-rwxr-xr-x@  1 laterman  staff   7855 11 Mar 17:37 fleet-server.yml
-rwxr-xr-x@  1 laterman  staff   7513 11 Mar 17:37 goroutine.pprof.gz
-rwxr-xr-x@  1 laterman  staff  18964 11 Mar 17:37 heap.pprof.gz
-rwxr-xr-x@  1 laterman  staff    395 11 Mar 17:37 mutex.pprof.gz
-rwxr-xr-x@  1 laterman  staff    590 11 Mar 17:37 threadcreate.pprof.gz

Copy link

@@ -549,7 +549,14 @@ components:
# unenroll actions have no `data` attribute
actionRequestDiagnostics:
description: The REQUEST_DIAGNOSTICS action data.
# diagnostics actions have no `data` attribute
properties:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to double check, this is backwards compatible with older agents that don't understand this yet? They'll just ignore it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes.
I ran another e2e test by enrolling an agent without the changes in elastic/elastic-agent#4394 to a fleet-server instance build with the changes in this PR and manually inserting an action with CPU in additional_metrics.
A diagnostics bundle was generated with no cpu.pprof as a part of it

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@michel-laterman michel-laterman merged commit 067d1c8 into elastic:main Mar 12, 2024
8 checks passed
@michel-laterman michel-laterman deleted the diagnostics-cpu branch March 12, 2024 08:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request Team:Fleet Label for the Fleet team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants