Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Fleet] Allow collection CPU profiles when collecting agent diagnostics #162849

Closed
Tracked by #3640
cmacknz opened this issue Jul 31, 2023 · 15 comments · Fixed by #179819
Closed
Tracked by #3640

[Fleet] Allow collection CPU profiles when collecting agent diagnostics #162849

cmacknz opened this issue Jul 31, 2023 · 15 comments · Fixed by #179819
Assignees
Labels
QA:Ready for Testing Code is merged and ready for QA to validate Team:Fleet Team label for Observability Data Collection Fleet team

Comments

@cmacknz
Copy link
Member

cmacknz commented Jul 31, 2023

elastic/elastic-agent#2140 restored the ability of the elastic-agent diagnostics command to collect CPU profiles, which is critically important when diagnosing CPU performance problems.

CPU profile collection is optional and is done when the --cpu-profiles option is provided like elastic-agent diagnostics --cpu-profile.

Add a check box or toggle allowing users to opt into collecting CPU profiles when collecting diagnostics through Fleet. The diagnostics action will need to be modified to include this new boolean switch.

Also note that each CPU profile takes approximately 30s to capture, so it is likely that the timeout for the diagnostics action will need to be increased when this option is selected.

Depends on elastic/elastic-agent#3491

@cmacknz cmacknz added the Team:Fleet Team label for Observability Data Collection Fleet team label Jul 31, 2023
@elasticmachine
Copy link
Contributor

Pinging @elastic/fleet (Team:Fleet)

@jlind23
Copy link
Contributor

jlind23 commented Aug 28, 2023

@zombieFox could you please help us out here with the design?
cc @nimarezainia

@jlind23
Copy link
Contributor

jlind23 commented Sep 22, 2023

@nimarezainia @zombieFox I don't see any update here, we are we at? Are we just going down the path of adding a simple checkbox in the UI below?

Image

@nimarezainia
Copy link
Contributor

We should use the ability to generate UI from config to do this. I'll add it to the list.

(fyi @amitkanfer @kpollich )

@amitkanfer
Copy link

where's the list ? :)

@jlind23
Copy link
Contributor

jlind23 commented Sep 28, 2023

Discussed today with @zombieFox and @nimarezainia, we all agreed to start with a checkbox to enable CPU profile collection. Hence moving this to a ready state.

@jlind23
Copy link
Contributor

jlind23 commented Sep 28, 2023

@kpollich @juliaElastic for the request diagnostic, can we send the --cpu-profiles options as part of the action data here ? Or do we need to change the method signature to supply external options?

@juliaElastic
Copy link
Contributor

I think we have to extend the REQUEST_DIAGNOSTICS action with a parameter to pass if cpu-profiles should be included. This requires a change in kibana and fleet-server.

@jlind23
Copy link
Contributor

jlind23 commented Sep 28, 2023

So this should be available only if fleet server is in the right version though.

@juliaElastic
Copy link
Contributor

juliaElastic commented Sep 28, 2023

Checked this further and I think it also needs changes on elastic-agent side, to add a cpuProfile flag to the request diagnostics action and apply the flag when collecting diagnostics in the handle function.
@cmacknz I think the agent changes are a prerequisite of the kibana and fleet-server changes.

@cmacknz
Copy link
Member Author

cmacknz commented Sep 29, 2023

Correct the change needs to be made across Fleet, Fleet Server, and Elastic Agent. Prior to making the Elastic Agent change the new parameter in the diagnostics action needs to be defined.

You can't test any of this without implementing the whole system, the easiest option is to have whoever makes the Fleet Server (and possibly Fleet) change also change the agent. This is just passing in a function argument that already exists in the right place. We could also have multiple people do this, that's just more coordination overhead.

I wrote up how to do this in detail in elastic/elastic-agent#3491.

If the Fleet changes come first and are merged into a snapshot image we can write an E2E test for it in the agent repository. This could be done after manual verification end to end though.

@jlind23
Copy link
Contributor

jlind23 commented Oct 2, 2023

@cmacknz changed the area for: elastic/elastic-agent#3491 As I believe one of the fleet engineer can do both Elastic Agent and Fleet Server change at the same time.

@nimarezainia
Copy link
Contributor

moving this to sprint 27 as the item blocking it is in sprint 26

@michel-laterman
Copy link

Changes to fleet-server are here: elastic/fleet-server#3333
the openapi spec has been updated so diagnostics actions can have data.additional_metrics: []string.
In order to gather CPU profiles "CPU" needs to be included in the list.

@kpollich
Copy link
Member

We should add a confirmation modal to the request diagnostics flow that allows the user to check a checkbox for Collect additional CPU metrics. This should follow the conventions elsewhere where we create Agent actions.

juliaElastic added a commit that referenced this issue Apr 3, 2024
## Summary

Closes #162849

Added option to request agent diagnostics with CPU metrics. This is a
new checkbox in Request diagnostics modal. When clicked, the action will
contain a `additional_metrics: ["CPU"]`.
Also replaced the click handler of `Request diagnotics .zip` button in
Agent Details / Diagnostics tab, to open the same modal instead of
immediately triggering the action.

To verify:
- Enroll 2 agents with version `8.14-SNAPSHOT`
- Click on Agent diagnostics action in Agent list / Agent details 
- Verify that the CPU checkbox appears in the modal window
- Verify that the diagnotics zip contains `cpu.pprof` file
- Verify the same in bulk request diagnostics action when selecting 2
agents

<img width="1412" alt="image"
src="https://github.com/elastic/kibana/assets/90178898/4b8ee884-42e7-4da1-91cb-903fa05a2abf">

Action created in `.fleet-actions`:
```
      {
        "_index": ".fleet-actions-7",
        "_id": "224e484f-fc49-4fba-989e-8ac30db29eff",
        "_score": 1,
        "_source": {
          "@timestamp": "2024-04-02T12:24:49.062Z",
          "expiration": "2024-04-02T12:27:49.062Z",
          "agents": [
            "b4a5d936-265c-40b1-be6a-d13ca91bed36"
          ],
          "action_id": "1fc01f99-c511-49bf-9350-96ddf537b562",
          "data": {
            "additional_metrics": [
              "CPU"
            ]
          },
          "type": "REQUEST_DIAGNOSTICS",
          "traceparent": "00-a5e3c645397964952efbeceed896e9db-03730cd3f6d60955-01"
        }
      },
```

### Checklist

Delete any items that are not applicable to this PR.

- [x] Any text added follows [EUI's writing
guidelines](https://elastic.github.io/eui/#/guidelines/writing), uses
sentence case text and includes [i18n
support](https://github.com/elastic/kibana/blob/main/packages/kbn-i18n/README.md)
- [ ] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or added to match the most common scenarios
@juliaElastic juliaElastic added the QA:Ready for Testing Code is merged and ready for QA to validate label Apr 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
QA:Ready for Testing Code is merged and ready for QA to validate Team:Fleet Team label for Observability Data Collection Fleet team
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants