Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[APM] Service map new API #212550

Open
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

crespocarlos
Copy link
Contributor

@crespocarlos crespocarlos commented Feb 26, 2025

closes #212252

Summary

This PR replaces the scripted_metric aggregation used to retrieve the data for the service map.

The new solution relies on samples of exit spans - each representing a unique combination of service.name and span.destination.service.resource - along with their child transactions. The Service Map is now built entirely on the client side to reduce server-side load and prevent excessive event loop utilization.

Note

  • transform_service_map_responses.ts was refactored to improve readability and performance, The file was renamed to get_service_map_nodes.ts
  • group_resource_nodes.ts was refactored to improve readability and performance

Consequences

  • The new solution requires all exit spans to have the span.destination.service.resource field populated — with the exception of messaging systems, which may rely on span.links (not addressed in this PR)
    • A warning will be added to the trace waterfall for exit spans without span.destination.sevice.resource #212638
image image
  • When multiple services point to load balancers, they will share the same span.destination.service.resource. This could lead to incomplete paths in the map, as the path is built for the first service.name + span.destination.service.resource pair returned processed.
    • This can't be addressed, but we'll look into ways to inform the user when the logic identifies this scenario #213124
current new
image image

Analysis

The performance analysis below uses data from the edge cluster and the service_map_oom synthtrace scenario, simulating long traces. The selected date range was 24h.

Current solution

image

numeric_labels.event_loop_active: 4085.601743
numeric_labels.event_loop_utilization: 0.28716

New solution

image

numeric_labels.event_loop_active: 887.149512
numeric_labels.event_loop_utilization: 0.123929

On the client side, the most CPU-intensive operation is performed by cytoscape. The creation of service connections performs efficiently.

image

How to test

  • Add xpack.apm.serviceMapV2Enabled: true to kibana.dev.yml
  • Navigate to APM > Services Inventory > Service Map

@crespocarlos crespocarlos force-pushed the 212252-refactor-service-map branch 9 times, most recently from a7df664 to 5fa4ec0 Compare February 28, 2025 17:09
@crespocarlos crespocarlos changed the title 212252 refactor service map [APM] Service map new API Mar 3, 2025
@crespocarlos crespocarlos force-pushed the 212252-refactor-service-map branch 3 times, most recently from 2f2be9a to b648b47 Compare March 3, 2025 11:06
@crespocarlos crespocarlos force-pushed the 212252-refactor-service-map branch 2 times, most recently from 87ef10e to b58aaaa Compare March 4, 2025 15:25
return [childSpan.destination(next.serviceInstance.fields['service.name']!)];
return [
childSpan.destination(next.serviceInstance.fields['service.name']!).children(
next.serviceInstance
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exit spans are followed by a transaction. This makes the synthtrace scenario adhere to the real data

@crespocarlos crespocarlos force-pushed the 212252-refactor-service-map branch from 4bc9dd8 to 8afc9a8 Compare March 4, 2025 22:07
@crespocarlos crespocarlos marked this pull request as ready for review March 5, 2025 11:59
@crespocarlos crespocarlos requested review from a team as code owners March 5, 2025 11:59
@crespocarlos crespocarlos requested a review from a team as a code owner March 5, 2025 11:59
if (reversedConnection) {
reversedConnection.bidirectional = true;
return prev.concat({
...connection,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this have to be a new object with different property set? Just trying to understand some of the perf impact of some of these operations

Copy link
Contributor Author

@crespocarlos crespocarlos Mar 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, considering that reverseConnection is mutated, I would say that we don't really need to create a new object here either. I doubt that there will be any noticeable performance improvement though. I'll change it anyways.

@botelastic botelastic bot added ci:project-deploy-observability Create an Observability project Team:obs-ux-infra_services Observability Infrastructure & Services User Experience Team labels Mar 5, 2025
@elasticmachine
Copy link
Contributor

Pinging @elastic/obs-ux-infra_services-team (Team:obs-ux-infra_services)

Copy link
Contributor

github-actions bot commented Mar 5, 2025

🤖 GitHub comments

Expand to view the GitHub comments

Just comment with:

  • /oblt-deploy : Deploy a Kibana instance using the Observability test environments.
  • run docs-build : Re-trigger the docs validation. (use unformatted text in the comment!)

@crespocarlos crespocarlos added release_note:skip Skip the PR/issue when compiling release notes backport:skip This commit does not require backporting labels Mar 5, 2025
@elasticmachine
Copy link
Contributor

elasticmachine commented Mar 5, 2025

💛 Build succeeded, but was flaky

  • Buildkite Build
  • Commit: f4893d5
  • Kibana Serverless Image: docker.elastic.co/kibana-ci/kibana-serverless:pr-212550-f4893d5ec3b8

Failed CI Steps

Metrics [docs]

Module Count

Fewer modules leads to a faster build time

id before after diff
apm 1890 1894 +4

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

id before after diff
apm 2.5MB 2.5MB +6.4KB

History


// List of nodes that are services
const serviceNodes = new Map(
[...Array.from(allNodes.values()), ...connectionFromDiscoveredServices]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.values() returns an iterator, so with spreading into an array, Array.from is redundant and adds an extra allocation that can be avoided.

Suggested change
[...Array.from(allNodes.values()), ...connectionFromDiscoveredServices]
[...allNodes.values(), ...connectionFromDiscoveredServices]


const uniqueNodes = mappedEdges
.flatMap((connection) => [connection.sourceData, connection.targetData])
.concat(Array.from(allServices.values()))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
.concat(Array.from(allServices.values()))
.concat(...allServices.values())

We can just spread instead of doing an additional array allocation

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport:skip This commit does not require backporting ci:project-deploy-observability Create an Observability project release_note:skip Skip the PR/issue when compiling release notes Team:obs-ux-infra_services Observability Infrastructure & Services User Experience Team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[APM] Refactor the Service Map removing the scripted metric agg query
4 participants