Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OpenTelemetry OTLP setup for tracing, take 2 #697

Open
wants to merge 22 commits into
base: main
Choose a base branch
from

Conversation

mzabaluev
Copy link
Collaborator

@mzabaluev mzabaluev commented Oct 15, 2024

Resolves: #555 #556 #559

Summary

  • Categories: protocol-units, networks, util.

Integrate an OTLP exporter for tracing events matching the "movement_tracing" target.
The OTLP gRPC endpoint is configured with the MOVEMENT_OTLP environment variable.

This replaces the earlier "movement_timing" tracing layer, as analysis of OpenTelemetry data should be more versatile.

Changelog

  • Added OpenTelemetry OTLP exporter, with the endpoint URL configured with the MOVEMENT_OTLP environment variable.
  • Removed support for timing JSON logs configured with MOVEMENT_TIMING.

Testing

The telemetry overlay for process-compose runs the jaeger all-in-one Docker container as a local OpenTelemetry collector.

Replace the "movement_timing" tracing target and the logging layer
it targeted with an optionally installed OpenTelemetry OTLP exporter.
The name of the tracing target matched to send OpenTelemetry events is
"movement_telemetry".
Add the telemetry overlay enabling OTLP telemetry
export in suzuka-full-node and m1-da-light-node..

In the telemetry overlay for suzuka-full-node, add
an OTLP collector start job running a docker container.
Provide telemetry API as separate from tracing rather than
a globally installed layer. Installing an OpenTelemetry layer into the
global tracing subscriber raises nasty reentrancy issues because
the OTLP exporter stack also uses tracing under the hood.
Install an OpenTelemetryLayer configured with the OTLP exporter.
The tracing spans and events to export are selected by target
"movement_telemetry".
The implementation of shutdown in the opentelemetry_sdk exporter
calls futures_executor::block_on, which does not play well with
the multithreaded Tokio runtime.
@mzabaluev mzabaluev force-pushed the mikhail/opentelemetry-through-tracing branch from 3a27069 to f713d71 Compare October 16, 2024 12:56
OpenTelemetry needs spans at the top level of
its log event model at least.
Emit telemetry events detailing the success or failure
In the transaction_ingress task of suzuka-full-node
and executor's transaction_pipe, add comments detailing
which metrics the telemetry events are contributing to.
At the points where transaction is dropped in the submit_transaction
method, emit telemetry events. These will help compute the
transaction failure rate.
environment:

processes:
otlp-collector:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't seem to be getting anything served by this and the e2e tests for the basic simulation are failing. Is this how I should be using this?:

just suzuka-full-node native build.setup.eth-local.celestia-local.test.telemetry --keep-project

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Screenshot 2024-10-18 at 10 49 29 AM

This is all I'm seeing from the collector process when I start with the above.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a feed overlay in case for some reason this is not in fact staying open with the --keep-project flag.

just suzuka-full-node native build.setup.eth-local.celestia-local.feed.telemetry --keep-project

@l-monninger
Copy link
Collaborator

@mzabaluev Last commit adds a test. It would be good to add to CI if you like it.

@mzabaluev
Copy link
Collaborator Author

@mzabaluev Last commit adds a test. It would be good to add to CI if you like it.

I have added the overlays to the local test job, as we need some tests to drive the exporting. Do you think it would be better to test separately?

@mzabaluev mzabaluev requested a review from l-monninger October 21, 2024 08:49
@mzabaluev mzabaluev requested a review from musitdev as a code owner November 1, 2024 12:38
@elliottdehn
Copy link

elliottdehn commented Feb 12, 2025

Make sure that downstream tools and dashboards (or any filtering logic) are updated to reflect the change from "movement_timing" to "movement_telemetry" naming, if dependencies exist.

@elliottdehn
Copy link

elliottdehn commented Feb 12, 2025

Can we document logging/tracing practices and how-to in the public repo somewhere? Or at least in internal Notion. Just so people know how to do it, best practices, examples, etc. It's a very big-tent audience of developers so good to have some documentation about it, even if somewhat limited, to guide people.

@0xmovses
Copy link
Contributor

Might be good to commit a README.md to this PR outlining the architecture and implementation, with assistance from Mikhail. Good to start with some solid docs.

@elliottdehn
Copy link

elliottdehn commented Feb 13, 2025

I'd like us to be able to keep track of:

  • Which contracts are calling other contracts (kind of like recording a micro-service graph)
  • Perhaps which functions are calling which functions, and if we can swing it (recording a call graph)
  • The exclusive/inclusive cost of a function (gas of code within the function, vs. gas of code within the function + transitive calls).
  • How much of each gas meter is globally contributing to fee collection, possibly broken down further by various conditions (warm read vs. cold read, etc.)
    This way we can actually understand and potentially optimize for the workload that we observe over time, and better see what optimization techniques at the node level might yield the best results for our specific use case. Much less speculation. There may even be cases where we can approach a specific team with optimization notes from our observations.

Is that possible to do/demonstrate in this PR, or should it be kicked down the road into another PR? This kind of data was existential at a previous company I worked at. If we open source this dataset then people can "mine it" for improvements to the node implementation(s), and possibly even yield bounties (which are measurable by delta yielded) in doing so.

I'm willing to do this myself if you're willing to write the relevant documentation (as noted above) so that I can do so! Thanks a bundle, telemetry is super existential not just for fixing things that are broken but also optimizing things that are working (perhaps even quite well).

@@ -166,7 +168,7 @@ where
}
}
} else {
info!(block_id = ?block_id, "Skipping settlement");
info!(block_id = block_id_hex, "Skipping settlement");

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

target: "movement_telemetry" is missing here, is that intentional?

info!(
block_id = %hex::encode(block_id.clone()),
da_height = da_height,
block_id = block_id_hex,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

target: "movement_telemetry" is missing here too. What's the logic for including/not including?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Emit OpenTelepathy events for ingress transactions
5 participants