-
Notifications
You must be signed in to change notification settings - Fork 87
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
OpenTelemetry OTLP setup for tracing, take 2 #697
base: main
Are you sure you want to change the base?
Conversation
Replace the "movement_timing" tracing target and the logging layer it targeted with an optionally installed OpenTelemetry OTLP exporter. The name of the tracing target matched to send OpenTelemetry events is "movement_telemetry".
Add the telemetry overlay enabling OTLP telemetry export in suzuka-full-node and m1-da-light-node.. In the telemetry overlay for suzuka-full-node, add an OTLP collector start job running a docker container.
Provide telemetry API as separate from tracing rather than a globally installed layer. Installing an OpenTelemetry layer into the global tracing subscriber raises nasty reentrancy issues because the OTLP exporter stack also uses tracing under the hood.
Install an OpenTelemetryLayer configured with the OTLP exporter. The tracing spans and events to export are selected by target "movement_telemetry".
The implementation of shutdown in the opentelemetry_sdk exporter calls futures_executor::block_on, which does not play well with the multithreaded Tokio runtime.
3a27069
to
f713d71
Compare
OpenTelemetry needs spans at the top level of its log event model at least.
Emit telemetry events detailing the success or failure
In the transaction_ingress task of suzuka-full-node and executor's transaction_pipe, add comments detailing which metrics the telemetry events are contributing to.
At the points where transaction is dropped in the submit_transaction method, emit telemetry events. These will help compute the transaction failure rate.
environment: | ||
|
||
processes: | ||
otlp-collector: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't seem to be getting anything served by this and the e2e tests for the basic simulation are failing. Is this how I should be using this?:
just suzuka-full-node native build.setup.eth-local.celestia-local.test.telemetry --keep-project
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a feed
overlay in case for some reason this is not in fact staying open with the --keep-project
flag.
just suzuka-full-node native build.setup.eth-local.celestia-local.feed.telemetry --keep-project
@mzabaluev Last commit adds a test. It would be good to add to CI if you like it. |
Apply the telemetry overlay when running tests in the local setup.
I have added the overlays to the local test job, as we need some tests to drive the exporting. Do you think it would be better to test separately? |
Port the simple jaeger setup from process-compose.
Make sure that downstream tools and dashboards (or any filtering logic) are updated to reflect the change from "movement_timing" to "movement_telemetry" naming, if dependencies exist. |
Can we document logging/tracing practices and how-to in the public repo somewhere? Or at least in internal Notion. Just so people know how to do it, best practices, examples, etc. It's a very big-tent audience of developers so good to have some documentation about it, even if somewhat limited, to guide people. |
Might be good to commit a README.md to this PR outlining the architecture and implementation, with assistance from Mikhail. Good to start with some solid docs. |
I'd like us to be able to keep track of:
Is that possible to do/demonstrate in this PR, or should it be kicked down the road into another PR? This kind of data was existential at a previous company I worked at. If we open source this dataset then people can "mine it" for improvements to the node implementation(s), and possibly even yield bounties (which are measurable by delta yielded) in doing so. I'm willing to do this myself if you're willing to write the relevant documentation (as noted above) so that I can do so! Thanks a bundle, telemetry is super existential not just for fixing things that are broken but also optimizing things that are working (perhaps even quite well). |
@@ -166,7 +168,7 @@ where | |||
} | |||
} | |||
} else { | |||
info!(block_id = ?block_id, "Skipping settlement"); | |||
info!(block_id = block_id_hex, "Skipping settlement"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
target: "movement_telemetry"
is missing here, is that intentional?
info!( | ||
block_id = %hex::encode(block_id.clone()), | ||
da_height = da_height, | ||
block_id = block_id_hex, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
target: "movement_telemetry"
is missing here too. What's the logic for including/not including?
Resolves: #555 #556 #559
Summary
protocol-units
,networks
,util
.Integrate an OTLP exporter for tracing events matching the "movement_tracing" target.
The OTLP gRPC endpoint is configured with the MOVEMENT_OTLP environment variable.
This replaces the earlier "movement_timing" tracing layer, as analysis of OpenTelemetry data should be more versatile.
Changelog
MOVEMENT_OTLP
environment variable.MOVEMENT_TIMING
.Testing
The
telemetry
overlay for process-compose runs the jaeger all-in-one Docker container as a local OpenTelemetry collector.