OpenTelemetry OTLP setup for tracing, take 2 #697

mzabaluev · 2024-10-15T18:19:06Z

Summary

Categories: protocol-units, networks, util.

Integrate an OTLP exporter for tracing events matching the "movement_telemetry" target.
The OTLP gRPC endpoint is configured with the MOVEMENT_OTLP environment variable.

This replaces the earlier "movement_timing" tracing layer, as analysis of OpenTelemetry data should be more versatile.

Changelog

Added an OpenTelemetry OTLP exporter, with the endpoint URL configured through the MOVEMENT_OTLP environment variable.
Removed support for timing JSON logs configured with MOVEMENT_TIMING.
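
To illustrate the "optionally installed" behaviour described above, here is a minimal std-only sketch of reading the endpoint from MOVEMENT_OTLP. The function names and the choice to treat an empty value as "unset" are assumptions of this sketch, not taken from the PR.

```rust
use std::env;

/// Name of the environment variable, as given in the PR description.
const OTLP_ENV_VAR: &str = "MOVEMENT_OTLP";

/// Normalizes a raw variable value into an optional endpoint URL.
/// Assumption of this sketch: an empty or whitespace-only value means "disabled".
fn endpoint_from(value: Option<String>) -> Option<String> {
    value
        .map(|v| v.trim().to_string())
        .filter(|v| !v.is_empty())
}

/// Returns `Some(url)` when telemetry export should be enabled, `None` otherwise.
fn otlp_endpoint() -> Option<String> {
    endpoint_from(env::var(OTLP_ENV_VAR).ok())
}

/// Human-readable description of the resulting configuration.
fn describe_telemetry_config() -> String {
    match otlp_endpoint() {
        Some(url) => format!("OTLP export enabled, endpoint: {}", url),
        None => format!("{} not set, OTLP export disabled", OTLP_ENV_VAR),
    }
}
```

With the variable unset, the caller would skip installing the exporter altogether. With a value such as a local collector URL, it would pass that URL on to the exporter builder.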

Testing

The telemetry overlay for process-compose runs the Jaeger all-in-one Docker container as a local OpenTelemetry collector.

Provide telemetry API as separate from tracing rather than a globally installed layer. Installing an OpenTelemetry layer into the global tracing subscriber raises nasty reentrancy issues because the OTLP exporter stack also uses tracing under the hood.
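
The reentrancy problem mentioned here can be pictured with a small std-only model: an export call that itself emits events would recurse into the exporter unless something breaks the cycle. The sketch below uses a thread-local flag for that. It is a different technique from the separate API this commit message describes, and `export_guarded` is a hypothetical name.

```rust
use std::cell::Cell;

thread_local! {
    /// True while the current thread is inside an export call.
    static IN_EXPORT: Cell<bool> = Cell::new(false);
}

/// Runs `export` unless this thread is already exporting.
/// Returns `None` when the call was skipped to break the cycle.
/// Limitation: if `export` panics, the flag stays set on this thread.
fn export_guarded<T>(export: impl FnOnce() -> T) -> Option<T> {
    // `replace` returns the previous value: `true` means we are re-entering.
    if IN_EXPORT.with(|flag| flag.replace(true)) {
        return None;
    }
    let result = export();
    IN_EXPORT.with(|flag| flag.set(false));
    Some(result)
}
```

A thread-local flag only covers work done on the same thread, so it would not catch events emitted by an exporter's background tasks. That is one reason to prefer selecting events by target or keeping the telemetry API separate.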

util/tracing/src/tracing.rs

l-monninger · 2024-10-18T16:50:54Z

process-compose/suzuka-full-node/process-compose.telemetry.yml

+environment:
+
+processes:
+  otlp-collector:


I don't seem to be getting anything served by this, and the e2e tests for the basic simulation are failing. Is this how I should be using it?

just suzuka-full-node native build.setup.eth-local.celestia-local.test.telemetry --keep-project

This is all I'm seeing from the collector process when I start with the above.

I added a feed overlay in case for some reason this is not in fact staying open with the --keep-project flag.

just suzuka-full-node native build.setup.eth-local.celestia-local.feed.telemetry --keep-project

l-monninger · 2024-10-18T18:27:02Z

@mzabaluev Last commit adds a test. It would be good to add to CI if you like it.

mzabaluev · 2024-10-21T08:48:49Z

> @mzabaluev Last commit adds a test. It would be good to add to CI if you like it.

I have added the overlays to the local test job, as we need some tests to drive the exporting. Do you think it would be better to test separately?

Cargo.toml

elliottdehn · 2025-02-12T21:31:56Z

Make sure that downstream tools and dashboards (or any filtering logic) are updated to reflect the change from "movement_timing" to "movement_telemetry" naming, if dependencies exist.

elliottdehn · 2025-02-12T22:46:40Z

Can we document logging/tracing practices and how-to in the public repo somewhere? Or at least in internal Notion. Just so people know how to do it, best practices, examples, etc. It's a very big-tent audience of developers so good to have some documentation about it, even if somewhat limited, to guide people.

0xmovses · 2025-02-12T23:32:54Z

Might be good to commit a README.md to this PR outlining the architecture and implementation, with assistance from Mikhail. Good to start with some solid docs.

elliottdehn · 2025-02-13T04:46:18Z

I'd like us to be able to keep track of:

- Which contracts are calling other contracts (kind of like recording a micro-service graph).
- If we can swing it, which functions are calling which functions (recording a call graph).
- The exclusive/inclusive cost of a function (gas of code within the function vs. gas of code within the function plus transitive calls).
- How much each gas meter contributes globally to fee collection, possibly broken down further by various conditions (warm read vs. cold read, etc.).

This way we can actually understand and potentially optimize for the workload that we observe over time, and better see what optimization techniques at the node level might yield the best results for our specific use case. Much less speculation. There may even be cases where we can approach a specific team with optimization notes from our observations.
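
As a rough illustration of the exclusive/inclusive distinction and the call graph idea above, here is a small sketch over a recorded call tree. The `CallNode` type and its fields are hypothetical; nothing like it exists in this PR.

```rust
/// One recorded call (hypothetical shape): gas spent in the function's own
/// code, plus the calls it made.
struct CallNode {
    function: String,
    exclusive_gas: u64,
    children: Vec<CallNode>,
}

/// Inclusive cost: the function's own gas plus the inclusive cost of every
/// transitive callee.
fn inclusive_gas(node: &CallNode) -> u64 {
    node.exclusive_gas + node.children.iter().map(inclusive_gas).sum::<u64>()
}

/// Collects caller -> callee edges in depth-first order, which is enough to
/// rebuild a call graph.
fn call_edges(node: &CallNode, edges: &mut Vec<(String, String)>) {
    for child in &node.children {
        edges.push((node.function.clone(), child.function.clone()));
        call_edges(child, edges);
    }
}
```

For a chain `a -> b -> c` with exclusive costs 20, 10 and 5, the inclusive cost of `a` is 35 and the edges are `(a, b)` and `(b, c)`.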

Is that possible to do/demonstrate in this PR, or should it be kicked down the road into another PR? This kind of data was existential at a previous company I worked at. If we open source this dataset then people can "mine it" for improvements to the node implementation(s), and possibly even yield bounties (which are measurable by delta yielded) in doing so.

I'm willing to do this myself if you're willing to write the relevant documentation (as noted above) so that I can do so! Thanks a bundle, telemetry is super existential not just for fixing things that are broken but also optimizing things that are working (perhaps even quite well).

elliottdehn · 2025-02-16T22:46:31Z

networks/movement/movement-full-node/src/node/tasks/execute_settle.rs

@@ -166,7 +168,7 @@ where
 				}
 			}
 		} else {
-			info!(block_id = ?block_id, "Skipping settlement");
+			info!(block_id = block_id_hex, "Skipping settlement");


target: "movement_telemetry" is missing here, is that intentional?

elliottdehn · 2025-02-16T22:54:56Z

networks/movement/movement-full-node/src/node/tasks/execute_settle.rs

 		info!(
-			block_id = %hex::encode(block_id.clone()),
-			da_height = da_height,
+			block_id = block_id_hex,


target: "movement_telemetry" is missing here too. What's the logic for including/not including?

mzabaluev added 5 commits October 15, 2024 17:00

feat: tracing with OpenTelemetry OTLP

583d124

Replace the "movement_timing" tracing target and the logging layer it targeted with an optionally installed OpenTelemetry OTLP exporter. The name of the tracing target matched to send OpenTelemetry events is "movement_telemetry".

feat(process-compose):telemetry overlay

57845ba

Add the telemetry overlay enabling OTLP telemetry export in suzuka-full-node and m1-da-light-node. In the telemetry overlay for suzuka-full-node, add an OTLP collector start job running a Docker container.

fix(process-compose): correctly specify enviroment

607ba0b

feat: produce some telemetry spans and events

56899ca

Install an OpenTelemetryLayer configured with the OTLP exporter. The tracing spans and events to export are selected by target "movement_telemetry".
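
Selection by target can be modelled without the tracing crates. In the sketch below, `EventMeta` is a hypothetical stand-in for event metadata, the exact-match rule is an assumption, and the module path used in the usage note is only inferred from a file path in this PR. The one fact relied on is that the tracing macros default an event's target to the module path of the call site when no `target:` argument is given.

```rust
/// Target name taken from the commit message above.
const TELEMETRY_TARGET: &str = "movement_telemetry";

/// Minimal stand-in for the metadata of a tracing event (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
struct EventMeta {
    target: String,
    message: String,
}

impl EventMeta {
    /// Models an event emitted with an explicit `target:` argument.
    fn with_target(target: &str, message: &str) -> Self {
        Self { target: target.to_string(), message: message.to_string() }
    }

    /// Models an event emitted without `target:`; the target then defaults
    /// to the module path of the call site.
    fn in_module(module_path: &str, message: &str) -> Self {
        Self::with_target(module_path, message)
    }
}

/// Returns the events the telemetry layer would receive.
/// Assumption of this sketch: the target must match exactly.
fn select_for_export(events: &[EventMeta]) -> Vec<&EventMeta> {
    events.iter().filter(|e| e.target == TELEMETRY_TARGET).collect()
}
```

Under this model an event created with `EventMeta::in_module("movement_full_node::node::tasks::execute_settle", "Skipping settlement")` is not exported, while one created with `EventMeta::with_target("movement_telemetry", ...)` is.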

mzabaluev requested review from l-monninger, andyjsbell and 0xmovses as code owners October 15, 2024 18:19

mzabaluev mentioned this pull request Oct 15, 2024

OpenTelemetry OTLP setup for tracing #670

Closed

fix(tracing): work around another shutdown problem

f713d71

The implementation of shutdown in the opentelemetry_sdk exporter calls futures_executor::block_on, which does not play well with the multithreaded Tokio runtime.
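
The commit message does not say what the workaround is. One general pattern for a blocking shutdown call is to run it on a dedicated OS thread from a guard's `Drop`, so the nested executor never runs on the thread that drops the guard. The sketch below shows that pattern with std only. `ShutdownGuard` is a hypothetical name, and the closure stands in for the exporter's shutdown call.

```rust
use std::thread;

/// Runs a blocking shutdown routine when dropped (hypothetical type).
struct ShutdownGuard {
    shutdown: Option<Box<dyn FnOnce() + Send + 'static>>,
}

impl ShutdownGuard {
    fn new(shutdown: impl FnOnce() + Send + 'static) -> Self {
        Self { shutdown: Some(Box::new(shutdown)) }
    }
}

impl Drop for ShutdownGuard {
    fn drop(&mut self) {
        if let Some(shutdown) = self.shutdown.take() {
            // Run the blocking call on its own thread and wait for it.
            // A panic inside `shutdown` is deliberately ignored here.
            let _ = thread::spawn(shutdown).join();
        }
    }
}
```

The guard still blocks the dropping thread until shutdown finishes, so it is best dropped at the very end of `main`. Inside async code, `tokio::task::spawn_blocking` is the more idiomatic way to move a blocking call off the worker threads.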

mzabaluev force-pushed the mikhail/opentelemetry-through-tracing branch from 3a27069 to f713d71 Compare October 16, 2024 12:56

feat: spans for each telemetry event

8ce2e07

OpenTelemetry needs spans at the top level of its log event model at least.

mzabaluev added the cicd:suzuka-full-node label Oct 16, 2024

mzabaluev added 4 commits October 17, 2024 16:12

feat(suzuka-full-node): execute_block telemetry

c915948

Emit telemetry events detailing the success or failure

chore: comments on some metrics

c1ba7fc

In the transaction_ingress task of suzuka-full-node and executor's transaction_pipe, add comments detailing which metrics the telemetry events are contributing to.

feat(opt-executor): emit telemetry on tx failure

372f9b4

At the points where a transaction is dropped in the submit_transaction method, emit telemetry events. These will help compute the transaction failure rate.
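
For illustration, the failure rate mentioned above could be derived from such events roughly as follows. `TxEvent` is a hypothetical simplification. In practice the events would carry more fields, and the rate would be computed by the telemetry backend rather than in the node.

```rust
/// Simplified outcome of a submitted transaction (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TxEvent {
    Accepted,
    Dropped,
}

/// Fraction of submitted transactions that were dropped.
/// Returns `None` when there are no events, to avoid dividing by zero.
fn failure_rate(events: &[TxEvent]) -> Option<f64> {
    if events.is_empty() {
        return None;
    }
    let dropped = events.iter().filter(|e| **e == TxEvent::Dropped).count();
    Some(dropped as f64 / events.len() as f64)
}
```

One dropped transaction out of four submitted gives a rate of 0.25.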

feat(suzuka-full-node): telemetry on executed tx

acc202b

l-monninger requested changes Oct 18, 2024

fix: add feed.

954c8e4

mzabaluev mentioned this pull request Oct 18, 2024

Centralized Signed Blobs #604

Closed

fix: add otlp test.

e6144f8

mzabaluev added 3 commits October 21, 2024 11:26

Merge branch 'main' into mikhail/opentelemetry-through-tracing

3fed933

chore(tracing): comment on the guard

f59cb2f

feat(ci): add telemetry to local tests

2fab8d7

Apply the telemetry overlay when running tests in the local setup.

mzabaluev requested a review from l-monninger October 21, 2024 08:49

mzabaluev added 2 commits October 21, 2024 23:10

Merge branch 'main' into mikhail/opentelemetry-through-tracing

f2af21d

feat(docker-compose): add telemetry overlay

f06cf53

Port the simple jaeger setup from process-compose.

mzabaluev mentioned this pull request Oct 22, 2024

Fix local build scripts for process-compose #734

Merged

SA124 assigned mzabaluev Oct 29, 2024

0xmovses mentioned this pull request Oct 30, 2024

Setup Relayer with OpenTelemetry #776

Open

andygolay reviewed Nov 1, 2024

Merge branch 'main' into mikhail/opentelemetry-through-tracing

2688dec

mzabaluev requested a review from musitdev as a code owner November 1, 2024 12:38

mzabaluev added 3 commits November 1, 2024 17:38

chore: commit Cargo.lock updates

a2e6653

Merge branch 'main' into mikhail/opentelemetry-through-tracing

1474c53

Merge branch 'main' into mikhail/opentelemetry-through-tracing

f61c345

elliottdehn reviewed Feb 16, 2025
