
Use linear buckets in some places #3384

Merged
ndr-ds merged 1 commit into main from 02-20-use_linear_buckets_in_some_places on Feb 24, 2025

Conversation

@ndr-ds (Contributor) commented on Feb 21, 2025

Motivation

We currently use exponential buckets everywhere. They have the advantage of generating fewer buckets, which can be cheaper on Grafana Cloud and similar backends, but they also make the buckets very wide, which makes our Prometheus data less accurate.

Proposal

I need to spend some time later looking at the testnet data for these metrics to fine-tune the buckets, but for now I'm just changing the proxy/server latencies to use linear buckets instead.
I also changed the default starting value to 0.001, as 1 microsecond should be enough for most of what we're measuring; this also takes at least one bucket away from the metrics that use the default.
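
For reference, here's a minimal sketch of what this looks like with the `prometheus` crate (the metric name, label, and exact bucket parameters below are illustrative, not the values in this PR):

```rust
// Illustrative sketch using the `prometheus` crate directly; our own metric
// helpers are not shown here, and the name, label, and bucket parameters are
// made up.
use prometheus::{histogram_opts, register_histogram_vec, HistogramVec};

fn proxy_latency_histogram() -> HistogramVec {
    // linear_buckets(start, width, count): constant-width buckets, e.g.
    // starting at 1 ms and growing by 5 ms per bucket.
    let buckets = prometheus::linear_buckets(1.0, 5.0, 50).expect("valid bucket parameters");
    register_histogram_vec!(
        histogram_opts!(
            "proxy_request_latency_ms",
            "Proxy request latency in milliseconds",
            buckets
        ),
        &["method"]
    )
    .expect("metric can be registered")
}
```

Unlike `exponential_buckets`, every bucket produced this way has the same width, so the interpolation error on quantile queries stays bounded.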

Test Plan

Run a validator locally and check that the metrics are exported with the new buckets.

Release Plan

  • Nothing to do / These changes follow the usual release cycle.

@ndr-ds force-pushed the 02-20-use_linear_buckets_in_some_places branch from 3e55e27 to 9faa4f1 on February 21, 2025 02:09
@ndr-ds force-pushed the 02-19-have_one_thread_per_chain_on_benchmark branch from 43ce4d1 to f48ee51 on February 21, 2025 14:52
@ndr-ds force-pushed the 02-20-use_linear_buckets_in_some_places branch from 9faa4f1 to f0d0d79 on February 21, 2025 14:52
@ndr-ds force-pushed the 02-19-have_one_thread_per_chain_on_benchmark branch from f48ee51 to fd0602b on February 21, 2025 14:53
@ndr-ds force-pushed the 02-20-use_linear_buckets_in_some_places branch from f0d0d79 to 5dcb117 on February 21, 2025 14:53
@ndr-ds changed the base branch from 02-19-have_one_thread_per_chain_on_benchmark to graphite-base/3384 on February 21, 2025 15:28
@ndr-ds force-pushed the 02-20-use_linear_buckets_in_some_places branch from 5dcb117 to 6984d86 on February 21, 2025 15:28
@ndr-ds force-pushed the graphite-base/3384 branch from fd0602b to cfefa52 on February 21, 2025 15:28
@ndr-ds changed the base branch from graphite-base/3384 to main on February 21, 2025 15:28
@ndr-ds force-pushed the 02-20-use_linear_buckets_in_some_places branch from 6984d86 to a44c253 on February 21, 2025 15:28
@MathieuDutSik (Contributor) left a comment

I do not use the buckets; instead I use the sum and the counts, which gives me more raw data.

That being said, while linear scaling might be better in many cases, I think exponential is better for latency, so I do not like the global change from exponential to linear.

@ndr-ds (Contributor, Author) commented on Feb 21, 2025

Why is exponential better for latency? It gives us less accurate latency values, as I explained in the PR description.

@ndr-ds force-pushed the 02-20-use_linear_buckets_in_some_places branch from d4bb1fb to a44c253 on February 21, 2025 20:20
@MathieuDutSik (Contributor) commented

> Why is exponential better for latency? It gives us less accurate latency values, as I explained in the PR description.

I thought latency followed more of an exponential curve, but if you see it differently, fine.
But what about the number of buckets? Could we end up with too much data in Prometheus?

If it were up to me, I would remove the buckets altogether; I see them mostly as something useful for presentation.
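
Concretely, something like this is all I really need (illustrative sketch, no buckets involved); on the dashboard side the same thing is usually `rate(..._sum) / rate(..._count)`:

```rust
// Illustrative sketch only: average latency derived from a histogram's sample
// sum and count, independent of the bucket layout.
use prometheus::Histogram;

fn average_latency_ms(histogram: &Histogram) -> Option<f64> {
    let count = histogram.get_sample_count();
    if count == 0 {
        None
    } else {
        Some(histogram.get_sample_sum() / count as f64)
    }
}
```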

@ndr-ds (Contributor, Author) commented on Feb 24, 2025

The issue here is that the wider our buckets, the less accurate our quantiles (p50, p90, p99, etc.) will be, which is not good, because those are what we'll be looking at in our dashboards to monitor the system. So we need to make our buckets as narrow as possible without generating too many of them, which can be expensive when using Grafana Cloud, for example. These histograms were starting at 0.0001 ms with base-3 exponential growth. Now they start at 1 ms and grow linearly, for both the proxy and server latencies. We're probably not interested in latencies below 1 ms for these, so this is fine. This generates fewer buckets than before, and the buckets are narrower, so our data will be more accurate.
So this is a win on every front, IMHO.
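
To make the width difference concrete, here's an illustrative comparison (the bucket counts and the 10 ms linear width are made-up numbers, not the exact values in this PR):

```rust
// Illustrative only: the bucket counts and the 10 ms width are assumptions,
// not the exact values used in this PR.
fn main() {
    // Old style: base-3 exponential growth starting at 0.0001 ms. The last
    // bucket here spans roughly 160 ms to 480 ms, i.e. it is ~320 ms wide.
    let exponential = prometheus::exponential_buckets(0.0001, 3.0, 15).expect("valid parameters");

    // New style: linear buckets starting at 1 ms. Every bucket is exactly
    // 10 ms wide, so the quantile interpolation error stays bounded.
    let linear = prometheus::linear_buckets(1.0, 10.0, 14).expect("valid parameters");

    println!("exponential edges (ms): {exponential:?}");
    println!("linear edges (ms):      {linear:?}");
}
```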

@MathieuDutSik (Contributor) left a comment

I approve because there are fewer buckets with this system than with the preceding one.

@ndr-ds (Contributor, Author) commented on Feb 24, 2025

Merge activity

  • Feb 24, 11:56 AM EST: A user started a stack merge that includes this pull request via Graphite.
  • Feb 24, 11:56 AM EST: A user merged this pull request with Graphite.

@ndr-ds merged commit 55b9e8d into main on Feb 24, 2025
47 checks passed
@ndr-ds deleted the 02-20-use_linear_buckets_in_some_places branch on February 24, 2025 16:56