Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Automatically detect deleted resources #1386

Open
wants to merge 14 commits into
base: master
Choose a base branch
from

Conversation

rafaelweingartner
Copy link
Contributor

@rafaelweingartner rafaelweingartner commented May 28, 2024

This patch is created on top of #1385, #1381, #1387, and #1388. Therefore, we need to merge them first. Then, the extra commits here will disappear.

While executing some Gnocchi optimizations (#1307), we noticed that some deleted/removed resources do not have the "ended_at" field with a datetime. This can cause slowness with time, as more and more "zombie" resources are left there, and this has a direct impact in the MySQL queries executed with the aggregates API.

This patch introduces a new parameter called metric_inactive_after, which defines for how long a metric can go without receiving new data points until we consider it as inactive. Then, when all metrics of a resource are in inactive state, we can mark/consider the resource as removed.

@rafaelweingartner rafaelweingartner force-pushed the mark-resource-as-deleted-when-they-stop-receiving-measures branch 2 times, most recently from 67d710e to 9d08cc0 Compare May 28, 2024 20:24
@rafaelweingartner
Copy link
Contributor Author

Thanks @pedro-martins !

@rafaelweingartner rafaelweingartner force-pushed the mark-resource-as-deleted-when-they-stop-receiving-measures branch from 7764d2c to 6e76708 Compare July 3, 2024 15:16
@rafaelweingartner
Copy link
Contributor Author

@pedro-martins and @chungg I have rebased this PR and it is ready for your reviews

Copy link
Contributor

@pedro-martins pedro-martins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Rafael, nice feature, I just leave some comments here, but the code overall is good to me. Thanks for this patch.

@rafaelweingartner
Copy link
Contributor Author

Thanks @pedro-martins for your review!

@rafaelweingartner
Copy link
Contributor Author

@chungg thanks for your review! I have added the code changes you suggested, and there are some remarks open, which I would like to hear your response before closing them.

@rafaelweingartner
Copy link
Contributor Author

Hello @chungg is there something else that needs to be addressed here?

@rafaelweingartner
Copy link
Contributor Author

Hello guys,
Is there something missing here to move on and merge it?

This patch is the base for some further optimizations that we are doing and that we would like to propose upstream.

chungg
chungg previously approved these changes Nov 22, 2024
Copy link
Member

@chungg chungg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm ok with this although i think it'd be better if the update logic was in sql rather than python. something like following but in orm. could also be done later.

update resource r set ended_at = %s where not exists (select 1 from metric m where m.resource_id=a.id AND m.last_measure_timestamp >= %s) and r.ended_at is null;

@mergify mergify bot dismissed chungg’s stale review February 11, 2025 16:55

Pull request has been modified.

@rafaelweingartner
Copy link
Contributor Author

@chungg thanks for your review again. I guess there is nothing else to update here. Can we merge the patch? We have other on top of this, that we would like to start working on.

@rafaelweingartner rafaelweingartner force-pushed the mark-resource-as-deleted-when-they-stop-receiving-measures branch from 9a7a623 to 1c4c7e9 Compare February 13, 2025 13:52
rafaelweingartner and others added 11 commits February 13, 2025 10:53
While executing some Gnocchi optimizations (gnocchixyz#1307), we noticed that some deleted/removed resources do not have the "ended_at" field with a datetime. This can cause slowness with time, as more and more "zombie" resources are left there, and this has a direct impact in the MySQL queries executed with the aggregates API.

This patch introduces a new parameter called `metric_inactive_after`, which defines for how long a metric can go without receiving new datapoints until we consider it as inactive. Then, when all metrics of a resource are in inactive state, we can mark/consider the resource as removed.
Co-authored-by: gord chung <5091603+chungg@users.noreply.github.com>
Co-authored-by: gord chung <5091603+chungg@users.noreply.github.com>
Co-authored-by: gord chung <5091603+chungg@users.noreply.github.com>
Co-authored-by: gord chung <5091603+chungg@users.noreply.github.com>
@rafaelweingartner rafaelweingartner force-pushed the mark-resource-as-deleted-when-they-stop-receiving-measures branch 2 times, most recently from 0dc60e5 to 84d9d91 Compare February 14, 2025 16:36
@rafaelweingartner
Copy link
Contributor Author

Hello guys,
the failing tests do not seem to be related to the patch itself. Do you have an idea on how to fix it? @chungg and @tobias-urdin ?

When I run here in my machine, it works 😄

@Callum027
Copy link
Contributor

Hello guys, the failing tests do not seem to be related to the patch itself. Do you have an idea on how to fix it? @chungg and @tobias-urdin ?

When I run here in my machine, it works 😄

I haven't done any investigation into the cause, but it seems like there's some kind of race condition for some specific tests. It's not ideal, but I've just been force-pushing unchanged commits on my PRs until the tests pass.

@rafaelweingartner
Copy link
Contributor Author

Hello guys, the failing tests do not seem to be related to the patch itself. Do you have an idea on how to fix it? @chungg and @tobias-urdin ?
When I run here in my machine, it works 😄

I haven't done any investigation into the cause, but it seems like there's some kind of race condition for some specific tests. It's not ideal, but I've just been force-pushing unchanged commits on my PRs until the tests pass.

So it is not just me :)

@Callum027
Copy link
Contributor

Hello guys, the failing tests do not seem to be related to the patch itself. Do you have an idea on how to fix it? @chungg and @tobias-urdin ?
When I run here in my machine, it works 😄

I haven't done any investigation into the cause, but it seems like there's some kind of race condition for some specific tests. It's not ideal, but I've just been force-pushing unchanged commits on my PRs until the tests pass.

So it is not just me :)

Yep, I've got retries set up in our internal CI pipelines for our Gnocchi deployment for the same reason.

There also seems to be a recurring error where the uWSGI build fails due to some weird system-related issues such as /bin/sh failing to run. In this case it seems like flakiness specific to the GitHub Actions runners.

@rafaelweingartner rafaelweingartner force-pushed the mark-resource-as-deleted-when-they-stop-receiving-measures branch from 8330f57 to 5d0bb33 Compare February 14, 2025 18:26
@Callum027
Copy link
Contributor

Looking at the code, it looks like there's a separate metricd thread that processes incoming measures in an infinite loop with a 0.1 second delay betwen the iterations.

https://github.com/gnocchixyz/gnocchi/blob/master/gnocchi/tests/functional/fixtures.py#L264-L279

100ms is actually a decently long amount of time due to the speed at which these tests are run. The particular tests that fail are probably running their checks before the metricd threads has processed them, and just fail instead of rechecking.

@rafaelweingartner rafaelweingartner force-pushed the mark-resource-as-deleted-when-they-stop-receiving-measures branch from 5d0bb33 to cd1d649 Compare February 17, 2025 12:19
@rafaelweingartner
Copy link
Contributor Author

Looking at the code, it looks like there's a separate metricd thread that processes incoming measures in an infinite loop with a 0.1 second delay betwen the iterations.

https://github.com/gnocchixyz/gnocchi/blob/master/gnocchi/tests/functional/fixtures.py#L264-L279

100ms is actually a decently long amount of time due to the speed at which these tests are run. The particular tests that fail are probably running their checks before the metricd threads has processed them, and just fail instead of rechecking.

Cool, that gave me an idea. What do you think of a config like this one then in the tests?
8fc168f

This would repeat the test if it fails, and wait 1 seconds. Therefore, we could guarantee that the MetricD will already have been executed in the next try.

@Callum027
Copy link
Contributor

Looking at the code, it looks like there's a separate metricd thread that processes incoming measures in an infinite loop with a 0.1 second delay betwen the iterations.
https://github.com/gnocchixyz/gnocchi/blob/master/gnocchi/tests/functional/fixtures.py#L264-L279
100ms is actually a decently long amount of time due to the speed at which these tests are run. The particular tests that fail are probably running their checks before the metricd threads has processed them, and just fail instead of rechecking.

Cool, that gave me an idea. What do you think of a config like this one then in the tests? 8fc168f

This would repeat the test if it fails, and wait 1 seconds. Therefore, we could guarantee that the MetricD will already have been executed in the next try.

That's a great idea. There's normally two tests that are affected by that race condition so if we could retry them when they fail, that should solve it nicely without changing the current behaviour much.

@rafaelweingartner
Copy link
Contributor Author

Thanks for the help @Callum027. @chungg and @tobias-urdin what do you think about this patch now?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

5 participants