Automatically detect deleted resources #1386
base: master
Conversation
Force-pushed from 67d710e to 9d08cc0.
Thanks @pedro-martins!
Force-pushed from 7764d2c to 6e76708.
@pedro-martins and @chungg, I have rebased this PR and it is ready for your reviews.
Hi Rafael, nice feature. I just left some comments here, but overall the code looks good to me. Thanks for this patch.
Thanks @pedro-martins for your review!
@chungg thanks for your review! I have added the code changes you suggested, and there are some open remarks on which I would like to hear your response before closing them.
Hello @chungg, is there something else that needs to be addressed here?
Hello guys, this patch is the base for some further optimizations that we are doing and that we would like to propose upstream.
I'm OK with this, although I think it'd be better if the update logic were in SQL rather than Python. Something like the following, but expressed through the ORM; it could also be done later.
update resource r set ended_at = %s where not exists (select 1 from metric m where m.resource_id = r.id and m.last_measure_timestamp >= %s) and r.ended_at is null;
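For reference, here is a minimal sketch of what that bulk UPDATE could look like through the SQLAlchemy ORM. The `Resource` and `Metric` models below are simplified stand-ins, not Gnocchi's actual indexer schema, so column names and types may differ:

```python
# Illustrative only: a bulk UPDATE equivalent to the SQL above, expressed
# with the SQLAlchemy ORM. The models are simplified stand-ins, not
# Gnocchi's real indexer models.
import datetime

from sqlalchemy import Column, DateTime, ForeignKey, String, and_, exists
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class Resource(Base):
    __tablename__ = 'resource'
    id = Column(String(36), primary_key=True)
    ended_at = Column(DateTime, nullable=True)


class Metric(Base):
    __tablename__ = 'metric'
    id = Column(String(36), primary_key=True)
    resource_id = Column(String(36), ForeignKey('resource.id'))
    last_measure_timestamp = Column(DateTime)


def mark_inactive_resources_ended(session, inactive_after):
    """Set ended_at on resources whose metrics all stopped receiving measures."""
    now = datetime.datetime.utcnow()
    cutoff = now - inactive_after
    # Correlated EXISTS: the resource still has at least one active metric.
    has_active_metric = exists().where(and_(
        Metric.resource_id == Resource.id,
        Metric.last_measure_timestamp >= cutoff,
    ))
    # One bulk UPDATE statement, no per-row Python loop.
    session.query(Resource).filter(
        Resource.ended_at.is_(None),
        ~has_active_metric,
    ).update({Resource.ended_at: now}, synchronize_session=False)
```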
gnocchi/indexer/alembic/versions/f89ed2e3c2ec_create_last_measure_timestamp_column.py
@chungg thanks for your review again. I guess there is nothing else to update here. Can we merge the patch? We have others on top of this that we would like to start working on.
Force-pushed from 9a7a623 to 1c4c7e9.
While executing some Gnocchi optimizations (gnocchixyz#1307), we noticed that some deleted/removed resources do not have the "ended_at" field set to a datetime. This can cause slowness over time, as more and more "zombie" resources are left behind, and this has a direct impact on the MySQL queries executed by the aggregates API. This patch introduces a new parameter called `metric_inactive_after`, which defines how long a metric can go without receiving new datapoints before we consider it inactive. Then, when all metrics of a resource are inactive, we can mark the resource as removed.
Co-authored-by: gord chung <5091603+chungg@users.noreply.github.com>
Force-pushed from 0dc60e5 to 84d9d91.
Hello guys, when I run it here on my machine, it works 😄
I haven't done any investigation into the cause, but it seems like there's some kind of race condition for some specific tests. It's not ideal, but I've just been force-pushing unchanged commits on my PRs until the tests pass.
So it is not just me :)
Yep, I've got retries set up in our internal CI pipelines for our Gnocchi deployment for the same reason. There also seems to be a recurring error where the uWSGI build fails due to some weird system-related issues such as …
Force-pushed from 8330f57 to 5d0bb33.
Looking at the code, it looks like there's a separate processing step set up in the test fixtures: https://github.com/gnocchixyz/gnocchi/blob/master/gnocchi/tests/functional/fixtures.py#L264-L279. 100ms is actually a decently long amount of time given the speed at which these tests are run. The particular tests that fail are probably running their checks before the metricd threads have processed the measures, and they just fail instead of rechecking.
Force-pushed from 5d0bb33 to cd1d649.
Cool, that gave me an idea. What do you think of a config like this one in the tests, then? It would repeat the test if it fails and wait 1 second. That way, we could guarantee that MetricD will already have been executed by the next try.
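The actual configuration attached to that comment is not reproduced in this thread; as an illustration, one way to get this retry-with-delay behaviour is the pytest-rerunfailures plugin, assuming the affected tests are ordinary pytest tests:

```python
# Illustrative only: re-run a flaky test up to 3 times with a 1-second delay
# between attempts, using the pytest-rerunfailures plugin. The configuration
# actually proposed in the comment may have used a different mechanism.
import pytest


@pytest.mark.flaky(reruns=3, reruns_delay=1)
def test_resource_marked_ended_after_metrics_go_inactive():
    # Hypothetical test body: by the time of the retry, the metricd
    # processing thread should have caught up with the posted measures.
    ...
```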
That's a great idea. There are normally two tests affected by that race condition, so if we could retry them when they fail, that should solve it nicely without changing the current behaviour much.
Thanks for the help @Callum027. @chungg and @tobias-urdin, what do you think about this patch now?
This patch is created on top of #1385, #1381, #1387, and #1388. Therefore, we need to merge them first. Then, the extra commits here will disappear.
While executing some Gnocchi optimizations (#1307), we noticed that some deleted/removed resources do not have the "ended_at" field set to a datetime. This can cause slowness over time, as more and more "zombie" resources are left behind, and this has a direct impact on the MySQL queries executed by the aggregates API.
This patch introduces a new parameter called `metric_inactive_after`, which defines how long a metric can go without receiving new data points before we consider it inactive. Then, when all metrics of a resource are inactive, we can mark the resource as removed.
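As a rough illustration of what the new option could look like when registered with oslo.config (which Gnocchi uses for configuration), here is a sketch; the actual option group, type, default, and help text in the patch may differ:

```python
# Hypothetical sketch of registering the new option with oslo.config.
# The real group, default value, and wording in the patch may differ.
from oslo_config import cfg

OPTS = [
    cfg.IntOpt(
        'metric_inactive_after',
        default=0,
        min=0,
        help='Number of seconds a metric can go without receiving new '
             'measures before it is considered inactive. When every metric '
             'of a resource is inactive, the resource is marked as ended.'),
]
```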