-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
When creating or updating a metric alert and load shedding is enabled… #69224
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm so far
src/sentry/features/temporary.py
Outdated
@@ -105,6 +105,7 @@ def register_temporary_features(manager: FeatureManager): | |||
manager.add("organizations:mep-use-default-tags", OrganizationFeature, FeatureHandlerStrategy.REMOTE) | |||
manager.add("organizations:metric-alert-chartcuterie", OrganizationFeature, FeatureHandlerStrategy.INTERNAL) | |||
manager.add("organizations:metric-alert-ignore-archived", OrganizationFeature, FeatureHandlerStrategy.REMOTE) | |||
manager.add("organizations:metric-alert-load-shedding", OrganizationFeature, FeatureHandlerStrategy.INTERNAL) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Be my guinea pig! Set this to FeatureHandlerStrategy.OPTIONS
and follow these instructions https://develop.sentry.dev/feature-flags/#building-your-options-based-feature
What's the plan for how we'll test this though? Are you just wanting to make sure it works ok in sentry before we enable it everywhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I was thinking we could roll it out internally first and validate it, and then enable it wider. I wasn't sure if we could enable for all orgs at once like this or not though.
👍 i'll update the flag info stuff though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was looking at making the PR to options automator, but got a little confused from the docs. I think that i can just add this to the option and copy pasta the result from the CLI too. It doesn't look like there are any keys with the prefix organizations
though, do i need to prefix with team name or anything there?
src/sentry/incidents/logic.py
Outdated
resolution = DEFAULT_ALERT_RULE_LOAD_SHEDDING_RESOLUTIONS.get( | ||
time_window, DEFAULT_ALERT_RULE_RESOLUTION | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be wise to structure this as ranges, so 1-29 is 1, 30-59 is 2, 60-239 is 3, 240-999 is 5, 1000-1440 is 15.
Either that or we should more strictly enforce these buckets at the api level
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 i like the idea of using the ranges based on the dicts.
Yeah, I agree about the API level... it'd be nice if the API enforced the select values in the UI. I'm hoping to get a bit more time in the next quarter to work on some things like that around alerts, I'll add it to my list!
…, we should use the lookup table to set the resolution accordingly
…ution frequency) to the time_window.
58d2b04
to
3ab7ed2
Compare
… the correct resolution.
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #69224 +/- ##
===========================================
+ Coverage 56.16% 79.67% +23.51%
===========================================
Files 6472 6483 +11
Lines 287451 287957 +506
Branches 49572 49637 +65
===========================================
+ Hits 161451 229441 +67990
+ Misses 125624 58145 -67479
+ Partials 376 371 -5
|
…he create / update alert functions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
few quick nit comments
src/sentry/incidents/logic.py
Outdated
@@ -459,7 +460,16 @@ class AlertRuleNameAlreadyUsedError(Exception): | |||
|
|||
# Default values for `SnubaQuery.resolution`, in minutes. | |||
DEFAULT_ALERT_RULE_RESOLUTION = 1 | |||
DEFAULT_CMP_ALERT_RULE_RESOLUTION = 2 | |||
DEFAULT_CMP_ALERT_RULE_RESOLUTION_MULTIPLIER = 2 | |||
DEFAULT_ALERT_RULE_LOAD_SHEDDING_RESOLUTIONS = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we name this something with like WINDOW_RESOLUTION_MAP
or something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, i get that you'd want to change the name, but i'm not sure of that one.
I'd like it to have the same prefix, DEFAULT_ALERT_RULE_
and include RESOLUTION
to match the other variables. Generally i'm not a fan of including the data structure in the variable name like map
, instead using things like to
to illustrate what it's a map of.
DEFAULT_ALERT_RULE_WINDOW_TO_RESOLUTION
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like that!
|
||
if features.has("organizations:metric-alert-load-shedding", organization=organization): | ||
windows = sorted(DEFAULT_ALERT_RULE_LOAD_SHEDDING_RESOLUTIONS.keys()) | ||
index = bisect.bisect_right(windows, time_window) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👀 if we're utilizing index - 1
, does bisect_left work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changing this to bisect_left would also change how the index lookup works;
For example:
>>> import bisect
>>> bisect.bisect_left([1,2,3], 2)
1
>>> bisect.bisect_right([1,2,3], 2)
2
In this case, we'd want the functionality of bisect_right to get the correct result from the dictionary.
@@ -459,7 +460,16 @@ class AlertRuleNameAlreadyUsedError(Exception): | |||
|
|||
# Default values for `SnubaQuery.resolution`, in minutes. | |||
DEFAULT_ALERT_RULE_RESOLUTION = 1 | |||
DEFAULT_CMP_ALERT_RULE_RESOLUTION = 2 | |||
DEFAULT_CMP_ALERT_RULE_RESOLUTION_MULTIPLIER = 2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what does CMP stand for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
comparison?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, fwiw, all i did was add _MULTIPLIER
src/sentry/incidents/logic.py
Outdated
comparison_delta = int(timedelta(minutes=comparison_delta).total_seconds()) | ||
|
||
updated_query_fields["resolution"] = timedelta(minutes=resolution) | ||
updated_fields["comparison_delta"] = comparison_delta | ||
|
||
if time_window and comparison_delta is NOT_SET and alert_rule.comparison_delta is not None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add a comment here explaining this?
"If we're updating the time_window, and the comparison_delta, AND there's an existing comparison delta we're overriding.... then update our resolution"
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
:squint:
wait. Can we just move this and alert_rule.comparison_delta is not None
into the previous if statement?
so we only need to set the resolution once?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nah, i'll add a comment (almost did before i opened the review too, thanks for confirming it's needed!)
also, no we can't just move it into the previous statement because of the gate being comparison_delta is not NOT_SET
vs comparison_delta is NOT_SET
Basically what's happening here is we're updating an alerts time_window. That would mean we need to also update the resolution field, which is happening on line 745. However, it's not taking into account if there is / isn't a comparison at that point (since there's a comparison handler later). (this code likely needs a larger refactor.. and i'd love to DRY up the create_alert_rule and update_alert_rule code, but i think that's out of scope here)
That comparison handler only checks if the comparison has updated.
This statement is saying "If we just updated the resolution because of a change to the time window, make sure that it wasn't previously a comparison query. if it was, use the multiplier on the resolution"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if it would be better to move the resolution logic out separately, rather than including it in the if statements. So something like:
if "comparison_delta" in updated_fields or "time_window" in updated_query_fields:
res_time_window = updated_query_fields.get("time_window", alert_rule.snuba_query.time_window)
resolution = get_alert_resolution(res_time_window, organization=alert_rule.organization)
res_comp_delta = updated_fields.get("comparison_delta", alert_rule.comparison_delta)
if res_comp_delta is not None:
resolution = resolution * DEFAULT_CMP_ALERT_RULE_RESOLUTION_MULTIPLIER
if resolution != alert_rule.snuba_query.resolution:
update_query_fields["resolution"] = resolution
Rough idea, I don't think this code is 100% right but it might work? I think this avoids us having to look at NOT_SET
at all since it's handled in previous steps?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh duh
is NOT_SET
vs. is not NOT_SET
totally missed the lowercase not
lol
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nhsiehgit yooo right? the read ability of not NOT_SET
is tricky.
@wedamija def love that refactor. thanks, i couldn't see the forrest through the trees on that one.
…RULE_WINDOW_TO_RESOLUTION
|
||
if features.has("organizations:metric-alert-load-shedding", organization=organization): | ||
windows = sorted(DEFAULT_ALERT_RULE_WINDOW_TO_RESOLUTION.keys()) | ||
index = bisect.bisect_right(windows, time_window) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice use of bisect here!
src/sentry/incidents/logic.py
Outdated
comparison_delta = int(timedelta(minutes=comparison_delta).total_seconds()) | ||
|
||
updated_query_fields["resolution"] = timedelta(minutes=resolution) | ||
updated_fields["comparison_delta"] = comparison_delta | ||
|
||
if time_window and comparison_delta is NOT_SET and alert_rule.comparison_delta is not None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if it would be better to move the resolution logic out separately, rather than including it in the if statements. So something like:
if "comparison_delta" in updated_fields or "time_window" in updated_query_fields:
res_time_window = updated_query_fields.get("time_window", alert_rule.snuba_query.time_window)
resolution = get_alert_resolution(res_time_window, organization=alert_rule.organization)
res_comp_delta = updated_fields.get("comparison_delta", alert_rule.comparison_delta)
if res_comp_delta is not None:
resolution = resolution * DEFAULT_CMP_ALERT_RULE_RESOLUTION_MULTIPLIER
if resolution != alert_rule.snuba_query.resolution:
update_query_fields["resolution"] = resolution
Rough idea, I don't think this code is 100% right but it might work? I think this avoids us having to look at NOT_SET
at all since it's handled in previous steps?
#69224) ## Description This PR will help address [#inc-719](https://sentry.slack.com/archives/C06V504V789) by reducing load on snuba. This PR will read the `"organizations:metric-alert-load-shedding"` feature and change the resolution of the metric alert according to the [window -> resolution mapping](https://github.com/getsentry/sentry/compare/jcallender/metric-alert-load-shedding?expand=1#diff-54ef12aa467e9e032fcdeb5d98079424d01fb28db982938d9e8db849ae78dd5eR463-R471). [Estimated savings](https://docs.google.com/spreadsheets/d/1maiAp69LHVbEDRQsub4HvL5WpfhS5AWJZ3py6oD27cQ/edit#gid=1827329865) based on this mapping, is about 87% reduction in load.
PR reverted: 68bc6b1 |
Triggering a revert due to some drops in CPU during deployment. It seems unrelated to this change, but reverting to be safe. Will investigate more int he AM. |
Description
This PR will help address #inc-719 by reducing load on snuba. This PR will read the
"organizations:metric-alert-load-shedding"
feature and change the resolution of the metric alert according to the window -> resolution mapping. Estimated savings based on this mapping, is about 87% reduction in load.