Skip to content

When creating or updating a metric alert and load shedding is enabled… #69224

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 11 commits into from
Apr 24, 2024

Conversation

saponifi3d
Copy link
Contributor

Description

This PR will help address #inc-719 by reducing load on snuba. This PR will read the "organizations:metric-alert-load-shedding" feature and change the resolution of the metric alert according to the window -> resolution mapping. Estimated savings based on this mapping, is about 87% reduction in load.

@github-actions github-actions bot added the Scope: Backend Automatically applied to PRs that change backend components label Apr 18, 2024
Copy link
Member

@wedamija wedamija left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm so far

@@ -105,6 +105,7 @@ def register_temporary_features(manager: FeatureManager):
manager.add("organizations:mep-use-default-tags", OrganizationFeature, FeatureHandlerStrategy.REMOTE)
manager.add("organizations:metric-alert-chartcuterie", OrganizationFeature, FeatureHandlerStrategy.INTERNAL)
manager.add("organizations:metric-alert-ignore-archived", OrganizationFeature, FeatureHandlerStrategy.REMOTE)
manager.add("organizations:metric-alert-load-shedding", OrganizationFeature, FeatureHandlerStrategy.INTERNAL)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Be my guinea pig! Set this to FeatureHandlerStrategy.OPTIONS and follow these instructions https://develop.sentry.dev/feature-flags/#building-your-options-based-feature

What's the plan for how we'll test this though? Are you just wanting to make sure it works ok in sentry before we enable it everywhere?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I was thinking we could roll it out internally first and validate it, and then enable it wider. I wasn't sure if we could enable for all orgs at once like this or not though.

👍 i'll update the flag info stuff though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was looking at making the PR to options automator, but got a little confused from the docs. I think that i can just add this to the option and copy pasta the result from the CLI too. It doesn't look like there are any keys with the prefix organizations though, do i need to prefix with team name or anything there?

Comment on lines 487 to 489
resolution = DEFAULT_ALERT_RULE_LOAD_SHEDDING_RESOLUTIONS.get(
time_window, DEFAULT_ALERT_RULE_RESOLUTION
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be wise to structure this as ranges, so 1-29 is 1, 30-59 is 2, 60-239 is 3, 240-999 is 5, 1000-1440 is 15.

Either that or we should more strictly enforce these buckets at the api level

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 i like the idea of using the ranges based on the dicts.

Yeah, I agree about the API level... it'd be nice if the API enforced the select values in the UI. I'm hoping to get a bit more time in the next quarter to work on some things like that around alerts, I'll add it to my list!

…, we should use the lookup table to set the resolution accordingly
@saponifi3d saponifi3d force-pushed the jcallender/metric-alert-load-shedding branch from 58d2b04 to 3ab7ed2 Compare April 24, 2024 00:10
Copy link

codecov bot commented Apr 24, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 79.67%. Comparing base (76fd454) to head (878f1d9).
Report is 56 commits behind head on master.

❗ Current head 878f1d9 differs from pull request most recent head 983f0fa. Consider uploading reports for the commit 983f0fa to get more accurate results

Additional details and impacted files
@@             Coverage Diff             @@
##           master   #69224       +/-   ##
===========================================
+ Coverage   56.16%   79.67%   +23.51%     
===========================================
  Files        6472     6483       +11     
  Lines      287451   287957      +506     
  Branches    49572    49637       +65     
===========================================
+ Hits       161451   229441    +67990     
+ Misses     125624    58145    -67479     
+ Partials      376      371        -5     
Files Coverage Δ
src/sentry/conf/server.py 89.18% <ø> (ø)
src/sentry/features/temporary.py 100.00% <100.00%> (ø)
src/sentry/incidents/logic.py 95.47% <100.00%> (+68.56%) ⬆️

... and 2002 files with indirect coverage changes

@saponifi3d saponifi3d marked this pull request as ready for review April 24, 2024 05:09
@saponifi3d saponifi3d requested a review from a team as a code owner April 24, 2024 05:09
Copy link
Contributor

@nhsiehgit nhsiehgit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

few quick nit comments

@@ -459,7 +460,16 @@ class AlertRuleNameAlreadyUsedError(Exception):

# Default values for `SnubaQuery.resolution`, in minutes.
DEFAULT_ALERT_RULE_RESOLUTION = 1
DEFAULT_CMP_ALERT_RULE_RESOLUTION = 2
DEFAULT_CMP_ALERT_RULE_RESOLUTION_MULTIPLIER = 2
DEFAULT_ALERT_RULE_LOAD_SHEDDING_RESOLUTIONS = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we name this something with like WINDOW_RESOLUTION_MAP or something?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, i get that you'd want to change the name, but i'm not sure of that one.

I'd like it to have the same prefix, DEFAULT_ALERT_RULE_ and include RESOLUTION to match the other variables. Generally i'm not a fan of including the data structure in the variable name like map, instead using things like to to illustrate what it's a map of.

DEFAULT_ALERT_RULE_WINDOW_TO_RESOLUTION ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like that!


if features.has("organizations:metric-alert-load-shedding", organization=organization):
windows = sorted(DEFAULT_ALERT_RULE_LOAD_SHEDDING_RESOLUTIONS.keys())
index = bisect.bisect_right(windows, time_window)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👀 if we're utilizing index - 1, does bisect_left work?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changing this to bisect_left would also change how the index lookup works;

For example:

>>> import bisect
>>> bisect.bisect_left([1,2,3], 2)
1
>>> bisect.bisect_right([1,2,3], 2)
2

In this case, we'd want the functionality of bisect_right to get the correct result from the dictionary.

@@ -459,7 +460,16 @@ class AlertRuleNameAlreadyUsedError(Exception):

# Default values for `SnubaQuery.resolution`, in minutes.
DEFAULT_ALERT_RULE_RESOLUTION = 1
DEFAULT_CMP_ALERT_RULE_RESOLUTION = 2
DEFAULT_CMP_ALERT_RULE_RESOLUTION_MULTIPLIER = 2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does CMP stand for?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comparison?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, fwiw, all i did was add _MULTIPLIER

comparison_delta = int(timedelta(minutes=comparison_delta).total_seconds())

updated_query_fields["resolution"] = timedelta(minutes=resolution)
updated_fields["comparison_delta"] = comparison_delta

if time_window and comparison_delta is NOT_SET and alert_rule.comparison_delta is not None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a comment here explaining this?

"If we're updating the time_window, and the comparison_delta, AND there's an existing comparison delta we're overriding.... then update our resolution"
?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:squint:

wait. Can we just move this and alert_rule.comparison_delta is not None into the previous if statement?
so we only need to set the resolution once?

Copy link
Contributor Author

@saponifi3d saponifi3d Apr 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nah, i'll add a comment (almost did before i opened the review too, thanks for confirming it's needed!)

also, no we can't just move it into the previous statement because of the gate being comparison_delta is not NOT_SET vs comparison_delta is NOT_SET


Basically what's happening here is we're updating an alerts time_window. That would mean we need to also update the resolution field, which is happening on line 745. However, it's not taking into account if there is / isn't a comparison at that point (since there's a comparison handler later). (this code likely needs a larger refactor.. and i'd love to DRY up the create_alert_rule and update_alert_rule code, but i think that's out of scope here)

That comparison handler only checks if the comparison has updated.

This statement is saying "If we just updated the resolution because of a change to the time window, make sure that it wasn't previously a comparison query. if it was, use the multiplier on the resolution"

Copy link
Member

@wedamija wedamija Apr 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it would be better to move the resolution logic out separately, rather than including it in the if statements. So something like:

if "comparison_delta" in updated_fields or "time_window" in updated_query_fields:
    res_time_window = updated_query_fields.get("time_window", alert_rule.snuba_query.time_window)
    resolution = get_alert_resolution(res_time_window, organization=alert_rule.organization)
    res_comp_delta = updated_fields.get("comparison_delta", alert_rule.comparison_delta)
    if res_comp_delta is not None:
        resolution = resolution * DEFAULT_CMP_ALERT_RULE_RESOLUTION_MULTIPLIER
    if resolution != alert_rule.snuba_query.resolution:
        update_query_fields["resolution"] = resolution   

Rough idea, I don't think this code is 100% right but it might work? I think this avoids us having to look at NOT_SET at all since it's handled in previous steps?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh duh
is NOT_SET vs. is not NOT_SET
totally missed the lowercase not lol

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nhsiehgit yooo right? the read ability of not NOT_SET is tricky.


@wedamija def love that refactor. thanks, i couldn't see the forrest through the trees on that one.


if features.has("organizations:metric-alert-load-shedding", organization=organization):
windows = sorted(DEFAULT_ALERT_RULE_WINDOW_TO_RESOLUTION.keys())
index = bisect.bisect_right(windows, time_window)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice use of bisect here!

comparison_delta = int(timedelta(minutes=comparison_delta).total_seconds())

updated_query_fields["resolution"] = timedelta(minutes=resolution)
updated_fields["comparison_delta"] = comparison_delta

if time_window and comparison_delta is NOT_SET and alert_rule.comparison_delta is not None:
Copy link
Member

@wedamija wedamija Apr 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it would be better to move the resolution logic out separately, rather than including it in the if statements. So something like:

if "comparison_delta" in updated_fields or "time_window" in updated_query_fields:
    res_time_window = updated_query_fields.get("time_window", alert_rule.snuba_query.time_window)
    resolution = get_alert_resolution(res_time_window, organization=alert_rule.organization)
    res_comp_delta = updated_fields.get("comparison_delta", alert_rule.comparison_delta)
    if res_comp_delta is not None:
        resolution = resolution * DEFAULT_CMP_ALERT_RULE_RESOLUTION_MULTIPLIER
    if resolution != alert_rule.snuba_query.resolution:
        update_query_fields["resolution"] = resolution   

Rough idea, I don't think this code is 100% right but it might work? I think this avoids us having to look at NOT_SET at all since it's handled in previous steps?

@saponifi3d saponifi3d merged commit b6e043b into master Apr 24, 2024
48 checks passed
@saponifi3d saponifi3d deleted the jcallender/metric-alert-load-shedding branch April 24, 2024 23:32
MichaelSun48 pushed a commit that referenced this pull request Apr 25, 2024
#69224)

## Description

This PR will help address
[#inc-719](https://sentry.slack.com/archives/C06V504V789) by reducing
load on snuba. This PR will read the
`"organizations:metric-alert-load-shedding"` feature and change the
resolution of the metric alert according to the [window -> resolution
mapping](https://github.com/getsentry/sentry/compare/jcallender/metric-alert-load-shedding?expand=1#diff-54ef12aa467e9e032fcdeb5d98079424d01fb28db982938d9e8db849ae78dd5eR463-R471).
[Estimated
savings](https://docs.google.com/spreadsheets/d/1maiAp69LHVbEDRQsub4HvL5WpfhS5AWJZ3py6oD27cQ/edit#gid=1827329865)
based on this mapping, is about 87% reduction in load.
@saponifi3d saponifi3d added the Trigger: Revert Add to a merged PR to revert it (skips CI) label Apr 25, 2024
@getsentry-bot
Copy link
Contributor

PR reverted: 68bc6b1

getsentry-bot added a commit that referenced this pull request Apr 25, 2024
… enabled… (#69224)"

This reverts commit b6e043b.

Co-authored-by: saponifi3d <1569818+saponifi3d@users.noreply.github.com>
@saponifi3d
Copy link
Contributor Author

Triggering a revert due to some drops in CPU during deployment. It seems unrelated to this change, but reverting to be safe. Will investigate more int he AM.

@saponifi3d saponifi3d restored the jcallender/metric-alert-load-shedding branch April 25, 2024 17:18
@saponifi3d saponifi3d deleted the jcallender/metric-alert-load-shedding branch April 25, 2024 17:50
@github-actions github-actions bot locked and limited conversation to collaborators May 11, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Scope: Backend Automatically applied to PRs that change backend components Trigger: Revert Add to a merged PR to revert it (skips CI)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants