Skip to content
This repository has been archived by the owner on Jan 11, 2023. It is now read-only.

Set health check offset based on last updated time #43

Open
gvso opened this issue Jul 31, 2020 · 2 comments
Open

Set health check offset based on last updated time #43

gvso opened this issue Jul 31, 2020 · 2 comments

Comments

@gvso
Copy link
Contributor

gvso commented Jul 31, 2020

In #38, we introduced rollbacks and integration with the health package. When obtaining the metrics, we passed a offset of how much time in the past we want to look up (e.g. 10 minutes).

The current implementation always look back at a constant number of time (e.g. 10 minutes, 1 hour, etc.). This has the (potential) advantage that it allows us to measure the latency/error rate given an average load time (e.g. 6000 requests in 10 minutes = 10 request/second). However, it also has the disadvantage of potentially taking longer to completely rollout a new revision (or not rolling out at all if an unreachable request count is set). For instance, if the min request count is 6000 and we check with a constant offset of 10 minutes, it can happen that the new revision always get a number of requests that is below 6000 in the last 10 minutes. That is, the candidate never gets more traffic and stays a candidate for a really long time. This is especially more likely when the candidate gets small shares of the traffic (at the beginning of the rollout).

The alternative solution would be to add an annotation about the last time the candidate's traffic was increased, so we can calculate an offset = time.Now() - lastTime. This would basically determine the health based on the accumulated requests since the last roll forward.

@gvso
Copy link
Contributor Author

gvso commented Jul 31, 2020

@ahmetb @grayside Any thoughts on this?

@gvso gvso changed the title Make health check offset based on last updated time Set health check offset based on last updated time Jul 31, 2020
@ahmetb
Copy link
Contributor

ahmetb commented Aug 3, 2020

I think the "we look at last N minutes" approach is easier to wrap your mind around (for users) and easier for us to implement (no need to keep track of last checked date).

For crowded services, setting a decent <min request, last N minutes> pair should be fairly easy for the operator of the service.

After all the point of "min requests" criterion is to ensure there were at least some requests and the monitoring signals aren't anecdotal.

Maybe ship with "default min requests=0" (i.e. no such requirement).

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants