Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

implement crashLoopBackOff for failed processes #64

Merged
merged 7 commits into from
Aug 10, 2023
Merged
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
12 changes: 12 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -119,6 +119,18 @@ job "foo" {
}
```

Set `maxAttempts` to `-1` will restart the process "forever".

```hcl
job "foo" {
command = "/usr/local/bin/foo"
args = ["bar"]
maxAttempts = -1
canFail = false
workingDirectory = "/some/path"
}
```

You can append a custom environment to the process by setting `env`:

```hcl
Expand Down
17 changes: 17 additions & 0 deletions pkg/proc/crashloopbackoff.go
Original file line number Diff line number Diff line change
@@ -0,0 +1,17 @@
package proc

import (
"time"
)

// calculates the next backOff based on the current backOff
func calculateNextBackOff(currBackOff, maxBackoff time.Duration) time.Duration {
if currBackOff.Seconds() <= 1 {
return 2 * time.Second
}
next := time.Duration(2*currBackOff.Seconds()) * time.Second
if next.Seconds() > maxBackoff.Seconds() {
return maxBackoff
}
return next
}
33 changes: 30 additions & 3 deletions pkg/proc/job_common.go
Original file line number Diff line number Diff line change
Expand Up @@ -14,6 +14,13 @@ import (
log "github.com/sirupsen/logrus"
)

const (
// longest duration between two restarts
maxBackOff = 300 * time.Second
// retries before next backOff interval is used
maxBackOffRetries = 5
)

func (job *CommonJob) Init() {
job.restart = false
job.stop = false
Expand Down Expand Up @@ -48,10 +55,17 @@ func (job *CommonJob) Run(ctx context.Context, _ chan<- error) error {
attempts := 0
maxAttempts := job.Config.MaxAttempts

if maxAttempts <= 0 {
backOff := 1 * time.Second
backOffRetries := 0

if maxAttempts == 0 {
maxAttempts = 3
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also (I know from the diff that that's not new behaviour, but still) it's come to my attention that it's annoying for some users (see #62) that you cannot actually set maxAttempts to 0 (because that counts as "not set", and will default to 3). Should we maybe think about solving this differently?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, there are some possibilities that make sense in my opinion. But I can't decide which one is the "best":

  1. the current method from this pr since by default I still have 3 tries if maxAttempts is not set. With -1 you can set infinite.

  2. maxAttempts = 0 means infinite. Against this, however, is that the behavior changes to before and that could possibly be unexpected as well. In addition, you cannot set the job to be executed only once.

  3. maxAttempts = ' actually means 0 attempts. Then infinity would still mean -1. I think this is the nicest solution, but it is also not backward compatible.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would also lean towards option 3; one thing I could (possibly) think of would be to change the maxAttempts type in the configuration *int (if supported by the HCL parser), so that we could differentiate between "not set" (maxAttempts == nil) and "explicitly set to 0" (*maxAttempts == 0)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea! Done.


if maxAttempts < 0 {
maxAttempts = -1
}

p := make(chan *os.Process)
defer close(p)

Expand Down Expand Up @@ -83,9 +97,22 @@ func (job *CommonJob) Run(ctx context.Context, _ chan<- error) error {
}

attempts++
if attempts < maxAttempts {
if maxAttempts == -1 || attempts < maxAttempts {
currBackOff := backOff
backOffRetries++
if backOffRetries > maxBackOffRetries {
backOff = calculateNextBackOff(currBackOff, maxBackOff)
backOffRetries = 0
}

job.phase.Set(JobPhaseReasonCrashLooping)
l.WithField("job.maxAttempts", maxAttempts).WithField("job.usedAttempts", attempts).Info("remaining attempts")
l.
WithField("job.maxAttempts", maxAttempts).
WithField("job.usedAttempts", attempts).
WithField("job.nextRestartIn", currBackOff.String()).
Info("remaining attempts")

time.Sleep(currBackOff)
continue
}

Expand Down