fix: don't crash in cache refresh/update with nil fsnotify watcher. #254

Open
wants to merge 2 commits into
base: main

klihub
Contributor

@klihub klihub commented Feb 22, 2025

Don't crash if fsnotify.Watcher creation fails, for instance due to being out of file descriptors. See moby/buildkit#5767 for an example.

Fixes #253.

@klihub klihub requested review from elezar and bart0sh February 22, 2025 15:09
Signed-off-by: Krisztian Litkey <krisztian.litkey@intel.com>
Don't crash in update() if we fail to create an fsnotify watch.
This can happen if we have too many open files. In this case we
now record a failure for all configured spec directories, and
update() always triggers a refresh. If the process is ever able
to create new file descriptors, the cache becomes functional, but
in an 'always implicitly fully refreshed' mode instead of
auto-refreshed.

It's not entirely clear what the best option is for dealing with
a failed watch creation. Being out of file descriptors typically
results in a cascading chain of errors which the process does
not usually survive.

This fix aims for a minimal footprint. On failed watch creation
it does not render the cache fully unusable. If the process is
ever able to create new file descriptors again, the cache also
becomes functional, but instead of auto-refreshed mode it will
be in an 'always implicitly fully refreshed' mode.

Signed-off-by: Krisztian Litkey <krisztian.litkey@intel.com>
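
The behavior described in the commit messages can be sketched roughly as follows. This is a simplified illustration, not the code from this PR: `fileWatcher` is a hypothetical stand-in for `*fsnotify.Watcher`, and `watch`, `newWatch`, `needsRefresh`, `failCreate` and `okCreate` are names made up for the example. The directories passed in `main` are only sample values.

```go
package main

import (
	"errors"
	"fmt"
)

// fileWatcher is a hypothetical stand-in for *fsnotify.Watcher. Only whether
// it is nil matters in this sketch.
type fileWatcher struct{}

// watch loosely mirrors the idea of the cache's watch helper. The type and
// its fields are illustrative assumptions, not the actual cdi package API.
type watch struct {
	watcher *fileWatcher
	dirErrs map[string]error // failures recorded per spec directory
}

var errTooManyFiles = errors.New("too many open files")

// failCreate simulates watcher creation failing, e.g. with EMFILE.
func failCreate() (*fileWatcher, error) { return nil, errTooManyFiles }

// okCreate simulates successful watcher creation.
func okCreate() (*fileWatcher, error) { return &fileWatcher{}, nil }

// newWatch tries to create a watcher with the supplied constructor. On
// failure it records the error for every spec directory and leaves the
// watcher nil instead of giving up on the cache altogether.
func newWatch(create func() (*fileWatcher, error), dirs ...string) *watch {
	w := &watch{dirErrs: make(map[string]error)}
	fw, err := create()
	if err != nil {
		for _, dir := range dirs {
			w.dirErrs[dir] = fmt.Errorf("failed to create watcher: %w", err)
		}
		return w
	}
	w.watcher = fw
	return w
}

// needsRefresh reports whether the cache should be rescanned. Without a
// watcher there is no way to tell whether the directories changed, so the
// answer is always true. With a watcher, the pendingEvents flag decides
// (real event handling is elided here).
func (w *watch) needsRefresh(pendingEvents bool) bool {
	if w.watcher == nil {
		return true
	}
	return pendingEvents
}

func main() {
	w := newWatch(failCreate, "/etc/cdi", "/var/run/cdi")
	fmt.Println(w.needsRefresh(false), len(w.dirErrs)) // prints "true 2"
}
```

The trade-off in this sketch is the same one the commit message names: a nil watcher costs a full rescan on every call, but the cache stays usable.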
// (but with autoRefresh left on). One known case when this can happen is
// if we have too many open files. In that case we always return true and
// force a refresh.
if w.watcher == nil {
Contributor

@bart0sh bart0sh Feb 23, 2025

Is it possible to check this earlier, e.g. when w.watcher is about to be set to nil? Or do we want to keep it nil in the hope that it will change at some point?

@@ -812,6 +814,102 @@ devices:
}
}

func TestTooManyOpenFiles(t *testing.T) {
if runtime.GOOS != "linux" {
Contributor

Would it make sense to move this test to cache_test_linux.go? Would it work on other *nixes or Darwin?

Comment on lines +868 to +911
func triggerEmfile() (*emfile, error) {
	fdsize, err := getProcStatusFdSize()
	if err != nil {
		return nil, err
	}

	em := &emfile{}

	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &em.limit); err != nil {
		return nil, err
	}

	limit := em.limit
	limit.Cur = fdsize

	if err := syscall.Setrlimit(syscall.RLIMIT_NOFILE, &limit); err != nil {
		return nil, err
	}

	for i := uint64(0); i < fdsize; i++ {
		fd, err := syscall.Socket(syscall.AF_INET, syscall.SOCK_DGRAM, 0)
		if err != nil {
			return em, nil
		}
		em.fds = append(em.fds, fd)
	}

	return nil, fmt.Errorf("failed to trigger EMFILE")
}

func (em *emfile) undo() error {
	if em == nil {
		return nil
	}

	if err := syscall.Setrlimit(syscall.RLIMIT_NOFILE, &em.limit); err != nil {
		return err
	}
	for _, fd := range em.fds {
		syscall.Close(fd)
	}

	return nil
}
Contributor

It would be good to have some description/comments for this code.
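
For context, the quoted helper calls `getProcStatusFdSize()`, whose body is not part of this excerpt. On Linux, `/proc/self/status` contains an `FDSize:` line giving the number of file descriptor slots currently allocated for the process. Lowering the soft `RLIMIT_NOFILE` to that value and then opening sockets until one fails is what forces the "too many open files" condition. The following is a minimal sketch of how such a value could be read, assuming that is what the helper does. `parseFdSize` is a hypothetical name and not the PR's implementation.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseFdSize extracts the FDSize value from the text of a
// /proc/<pid>/status file. It is a hypothetical helper written for this
// illustration only.
func parseFdSize(status string) (uint64, error) {
	scanner := bufio.NewScanner(strings.NewReader(status))
	for scanner.Scan() {
		line := scanner.Text()
		if !strings.HasPrefix(line, "FDSize:") {
			continue
		}
		// The value follows the key, separated by whitespace.
		value := strings.TrimSpace(strings.TrimPrefix(line, "FDSize:"))
		return strconv.ParseUint(value, 10, 64)
	}
	if err := scanner.Err(); err != nil {
		return 0, err
	}
	return 0, fmt.Errorf("FDSize not found in status")
}

func main() {
	// /proc is Linux-specific; on other systems this read simply fails.
	data, err := os.ReadFile("/proc/self/status")
	if err != nil {
		fmt.Println("cannot read /proc/self/status:", err)
		return
	}
	size, err := parseFdSize(string(data))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("FDSize:", size)
}
```

Keeping the parsing separate from the file read makes the helper testable with a fixed string, independent of the state of the running process.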

require.NotNil(t, cache)

// try to trigger original crash with a nil fsnotify.Watcher
_, _ = cache.InjectDevices(&oci.Spec{}, "vendor1.com/device=dev1")
Contributor

nit:

Suggested change
_, _ = cache.InjectDevices(&oci.Spec{}, "vendor1.com/device=dev1")
devs := []string{"vendor1.com/device=dev1"}
devices, err := cache.InjectDevices(&oci.Spec{}, devs...)
require.Error(t, err)
require.EqualValues(t, devs, devices)

Would it make sense to test other cache APIs as well?
