-
-
Notifications
You must be signed in to change notification settings - Fork 1.7k
Two meson test improvements for running integration tests #14513
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
mesonbuild/mtest.py
Outdated
|
||
# Make sure the termination signal actually kills the process | ||
# group, otherwise retry with a SIGKILL. | ||
with suppress(asyncio.TimeoutError): | ||
await asyncio.wait_for(p.wait(), timeout=0.5) | ||
await asyncio.wait_for(p.wait(), timeout=60) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
0.5 could be too little but can you test what happens if a test does not respond (which can happen, for example, if the test sets a flag on SIGTERM but is stuck so that it never goes back to the main event loop)? Having to wait 60 seconds after Ctrl-C would be a bad experience.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes it would be a bad experience but unlike having a timeout that's too small, this can be fixed in the test by making sure it doesn't ignore SIGTERM. I think we should err on the side of a larger timeout, since a timeout that's too small cannot be dealt with by users, whereas a timeout that's too large can be dealt with by users by fixing the test to not ignore SIGTERM.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not talking about a test ignoring SIGTERM. The most common way to handle SIGTERM is to set a flag and handle the signal in the main event loop, but that won't happen if the task is stuck in an infinite loop. Maybe you could implement it so that when meson test
caught a SIGINT it uses the 0.5 seconds timeout, whereas SIGTERM uses a longer one?
With respect to sending SIGTERM to the parent only... it's complicated. You're sending SIGTERM to the parent to allow the parent process to orchestrate an orderly shutdown—including cleanup that might require communication with children, and it can prevent race conditions where children terminate before the parent can handle necessary cleanup.
However I'm worried that actual code in the wild may not properly handle SIGTERM or child process cleanup, which would lead to orphaned processes in applications that spawn children but don't maintain proper tracking of them. Those children would not get a SIGKILL, they would be reparented to init instead.
I guess we could try that and see if we get complaints, but also I would need to check the history more closely, because I seem to remember that the process group kill was added to fix a report. In principle I'd say that it is a bug for a process to not trap SIGTERM if it spawns processes, and therefore it's okay for the user to send SIGTERM to the process rather than the group; but in practice sending SIGTERM to the group is more robust against implementation bugs and the downside (potentially interrupting cleanup) is usually less severe than the downside of orphaned processes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not talking about a test ignoring SIGTERM. The most common way to handle SIGTERM is to set a flag and handle the signal in the main event loop, but that won't happen if the task is stuck in an infinite loop. Maybe you could implement it so that when meson test caught a SIGINT it uses the 0.5 seconds timeout, whereas SIGTERM uses a longer one?
For daemons sure, but for tests? I'm pretty sure tests are mostly synchronous and generally won't handle SIGTERM at all.
Maybe you could implement it so that when meson test caught a SIGINT it uses the 0.5 seconds timeout, whereas SIGTERM uses a longer one?
I'm pretty sure I want the 60s timeout for both. If a test gets stuck and I use ctrl+c to interrupt it, I still want the logs to be saved and analyzed to figure out to help figure out why the test got stuck in the first place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I dropped the process group change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm pretty sure tests are mostly synchronous and generally won't handle SIGTERM at all.
It depends. Remember that this does not apply to the test executables only, but also to other processes they spawn. For simple unit tests that may be true; but there can be more complicated tests that talk to the project's executable
s, and those binaries under test might very well be handling SIGTERM.
Maybe you could implement it so that when meson test caught a SIGINT it uses the 0.5 seconds timeout, whereas SIGTERM uses a longer one?
I'm pretty sure I want the 60s timeout for both.
60s is definitely too much for interactive use, especially for the "immediate exit" case (triggered in meson test
by pressing Ctrl-C three times in succession). What about adding a new option --sigterm-timeout
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure I wouldn't mind a new option, I'll do that
meson will send SIGTERM if the test gets stuck and hits the timeout, in which case we still want to do log saving and analysis, so let's add some signal handlers which allow us to do that. This won't be very useful until mesonbuild/meson#14513 lands, since we only get half a second from meson to handle SIGTERM before it sends SIGKILL, but let's land this already so we immediately start taking advantage of the meson fix once it lands.
When using meson to run integration tests and an integration test times out, we still want the opportunity to analyze and save the logs from the integration test so we can figure out why the test got stuck. Currently, meson only gives test processes half a second to handle SIGTERM before it sends SIGKILL, which is not sufficient time to analyze all the logs and do all other cleanup. Let's bump the timeout after sending SIGTERM to 60s, so that test processes that wish to do cleanup get some time to do so. This only has an effect if the test process installs a custom SIGTERM handler. If the test process doesn't do so, it'll get exited immediately like before.
meson will send SIGTERM if the test gets stuck and hits the timeout, in which case we still want to do log saving and analysis, so let's add some signal handlers which allow us to do that. This won't be very useful until mesonbuild/meson#14513 lands, since we only get half a second from meson to handle SIGTERM before it sends SIGKILL, but let's land this already so we immediately start taking advantage of the meson fix once it lands.
Looks good to me, thanks. If you want you can add release notes in docs/markdown/snippets too. |
No description provided.