I was migrating a Jenkins1 job using a Vagrant VM to a Jenkins2 job using a Docker container.
In the process, one of our tests, which had previously (under the Vagrant world) run reliably and passed without complaint, became very flaky in the Docker world and would fail 20-30% of the time.
I narrowed the failure down to a specific check that was timing sensitive only in the container environment.
We were creating an instance of a struct that represented a Publisher object, which would pull tasks from the work chan. If the Publisher panicked on sending, a deferred handler would clean up the Publisher. Our test was creating a Publisher, inducing a panic by putting a nil pointer on the chan, calling runtime.Gosched(), and then watching for the Publisher to clean itself up.
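A minimal sketch of that setup, to make the race easier to see. The real code is not shown here, so every name below (Task, Publisher, NewPublisher, Run, Cleaned, racyCheck) is hypothetical, and the mutex is only there to keep the sketch itself free of data races.

```go
package main

import (
	"runtime"
	"sync"
)

// Task is a hypothetical unit of work; the real type is not shown in the post.
type Task struct{ Payload string }

// Publisher is a hypothetical stand-in for the struct described above.
type Publisher struct {
	mu      sync.Mutex
	cleaned bool
	last    string
	work    chan *Task
}

// NewPublisher returns a Publisher with a small buffered work chan.
func NewPublisher() *Publisher {
	return &Publisher{work: make(chan *Task, 1)}
}

// Run pulls tasks from the work chan. A nil task triggers a nil-pointer
// panic, which the deferred handler recovers from before cleaning up.
func (p *Publisher) Run() {
	defer func() {
		if r := recover(); r != nil {
			p.mu.Lock()
			p.cleaned = true
			p.mu.Unlock()
		}
	}()
	for t := range p.work {
		p.last = t.Payload // panics when t is nil
	}
}

// Cleaned reports whether the deferred handler has run.
func (p *Publisher) Cleaned() bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	return p.cleaned
}

// racyCheck mirrors the shape of the flaky test: nothing forces the
// deferred handler to finish before Cleaned is read, so the result
// can be true or false from run to run.
func racyCheck() bool {
	p := NewPublisher()
	go p.Run()
	p.work <- nil
	runtime.Gosched()
	return p.Cleaned()
}
```

Calling racyCheck() in a loop and counting the false results is one way to see how often the check loses the race on a given machine.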
In the docker container, we would see failures due to the Publisher having not been cleaned up.
I noticed that if I printed the struct out, it showed the incorrect values at about the same rate as the earlier test failures. But even when the printed struct showed as not cleaned up, the test now passed, and there were fewer failures overall.
So I added a second print statement, and it was clear that between the first and second print statements, the Publisher was getting cleaned up.
I changed the prints to a sleep, and it would work reliably with a sleep as small as 1ms and almost reliably with a sleep as small as 1 nanosecond.
So it seems that runtime.Gosched() doesn't guarantee that other goroutines get to finish: it yields the processor, but the calling goroutine can be resumed before the deferred handler has run. We still needed some blocking call in the test so that the non-test goroutine could complete. I didn't want to modify production code to satisfy the tests, and we had no other way of watching for completion of the deferred error handler.
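One test-only alternative to a fixed sleep is to poll the condition with a deadline. This is a sketch, not what we ended up doing: waitFor and cleanedUpInTime are hypothetical names, and the atomic flag stands in for whatever state the real deferred handler changes (which would need to be readable without a data race).

```go
package main

import (
	"sync/atomic"
	"time"
)

// waitFor polls cond once per millisecond until it returns true or the
// timeout elapses. It reports whether cond became true in time.
func waitFor(cond func() bool, timeout time.Duration) bool {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		if cond() {
			return true
		}
		time.Sleep(time.Millisecond)
	}
	return cond()
}

// cleanedUpInTime simulates a goroutine whose deferred handler sets a
// flag a little later, and waits for that flag instead of sleeping for
// a fixed amount of time.
func cleanedUpInTime() bool {
	var cleaned int32
	go func() {
		defer atomic.StoreInt32(&cleaned, 1) // stand-in for the cleanup
		time.Sleep(5 * time.Millisecond)     // stand-in for the work
	}()
	return waitFor(func() bool {
		return atomic.LoadInt32(&cleaned) == 1
	}, time.Second)
}
```

The timeout only bounds how long a failing test takes; a passing test returns as soon as the condition holds.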
The "solution" is recognizing that this was really a system test rather than a unit test, and that it had been unreliable all along, just silently so in the VM environment. The plan is to remove the test as it currently exists and add a unit test specifically for the error handler method.
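A rough sketch of what testing the handler directly could look like, with no goroutines or scheduling involved. The Publisher fields and the handlePanic method are hypothetical; the real handler is not shown here.

```go
package main

import (
	"errors"
	"fmt"
)

// Publisher is a hypothetical stand-in with only the state the handler touches.
type Publisher struct {
	running bool
	lastErr error
}

// handlePanic is a hypothetical error handler: given a recovered panic
// value, it marks the Publisher as stopped and records the cause.
func (p *Publisher) handlePanic(r interface{}) {
	if r == nil {
		return // nothing was recovered, so there is nothing to clean up
	}
	p.running = false
	p.lastErr = fmt.Errorf("publisher panicked: %v", r)
}

// checkHandlePanic calls the handler synchronously and verifies the
// resulting state, so the outcome does not depend on the scheduler.
func checkHandlePanic() error {
	p := &Publisher{running: true}

	p.handlePanic(nil)
	if !p.running || p.lastErr != nil {
		return errors.New("handler changed state without a panic value")
	}

	p.handlePanic("boom")
	if p.running {
		return errors.New("publisher still marked running after a panic")
	}
	if p.lastErr == nil {
		return errors.New("panic value was not recorded")
	}
	return nil
}
```

In a real _test.go file the body of checkHandlePanic would sit inside a TestHandlePanic(t *testing.T) function and report failures through t.Fatal.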
That is really interesting. It seems very odd that it would work reliably under a VM, but not inside a container. I wonder what induces the difference in runtime behavior.
I'm still investigating this... I haven't been able to get past the "wtf" phase... In theory I could write this up as a post with open questions, but I would prefer it if I could post it with "I figured it out!" .... thoughts?
I don't think you should post something unsolved that solicits comments, mainly because we don't have a commenting system set up. I'm not against setting up a comment system. You could go that route if you'd like.
See also: runtime.Gosched(), GOMAXPROCS
https://blog.golang.org/defer-panic-and-recover
https://golang.org/pkg/runtime/#Gosched
https://ariejan.net/2014/08/29/synchronize-goroutines-in-your-tests/
http://stackoverflow.com/questions/31750100/how-do-i-tell-my-test-to-wait-for-a-callback-in-a-goroutine
http://stackoverflow.com/questions/28868596/test-golang-goroutine
https://nathanleclaire.com/blog/2014/02/15/how-to-wait-for-all-goroutines-to-finish-executing-before-continuing/