Go Scheduler Weirdness #7

Open
michaelorr opened this issue Oct 13, 2016 · 4 comments

Comments

@michaelorr

I was migrating a Jenkins1 job using a Vagrant VM to a Jenkins2 job using a Docker container.
In the process, one of our tests, which had previously run reliably and passed without complaint under Vagrant, became very flaky under Docker and would fail 20-30% of the time.
I narrowed the failure down to a specific check that was timing-sensitive only in the container environment.
We were creating an instance of a struct that represented a Publisher object, which would pull tasks from the work chan. If the Publisher panicked on sending, a deferred handler would clean up the Publisher. Our test created a Publisher, induced a panic by putting a nil pointer on the chan, called runtime.Gosched(), and then watched for the Publisher to clean itself up.
In the Docker container, we would see failures because the Publisher had not been cleaned up.
I noticed that if I printed the struct out, it showed the incorrect (not yet cleaned up) values at about the same rate at which the test had previously failed. Even when the printed struct appeared not cleaned up, though, the test now passed, and fewer failures showed up overall.
So I added a second print statement, and it was clear that the Publisher was getting cleaned up between the first and second print statements.
I changed the prints to a sleep, and the test would work reliably with a sleep as small as 1ms and almost reliably with a sleep as small as 1 nanosecond.
So it seems that runtime.Gosched() doesn't guarantee that other goroutines run to completion before the caller resumes; it only yields the processor so that they may run. We still needed something to incur a blocking call in order for the non-test goroutine to complete. I didn't want to modify production code to satisfy the tests, and we had no other way of watching for completion of the deferred error handler.

The "solution" is recognizing that we were relying on a system test rather than a unit test, and that it had been unreliable all along, only silently so in the VM environment. The plan is to remove the test as it currently exists and add a unit test specifically for the error handler method.
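A unit-level check of the error handler could look roughly like the sketch below. Everything here is an assumption for illustration: `publisher`, `handleError`, `publishOnce`, and `checkHandlerCleansUp` are hypothetical names, not the project's actual types or methods. The point is that the panic and the deferred handler both run synchronously in the calling goroutine, so no scheduling is involved.

```go
package main

import (
	"errors"
	"fmt"
)

// publisher is a hypothetical stand-in for the real Publisher.
type publisher struct {
	open      bool
	bytesSent int
}

// handleError is the hypothetical error handler. It must be deferred
// directly (defer p.handleError()) for recover to see the panic.
func (p *publisher) handleError() {
	if r := recover(); r != nil {
		p.open = false // release resources / mark as cleaned up
	}
}

// publishOnce simulates a single send and panics on a nil task. The deferred
// handler runs before publishOnce returns to its caller.
func (p *publisher) publishOnce(task *string) {
	defer p.handleError()
	p.bytesSent += len(*task) // nil-pointer dereference when task is nil
}

// checkHandlerCleansUp exercises the handler without any goroutines.
func checkHandlerCleansUp() error {
	p := &publisher{open: true}
	p.publishOnce(nil)
	if p.open {
		return errors.New("publisher still open after panic")
	}
	return nil
}

func main() {
	if err := checkHandlerCleansUp(); err != nil {
		fmt.Println("FAIL:", err)
		return
	}
	fmt.Println("ok")
}
```

In a real project the check would more likely live in a `_test.go` file and report through `testing.T`; it is written as a plain function here only so the sketch runs on its own.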

see also: runtime.Gosched(), GOMAXPROCS
https://blog.golang.org/defer-panic-and-recover
https://golang.org/pkg/runtime/#Gosched
https://ariejan.net/2014/08/29/synchronize-goroutines-in-your-tests/
http://stackoverflow.com/questions/31750100/how-do-i-tell-my-test-to-wait-for-a-callback-in-a-goroutine
http://stackoverflow.com/questions/28868596/test-golang-goroutine
https://nathanleclaire.com/blog/2014/02/15/how-to-wait-for-all-goroutines-to-finish-executing-before-continuing/

@calebcase

That is really interesting. It seems very odd that it would work reliably under a VM, but not inside a container. I wonder what induces the difference in runtime behavior.

@michaelorr
Author

I'm still investigating this... I haven't been able to get past the "wtf" phase... In theory I could write this up as a post with open questions, but I would prefer it if I could post it with "I figured it out!" .... thoughts?

@brimstone
Member

I don't think you should post something unsolved that solicits comments, mainly because we don't have a commenting system set up. I'm not against setting up a comment system. You could go that route if you'd like.

@michaelorr
Author

Good point

3 participants