[QUESTION] deadlock/starvation with MSI #2601
Comments
I can test it - do you have a snapshot build somewhere, or should I try to build it?
We tried the master build, but it (again) has the problem of some missing optional package declarations for OSGi (now reported as #2608). If we fix that, we run into other problems, which we are still working on. (Maybe it's easier to backport the timeout for testing.)
We cherry-picked the timeout fix on top of 12.8.1 and it did improve the situation. We need to do more tests to see if it really helped in all cases, but it looks promising. I am curious how often those timeouts happen and why, and also whether it really blocks the common executor for 5s every time, as that might still hurt the system. So do you have plans regarding the use of a dedicated executor (maybe the same one as in the vault case)? That dedicated executor could also be limited in parallelism to be a bit nicer to the token infrastructure, which could be a means to reduce the likelihood of timeouts. Now we only need to fix the OSGi problems we see with the master branch. There are a few packages missing an optional in the import, but we also get a strange error that it can't load the NIO filesystem provider for "bundle:" (yes, that's odd).
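To make the suggestion concrete, here is a minimal sketch of a dedicated, bounded executor for token acquisition with a caller-side timeout. It is not the driver's code: the class name `BoundedTokenExecutor`, the limit of two threads, and the `Supplier<String>` standing in for the actual token call are all assumptions made for illustration.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

// Hypothetical helper, not part of mssql-jdbc or MSAL4J.
final class BoundedTokenExecutor {
    // Assumption: two concurrent token requests are enough; tune as needed.
    private static final int MAX_PARALLEL_TOKEN_REQUESTS = 2;

    // Dedicated daemon threads, so token calls never occupy ForkJoinPool.commonPool().
    private static final ExecutorService TOKEN_POOL =
            Executors.newFixedThreadPool(MAX_PARALLEL_TOKEN_REQUESTS, runnable -> {
                Thread t = new Thread(runnable, "token-acquire");
                t.setDaemon(true);
                return t;
            });

    // tokenSource is a stand-in for the real (possibly slow) token request.
    static String acquire(Supplier<String> tokenSource, long timeoutMillis)
            throws InterruptedException, ExecutionException, TimeoutException {
        CompletableFuture<String> future =
                CompletableFuture.supplyAsync(tokenSource, TOKEN_POOL);
        try {
            // The caller waits at most timeoutMillis, however long the request takes.
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Marks the future as cancelled; it does not interrupt the worker thread.
            future.cancel(true);
            throw e;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(acquire(() -> "dummy-token", 1_000));
    }
}
```

Note that a timed-out request still occupies one of the pool's threads until the underlying call returns, so the thread limit also caps how many stuck requests can pile up.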
Thanks @ecki for verifying and confirming. The OSGi issue is being addressed in #2609. Thank you for your suggestions on improving the token acquisition infrastructure; we have some plans to revisit it. However, the goal of the fix delivered in #2562 was to avoid indefinite blocking due to MSAL calls and to reduce the waiting of multiple simultaneous token acquisition requests when the first request takes longer, while at the same time not overburdening the AAD server.
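For readers following along, the general idea of bounding the wait of simultaneous requests can be sketched as below. This is only an illustration of the pattern, not the code from #2562: the class `SingleFlightTokenCache`, its parameters, and the fixed time-to-live are hypothetical. One thread fetches the token, the others wait on a lock for a limited time and reuse the result, so the token server sees a single request.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Supplier;

// Hypothetical illustration; not the implementation used by the driver.
final class SingleFlightTokenCache {
    // Immutable holder, so readers always see a consistent token/expiry pair.
    private static final class Entry {
        final String token;
        final long expiresAtMillis;

        Entry(String token, long expiresAtMillis) {
            this.token = token;
            this.expiresAtMillis = expiresAtMillis;
        }

        boolean isFresh() {
            return System.currentTimeMillis() < expiresAtMillis;
        }
    }

    private final ReentrantLock lock = new ReentrantLock();
    private volatile Entry cached;

    String getToken(Supplier<String> fetch, long ttlMillis, long maxWaitMillis)
            throws InterruptedException, TimeoutException {
        Entry e = cached;
        if (e != null && e.isFresh()) {
            return e.token; // fast path: no lock, no request
        }
        // Waiters give up after maxWaitMillis instead of blocking indefinitely.
        if (!lock.tryLock(maxWaitMillis, TimeUnit.MILLISECONDS)) {
            throw new TimeoutException("gave up waiting for the in-flight token request");
        }
        try {
            e = cached;
            if (e != null && e.isFresh()) {
                return e.token; // another thread refreshed it while we waited
            }
            String fresh = fetch.get(); // only one thread at a time reaches the server
            cached = new Entry(fresh, System.currentTimeMillis() + ttlMillis);
            return fresh;
        } finally {
            lock.unlock();
        }
    }
}
```

The sketch only bounds how long followers wait; the fetch itself would still need its own timeout to avoid blocking the first caller.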
I can confirm that the timeouts improved the hangs, but those occasional 20s timeouts are pretty bad. Any idea why they are so frequent?
This is on AKS. How often should that token be requested, and is there a cache I need to support in my pool?
Hi @ecki, the timeouts depend on the AAD server and its ability to return the token within the 20 seconds. The token cache is already leveraged as part of the token acquisition process, so you would not need to do anything explicitly for that.
Thanks for the confirmation. I advised our (mutual) end customer to involve Azure support, as I don't think the high error rate they observe can be intended. Maybe it's due to AKS Workload Identity being in the mix, or similar.
Question
In our product (which uses the Karaf OSGi container, and therefore has a somewhat more complex class loader structure, plus a custom connection pool) we see regular hangs on startup on a low-CPU Kubernetes pod. It sometimes resolves after hours, but does not work reliably.
I haven't found the real issue yet; however, we do see a hanging MSI authentication in the common worker pool.
Question: since it is a known issue of MSAL that the common pool can cause problems, and some code in mssql-jdbc already uses a dedicated executor (it looks like the vault/client encryption code uses it?), would it make sense to also specify a pool for the auth?
URL
The loginTimeout does not help.
SQLServerSecurityUtility.getManagedIdentityCredAuthToken
mssql-jdbc/src/main/java/com/microsoft/sqlserver/jdbc/SQLServerSecurityUtility.java
Line 368 in e63814e
Here is a partial stack trace of the "hanging" thread.
I would open a bug report, but our analysis is still early/murky.
We think that, due to the dynamic sizing of the pool, the issue is more prominent on 1-2 vCore pods.
mssql-jdbc/src/main/java/com/microsoft/sqlserver/jdbc/SQLServerMSAL4JUtils.java
Line 124 in e63814e
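To check the pool sizing on a given pod, a small diagnostic like the following can help. It is a sketch using only standard JDK calls; the class name `CommonPoolInfo` is made up here. The common pool's default parallelism is derived from the number of processors the JVM sees (generally one less than that number), so a 1-2 vCore pod ends up with very few common-pool workers unless the `java.util.concurrent.ForkJoinPool.common.parallelism` system property overrides it.

```java
import java.util.concurrent.ForkJoinPool;

// Hypothetical diagnostic helper for inspecting common pool sizing.
final class CommonPoolInfo {
    static String describe() {
        // Number of processors the JVM believes it may use (reflects container limits
        // on container-aware JVMs).
        int cpus = Runtime.getRuntime().availableProcessors();
        // Target parallelism of ForkJoinPool.commonPool().
        int parallelism = ForkJoinPool.getCommonPoolParallelism();
        return "availableProcessors=" + cpus + ", commonPoolParallelism=" + parallelism;
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Running this inside the affected container shows how many common-pool workers are available to be blocked by a single hanging task.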