Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[QUESTION] deadlock/starvation with MSI #2601

Open
ecki opened this issue Feb 11, 2025 · 8 comments
Open

[QUESTION] deadlock/starvation with MSI #2601

ecki opened this issue Feb 11, 2025 · 8 comments
Assignees
Labels
Question Used when a question is asked, as opposed to an issue being raised Waiting for Response Waiting for a reply from the original poster, or affiliated party

Comments

@ecki
Copy link
Contributor

ecki commented Feb 11, 2025

Question

In our product (which uses the Karaf OSGI container (has therefore a bit more complex class loader structure) custom connection pool) we see regular hangs on startup on a low-cpu Kubernetes pod. It resolves sometimes after hours but does not work reliable.

I havent found the real issue yet, however we do see a hanging MSI authentication in the common worker pool.

Question: since it is a known issue of MSAL that the common pool can cause problems - and some code of the msjdbc already uses a dedicated executor (looks like vaul/client encryption uses it?), would it make sense to also specify a pool for the auth?

URL

jdbc:sqlserver://<redacted>.database.windows.net:1433;database=<redacted>;
    authentication=ActiveDirectoryManagedIdentity;user=<redacted>;
    sendStringParametersAsUnicode=false;encrypt=true;trustServerCertificate=false;
    hostNameInCertificate=*.database.windows.net;loginTimeout=30

The loginTimeout does not help.

SQLServerSecurityUtility.getManagedIdentityCredAuthToken

Here is a partial stack trace of the "hanging" thread.

"ForkJoinPool.commonPool-worker-1" #122 [709] daemon prio=5 os_prio=0 cpu=272.54ms elapsed=936.64s tid=0x00007f955392b160 nid=709 waiting on condition  [0x00007f94e6fec000]
   java.lang.Thread.State: WAITING (parking)
	at jdk.internal.misc.Unsafe.park(java.base@21.0.4/Native Method)
	- parking to wait for  <0x0000000690a0c728> (a java.util.concurrent.CountDownLatch$Sync)
	at java.util.concurrent.locks.LockSupport.park(java.base@21.0.4/LockSupport.java:221)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(java.base@21.0.4/AbstractQueuedSynchronizer.java:754)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(java.base@21.0.4/AbstractQueuedSynchronizer.java:1099)
	at java.util.concurrent.CountDownLatch.await(java.base@21.0.4/CountDownLatch.java:230)
	at reactor.core.publisher.BlockingOptionalMonoSubscriber.blockingGet(BlockingOptionalMonoSubscriber.java:112)
	at reactor.core.publisher.Mono.blockOptional(Mono.java:1831)
	at com.microsoft.sqlserver.jdbc.SQLServerSecurityUtility.getManagedIdentityCredAuthToken(SQLServerSecurityUtility.java:382)
	at com.microsoft.sqlserver.jdbc.SQLServerConnection.getFedAuthToken(SQLServerConnection.java:6041)
	at com.microsoft.sqlserver.jdbc.SQLServerConnection.onFedAuthInfo(SQLServerConnection.java:5995)
	at com.microsoft.sqlserver.jdbc.SQLServerConnection.processFedAuthInfo(SQLServerConnection.java:5829)
	at com.microsoft.sqlserver.jdbc.TDSTokenHandler.onFedAuthInfo(tdsparser.java:335)
	at com.microsoft.sqlserver.jdbc.TDSParser.parse(tdsparser.java:130)
	at com.microsoft.sqlserver.jdbc.TDSParser.parse(tdsparser.java:42)
	at com.microsoft.sqlserver.jdbc.SQLServerConnection.sendLogon(SQLServerConnection.java:6888)
	at com.microsoft.sqlserver.jdbc.SQLServerConnection.logon(SQLServerConnection.java:5434)
	at com.microsoft.sqlserver.jdbc.SQLServerConnection$LogonCommand.doExecute(SQLServerConnection.java:5366)
	at com.microsoft.sqlserver.jdbc.TDSCommand.execute(IOBuffer.java:7745)
	at com.microsoft.sqlserver.jdbc.SQLServerConnection.executeCommand(SQLServerConnection.java:4391)
	at com.microsoft.sqlserver.jdbc.SQLServerConnection.connectHelper(SQLServerConnection.java:3828)
	at com.microsoft.sqlserver.jdbc.SQLServerConnection.login(SQLServerConnection.java:3385)
	at com.microsoft.sqlserver.jdbc.SQLServerConnection.connectInternal(SQLServerConnection.java:3194)
	at com.microsoft.sqlserver.jdbc.SQLServerConnection.connect(SQLServerConnection.java:1971)
	at com.microsoft.sqlserver.jdbc.SQLServerDriver.connect(SQLServerDriver.java:1263)

I would open a bug report, but our analysis is still early/murky.

We think due to the dynamic sizing of the pool the issue is more prominent on 1-2 vCore Pods.

@github-project-automation github-project-automation bot moved this to To be triaged in MSSQL JDBC Feb 11, 2025
@machavan machavan self-assigned this Feb 11, 2025
@machavan machavan added the Under Investigation Used for issues under investigation label Feb 11, 2025
@Jeffery-Wasty Jeffery-Wasty moved this from To be triaged to Under Investigation in MSSQL JDBC Feb 11, 2025
@Jeffery-Wasty Jeffery-Wasty added the Question Used when a question is asked, as opposed to an issue being raised label Feb 11, 2025
@machavan
Copy link
Contributor

Hi @ecki

It looks like we have addressed this issue in PR : #2562, by adding timeout to the token acquisition calls.

Can you please check and confirm?

@machavan machavan added Waiting for Response Waiting for a reply from the original poster, or affiliated party and removed Under Investigation Used for issues under investigation labels Feb 12, 2025
@Jeffery-Wasty Jeffery-Wasty moved this from Under Investigation to Waiting for Customer in MSSQL JDBC Feb 12, 2025
@ecki
Copy link
Contributor Author

ecki commented Feb 14, 2025

I can test it - do you have a snapshot build somewhere, or should I try to build it?

@ecki
Copy link
Contributor Author

ecki commented Feb 17, 2025

We tried the master build but it has (again) the problem of some missing optional package declarations for OSGi [ - reported as #2608 now ], However if we fix that we get other problems, still on it. (Maybe it’s easier to backport the timeout for testing).

@ecki
Copy link
Contributor Author

ecki commented Feb 17, 2025

We cherry-picked the timeout fix on top of 12.8.1 and it did improve the situation. We need to do more tests to see if it really helped in all cases, but that looks promising.

I am curious on how often those timeouts happen and why, and also if it really blocks the common executor for 5s every time, that might still hurt the system. So do you have plans in regarrds to using a dedicated executor? (maybe use the same as the vault-case?). Maybe that dedicated executur can also be limited in parallelity to be a bit nicer to the token infrastructure - could be a mean to reduce the timeout likelyhood?

Now we only need to fix the OSGi problems we see with the master branch. There are a few packages missing an optional in the import, but also we get a strange error that it cant load the nio filesystem provider for "bundle:" (yes thats odd).

@machavan
Copy link
Contributor

Thanks @ecki for verifying and confirming.

The OSGi issue is being addressed in #2609

Thank you for your suggestions on improvement on token acquisition infrastructure, we have some plans on revisiting it. However, the current goal in the fix delivered in : #2562 was to avoid indefinite blocking due to MSAL calls and reduce the waiting of multiple simultaneous token acquisition requests in case the first request takes longer, but at the same time not overburden the AAD server.

@ecki
Copy link
Contributor Author

ecki commented Feb 19, 2025

I can confirm, the timeouts improved the hangs, but those occasional 20s timeouts are pretty bad. Any idea why they are so frequent?

Caused by: reactor.core.Exceptions$ReactiveException: java.util.concurrent.TimeoutException: Did not observe any item or terminal signal within 20000ms in 'peekTerminal' (and no fallback has been configured)
        at reactor.core.Exceptions.propagate(Exceptions.java:410)
        at reactor.core.publisher.BlockingOptionalMonoSubscriber.blockingGet(BlockingOptionalMonoSubscriber.java:127)
        at reactor.core.publisher.Mono.blockOptional(Mono.java:1831)
        at com.microsoft.sqlserver.jdbc.SQLServerSecurityUtility.getManagedIdentityCredAuthToken(SQLServerSecurityUtility.java:386)
        at com.microsoft.sqlserver.jdbc.SQLServerConnection.getFedAuthToken(SQLServerConnection.java:6047)
        at com.microsoft.sqlserver.jdbc.SQLServerConnection.onFedAuthInfo(SQLServerConnection.java:5995)
        at com.microsoft.sqlserver.jdbc.SQLServerConnection.processFedAuthInfo(SQLServerConnection.java:5829)
        at com.microsoft.sqlserver.jdbc.TDSTokenHandler.onFedAuthInfo(tdsparser.java:335)
        at com.microsoft.sqlserver.jdbc.TDSParser.parse(tdsparser.java:130)
        at com.microsoft.sqlserver.jdbc.TDSParser.parse(tdsparser.java:42)
        at com.microsoft.sqlserver.jdbc.SQLServerConnection.sendLogon(SQLServerConnection.java:6889)
        at com.microsoft.sqlserver.jdbc.SQLServerConnection.logon(SQLServerConnection.java:5434)
        at com.microsoft.sqlserver.jdbc.SQLServerConnection$LogonCommand.doExecute(SQLServerConnection.java:5366)
        at com.microsoft.sqlserver.jdbc.TDSCommand.execute(IOBuffer.java:7745)
        at com.microsoft.sqlserver.jdbc.SQLServerConnection.executeCommand(SQLServerConnection.java:4391)
        at com.microsoft.sqlserver.jdbc.SQLServerConnection.connectHelper(SQLServerConnection.java:3828)
        at com.microsoft.sqlserver.jdbc.SQLServerConnection.login(SQLServerConnection.java:3385)
        at com.microsoft.sqlserver.jdbc.SQLServerConnection.connectInternal(SQLServerConnection.java:3194)
        at com.microsoft.sqlserver.jdbc.SQLServerConnection.connect(SQLServerConnection.java:1971)
        at com.microsoft.sqlserver.jdbc.SQLServerDriver.connect(SQLServerDriver.java:1263)
        at java.sql/java.sql.DriverManager.getConnection(DriverManager.java:683)
        at java.sql/java.sql.DriverManager.getConnection(DriverManager.java:191)
        at com….ConnectionPool.getDriverConnection(ConnectionPool.java:1345)

This is on AKS.

How often should that token be requested, is there a cache I need to support in my pool?

@machavan
Copy link
Contributor

Hi @ecki

The timeouts would depend on the AAD server and its ability to return back the token within the 20 seconds. The token cache is already leveraged as part of the token acquisition process, you would need to do anything explicitly for that.

@ecki
Copy link
Contributor Author

ecki commented Feb 28, 2025

The timeouts would depend on the AAD server and its ability to return back the token

Thanks for confirmation. I adviced our (mutual) end Customer to involve Azure support, As I don’t think that high error rate they observer can be intended - maybe its due to AKS Workload Idendity being in the mix or similar.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Question Used when a question is asked, as opposed to an issue being raised Waiting for Response Waiting for a reply from the original poster, or affiliated party
Projects
Status: Waiting for Customer
Development

No branches or pull requests

3 participants