
Do not interrupt underlying Azure repository threads during errors #99320


Open

fcofdez wants to merge 2 commits into main

Conversation

@fcofdez (Contributor) commented Sep 7, 2023

We subscribe on a different thread to read from the blocking input stream in order to avoid blocking the Azure client event loop. When the connection is dropped, Reactor interrupts the thread where the input stream is being read in order to cancel the task promptly. This causes issues and adds confusing error messages to the exception chain, hiding important details.
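
For context, the fix wraps the futures created by the repository's executor so that cancelling a task never interrupts the thread running it (see the UninterruptibleFuture / newTaskFor hunks quoted in the review below). A minimal sketch of that pattern, built here on a plain ThreadPoolExecutor rather than the actual ReactorScheduledExecutorService, with illustrative class and constructor details:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.RunnableFuture;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Executor whose futures ignore the mayInterruptIfRunning flag, so cancelling a
// pending read task never interrupts the thread that is blocked on the read.
class UninterruptibleExecutor extends ThreadPoolExecutor {

    UninterruptibleExecutor(int threads) {
        super(threads, threads, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
    }

    @Override
    protected <T> RunnableFuture<T> newTaskFor(Runnable runnable, T value) {
        return new UninterruptibleFuture<>(super.newTaskFor(runnable, value));
    }

    @Override
    protected <T> RunnableFuture<T> newTaskFor(Callable<T> callable) {
        return new UninterruptibleFuture<>(super.newTaskFor(callable));
    }

    private static final class UninterruptibleFuture<T> implements RunnableFuture<T> {
        private final RunnableFuture<T> delegate;

        UninterruptibleFuture(RunnableFuture<T> delegate) {
            this.delegate = delegate;
        }

        @Override
        public boolean cancel(boolean mayInterruptIfRunning) {
            // Ensure that the thread is never interrupted
            return delegate.cancel(false);
        }

        @Override
        public void run() {
            delegate.run();
        }

        @Override
        public boolean isCancelled() {
            return delegate.isCancelled();
        }

        @Override
        public boolean isDone() {
            return delegate.isDone();
        }

        @Override
        public T get() throws InterruptedException, ExecutionException {
            return delegate.get();
        }

        @Override
        public T get(long timeout, TimeUnit unit) throws InterruptedException, ExecutionException, TimeoutException {
            return delegate.get(timeout, unit);
        }
    }
}
```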

@fcofdez added the >bug, :Distributed Coordination/Snapshot/Restore, Team:Distributed (Obsolete), v8.11.0 and v8.10.1 labels on Sep 7, 2023
@elasticsearchmachine (Collaborator)
Pinging @elastic/es-distributed (Team:Distributed)

@elasticsearchmachine (Collaborator)
Hi @fcofdez, I've created a changelog YAML for you.

@@ -472,4 +474,32 @@ public void testRetryFromSecondaryLocationPolicies() throws Exception {
assertThat(failedGetCalls.get(), equalTo(1));
}
}

public void testPrematureClosedConnectionDoesNotInterruptBackingThread() throws Exception {
Contributor
Could you please point to the place where you simulate thread.interrupt()? Is it missing, or is the interruption somehow implicit?

Contributor

The interrupt comes from the future being cancelled due to the premature connection close set up in the httpServer.createContext line; if you undo the fix (e.g. by applying the following patch) then the test fails:

diff --git a/modules/repository-azure/src/main/java/org/elasticsearch/repositories/azure/executors/ReactorScheduledExecutorService.java b/modules/repository-azure/src/main/java/org/elasticsearch/repositories/azure/executors/ReactorScheduledExecutorService.java
index f621cfe3e979..4c2c378acb12 100644
--- a/modules/repository-azure/src/main/java/org/elasticsearch/repositories/azure/executors/ReactorScheduledExecutorService.java
+++ b/modules/repository-azure/src/main/java/org/elasticsearch/repositories/azure/executors/ReactorScheduledExecutorService.java
@@ -202,7 +202,7 @@ public class ReactorScheduledExecutorService extends AbstractExecutorService imp
         @Override
         public boolean cancel(boolean mayInterruptIfRunning) {
             // Ensure that the thread is never interrupted
-            return delegate.cancel(false);
+            return delegate.cancel(mayInterruptIfRunning);
         }

         @Override
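
For illustration, a premature close of this kind could be simulated with the JDK's built-in com.sun.net.httpserver server along the following lines; the context path and handler body here are hypothetical and are not the actual test code from this PR:

```java
import com.sun.net.httpserver.HttpServer;

import java.net.InetSocketAddress;

public class PrematureCloseServer {
    public static void main(String[] args) throws Exception {
        HttpServer httpServer = HttpServer.create(new InetSocketAddress(0), 0);
        // Hypothetical handler: consume a little of the upload, then close the exchange
        // abruptly so the client sees the connection drop and cancels its pending read task.
        httpServer.createContext("/container/blob", exchange -> {
            exchange.getRequestBody().read(new byte[1024]);
            exchange.close();
        });
        httpServer.start();
        System.out.println("listening on port " + httpServer.getAddress().getPort());
    }
}
```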

@volodk85 (Contributor) left a comment

Interrupting the read thread when the connection is dropped sounds like a good thing: why wait for it if we won't succeed anyway? Maybe I'm missing some context; could you please elaborate?

Comment on lines +129 to +136
    protected <T> RunnableFuture<T> newTaskFor(Runnable runnable, T value) {
        return new UninterruptibleFuture<>(super.newTaskFor(runnable, value));
    }

    @Override
    protected <T> RunnableFuture<T> newTaskFor(Callable<T> callable) {
        return new UninterruptibleFuture<>(super.newTaskFor(callable));
    }
Contributor

Are these creating a new task for each read operation? Just thinking about how promptly a read might be cancelled without the interrupt.

Contributor Author

Yes, see org.elasticsearch.repositories.azure.AzureBlobStore#convertStreamToByteBuffer. We're reading from the input stream in 64Kb chunks and emitting them in order, meaning that after the pipeline has been cancelled we'll need to wait at most until we're able to read 64Kb from disk.
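
A minimal sketch of that chunked-read pattern, assuming a Reactor Flux.generate loop and the boundedElastic scheduler as a stand-in for the repository's own executor; this is illustrative, not the actual convertStreamToByteBuffer implementation:

```java
import reactor.core.publisher.Flux;
import reactor.core.scheduler.Schedulers;

import java.io.InputStream;
import java.nio.ByteBuffer;

final class StreamChunks {
    private static final int CHUNK_SIZE = 64 * 1024; // 64Kb, matching the chunk size mentioned above

    // Emit the stream as ordered 64Kb ByteBuffers; the blocking read() calls run on the
    // scheduler the Flux is subscribed on, not on the Azure client event loop. Once the
    // pipeline is cancelled, the generator is simply not invoked again, so at most one
    // in-flight 64Kb read has to finish.
    static Flux<ByteBuffer> toByteBuffers(InputStream in) {
        return Flux.<ByteBuffer>generate(sink -> {
            try {
                byte[] chunk = new byte[CHUNK_SIZE];
                int read = in.read(chunk, 0, CHUNK_SIZE);
                if (read == -1) {
                    sink.complete();
                } else {
                    sink.next(ByteBuffer.wrap(chunk, 0, read));
                }
            } catch (Exception e) {
                sink.error(e);
            }
        }).subscribeOn(Schedulers.boundedElastic());
    }
}
```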

@fcofdez (Contributor Author) commented Sep 8, 2023

> Interrupting the read thread when the connection is dropped sounds like a good thing: why wait for it if we won't succeed anyway? Maybe I'm missing some context; could you please elaborate?

When org.apache.lucene.store.NIOFSDirectory.NIOFSIndexInput is used to read files from disk and the thread reading from the input is interrupted, the underlying FileChannel is closed and the next read throws an exception. This shouldn't be a problem in itself, but since that read runs asynchronously, it adds unnecessary noise to the logs and hides the real cause of the issue.

@fcofdez (Contributor Author) commented Sep 11, 2023

Additionally, if the underlying FileChannel is closed due to the interruption, retries won't succeed since they use the same underlying file channel.
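
To illustrate both points, here is a small standalone demo (hypothetical code, not from this PR or its tests): a pending interrupt makes the next FileChannel read throw ClosedByInterruptException and closes the channel as a side effect, after which any retry against the same channel fails with ClosedChannelException.

```java
import java.nio.ByteBuffer;
import java.nio.channels.ClosedByInterruptException;
import java.nio.channels.ClosedChannelException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class InterruptClosesChannel {
    public static void main(String[] args) throws Exception {
        Path file = Files.createTempFile("interrupt-demo", ".bin");
        Files.write(file, new byte[1024]);
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            Thread.currentThread().interrupt();          // pending interrupt, e.g. from a cancelled task
            try {
                channel.read(ByteBuffer.allocate(64));   // FileChannel is an interruptible channel
            } catch (ClosedByInterruptException e) {
                System.out.println("read failed: " + e); // the channel is now closed
            }
            Thread.interrupted();                        // clear the flag; the channel stays closed
            try {
                channel.read(ByteBuffer.allocate(64));   // any retry on the same channel now fails
            } catch (ClosedChannelException e) {
                System.out.println("retry failed: " + e);
            }
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```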

@elasticsearchmachine added the v9.1.0 and Team:Distributed Coordination labels and removed the v9.0.0 label on Jan 30, 2025
@elasticsearchmachine (Collaborator)
Pinging @elastic/es-distributed-obsolete (Team:Distributed (Obsolete))

@elasticsearchmachine (Collaborator)
Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

Labels

>bug, :Distributed Coordination/Snapshot/Restore, Team:Distributed Coordination, Team:Distributed (Obsolete), v8.10.5, v9.1.0
10 participants