
Do not interrupt underlying Azure repository threads during errors #99320


Open

fcofdez wants to merge 2 commits into main

Conversation

@fcofdez (Contributor) commented Sep 7, 2023

We subscribe on a different thread to read from the blocking input stream in order to avoid blocking the Azure client event loop. When the connection is dropped, Reactor interrupts the thread where the input stream is being read in order to cancel the task promptly. This causes issues and adds confusing error messages to the exception chain, hiding important details.
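
For context, the fix wraps the futures created by the repository's executor so that cancelling a task never interrupts the thread running it (see the UninterruptibleFuture / newTaskFor hunks quoted in the review below). A minimal sketch of that pattern, built here on a plain ThreadPoolExecutor rather than the actual ReactorScheduledExecutorService, with illustrative class and constructor details:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.RunnableFuture;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Executor whose futures ignore the mayInterruptIfRunning flag, so cancelling a
// pending read task never interrupts the thread that is blocked on the read.
class UninterruptibleExecutor extends ThreadPoolExecutor {

    UninterruptibleExecutor(int threads) {
        super(threads, threads, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
    }

    @Override
    protected <T> RunnableFuture<T> newTaskFor(Runnable runnable, T value) {
        return new UninterruptibleFuture<>(super.newTaskFor(runnable, value));
    }

    @Override
    protected <T> RunnableFuture<T> newTaskFor(Callable<T> callable) {
        return new UninterruptibleFuture<>(super.newTaskFor(callable));
    }

    private static final class UninterruptibleFuture<T> implements RunnableFuture<T> {
        private final RunnableFuture<T> delegate;

        UninterruptibleFuture(RunnableFuture<T> delegate) {
            this.delegate = delegate;
        }

        @Override
        public boolean cancel(boolean mayInterruptIfRunning) {
            // Ensure that the thread is never interrupted
            return delegate.cancel(false);
        }

        @Override
        public void run() {
            delegate.run();
        }

        @Override
        public boolean isCancelled() {
            return delegate.isCancelled();
        }

        @Override
        public boolean isDone() {
            return delegate.isDone();
        }

        @Override
        public T get() throws InterruptedException, ExecutionException {
            return delegate.get();
        }

        @Override
        public T get(long timeout, TimeUnit unit) throws InterruptedException, ExecutionException, TimeoutException {
            return delegate.get(timeout, unit);
        }
    }
}
```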

@fcofdez added the >bug, :Distributed Coordination/Snapshot/Restore, Team:Distributed (Obsolete), v8.11.0 and v8.10.1 labels on Sep 7, 2023
@elasticsearchmachine (Collaborator)
Pinging @elastic/es-distributed (Team:Distributed)

@elasticsearchmachine (Collaborator)
Hi @fcofdez, I've created a changelog YAML for you.

@@ -472,4 +474,32 @@ public void testRetryFromSecondaryLocationPolicies() throws Exception {
assertThat(failedGetCalls.get(), equalTo(1));
}
}

public void testPrematureClosedConnectionDoesNotInterruptBackingThread() throws Exception {
Contributor
Could you please point to the place where you simulate thread.interrupt()? Is it missing, or is the interruption somehow implicit?

Contributor

The interrupt comes from the future being cancelled due to the premature connection close set up in the httpServer.createContext line; if you undo the fix (e.g. by applying the following patch) then the test fails:

diff --git a/modules/repository-azure/src/main/java/org/elasticsearch/repositories/azure/executors/ReactorScheduledExecutorService.java b/modules/repository-azure/src/main/java/org/elasticsearch/repositories/azure/executors/ReactorScheduledExecutorService.java
index f621cfe3e979..4c2c378acb12 100644
--- a/modules/repository-azure/src/main/java/org/elasticsearch/repositories/azure/executors/ReactorScheduledExecutorService.java
+++ b/modules/repository-azure/src/main/java/org/elasticsearch/repositories/azure/executors/ReactorScheduledExecutorService.java
@@ -202,7 +202,7 @@ public class ReactorScheduledExecutorService extends AbstractExecutorService imp
         @Override
         public boolean cancel(boolean mayInterruptIfRunning) {
             // Ensure that the thread is never interrupted
-            return delegate.cancel(false);
+            return delegate.cancel(mayInterruptIfRunning);
         }

         @Override
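
For illustration, a premature close of this kind could be simulated with the JDK's built-in com.sun.net.httpserver server along the following lines; the context path and handler body here are hypothetical and are not the actual test code from this PR:

```java
import com.sun.net.httpserver.HttpServer;

import java.net.InetSocketAddress;

public class PrematureCloseServer {
    public static void main(String[] args) throws Exception {
        HttpServer httpServer = HttpServer.create(new InetSocketAddress(0), 0);
        // Hypothetical handler: consume a little of the upload, then close the exchange
        // abruptly so the client sees the connection drop and cancels its pending read task.
        httpServer.createContext("/container/blob", exchange -> {
            exchange.getRequestBody().read(new byte[1024]);
            exchange.close();
        });
        httpServer.start();
        System.out.println("listening on port " + httpServer.getAddress().getPort());
    }
}
```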

@volodk85 (Contributor) left a comment

Interrupting the read thread when the connection is dropped sounds like a good thing: why wait for it if we won't succeed anyway? Maybe I'm missing some context; could you please elaborate?

Comment on lines +129 to +136
    protected <T> RunnableFuture<T> newTaskFor(Runnable runnable, T value) {
        return new UninterruptibleFuture<>(super.newTaskFor(runnable, value));
    }

    @Override
    protected <T> RunnableFuture<T> newTaskFor(Callable<T> callable) {
        return new UninterruptibleFuture<>(super.newTaskFor(callable));
    }
Contributor

Are these creating a new task for each read operation? Just thinking about how promptly a read might be cancelled without the interrupt.

Contributor Author

Yes, see org.elasticsearch.repositories.azure.AzureBlobStore#convertStreamToByteBuffer. We're reading from the input stream in 64Kb chunks and emitting them in order, meaning that after the pipeline has been cancelled we'll need to wait at most until we're able to read 64Kb from disk.
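
A minimal sketch of that chunked-read pattern, assuming a Reactor Flux.generate loop and the boundedElastic scheduler as a stand-in for the repository's own executor; this is illustrative, not the actual convertStreamToByteBuffer implementation:

```java
import reactor.core.publisher.Flux;
import reactor.core.scheduler.Schedulers;

import java.io.InputStream;
import java.nio.ByteBuffer;

final class StreamChunks {
    private static final int CHUNK_SIZE = 64 * 1024; // 64Kb, matching the chunk size mentioned above

    // Emit the stream as ordered 64Kb ByteBuffers; the blocking read() calls run on the
    // scheduler the Flux is subscribed on, not on the Azure client event loop. Once the
    // pipeline is cancelled, the generator is simply not invoked again, so at most one
    // in-flight 64Kb read has to finish.
    static Flux<ByteBuffer> toByteBuffers(InputStream in) {
        return Flux.<ByteBuffer>generate(sink -> {
            try {
                byte[] chunk = new byte[CHUNK_SIZE];
                int read = in.read(chunk, 0, CHUNK_SIZE);
                if (read == -1) {
                    sink.complete();
                } else {
                    sink.next(ByteBuffer.wrap(chunk, 0, read));
                }
            } catch (Exception e) {
                sink.error(e);
            }
        }).subscribeOn(Schedulers.boundedElastic());
    }
}
```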

@fcofdez (Contributor Author) commented Sep 8, 2023

> Interrupting the read thread when the connection is dropped sounds like a good thing: why wait for it if we won't succeed anyway? Maybe I'm missing some context; could you please elaborate?

When org.apache.lucene.store.NIOFSDirectory.NIOFSIndexInput is used to read files from disk and the thread reading from the input is interrupted, the underlying FileChannel is closed and the next read throws an exception. This shouldn't be a problem in itself, but since that read runs asynchronously, it adds unnecessary noise to the logs and hides the real cause of the issue.

@fcofdez (Contributor Author) commented Sep 11, 2023

Additionally, if the underlying FileChannel is closed due to the interruption, retries won't succeed since they use the same underlying file channel.
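
To illustrate both points, here is a small standalone demo (hypothetical code, not from this PR or its tests): a pending interrupt makes the next FileChannel read throw ClosedByInterruptException and closes the channel as a side effect, after which any retry against the same channel fails with ClosedChannelException.

```java
import java.nio.ByteBuffer;
import java.nio.channels.ClosedByInterruptException;
import java.nio.channels.ClosedChannelException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class InterruptClosesChannel {
    public static void main(String[] args) throws Exception {
        Path file = Files.createTempFile("interrupt-demo", ".bin");
        Files.write(file, new byte[1024]);
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            Thread.currentThread().interrupt();          // pending interrupt, e.g. from a cancelled task
            try {
                channel.read(ByteBuffer.allocate(64));   // FileChannel is an interruptible channel
            } catch (ClosedByInterruptException e) {
                System.out.println("read failed: " + e); // the channel is now closed
            }
            Thread.interrupted();                        // clear the flag; the channel stays closed
            try {
                channel.read(ByteBuffer.allocate(64));   // any retry on the same channel now fails
            } catch (ClosedChannelException e) {
                System.out.println("retry failed: " + e);
            }
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```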

@elasticsearchmachine added the v9.1.0 and Team:Distributed Coordination labels and removed the v9.0.0 label on Jan 30, 2025
@elasticsearchmachine (Collaborator)
Pinging @elastic/es-distributed-obsolete (Team:Distributed (Obsolete))

@elasticsearchmachine (Collaborator)
Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

Labels

>bug, :Distributed Coordination/Snapshot/Restore, Team:Distributed Coordination, Team:Distributed (Obsolete), v8.10.5, v9.1.0
10 participants