-
Notifications
You must be signed in to change notification settings - Fork 25.3k
Do not interrupt underlying Azure repository threads during errors #99320
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
We subscribe into a different thread to read from the blocking input stream in order to avoid blocking the azure client event loop. When the connection is dropped, reactor interrupts the thread where the input stream is read to cancel the task promptly, this causes issues and adds confusing error messages to the exception chain, hidding important details.
Pinging @elastic/es-distributed (Team:Distributed) |
Hi @fcofdez, I've created a changelog YAML for you. |
@@ -472,4 +474,32 @@ public void testRetryFromSecondaryLocationPolicies() throws Exception { | |||
assertThat(failedGetCalls.get(), equalTo(1)); | |||
} | |||
} | |||
|
|||
public void testPrematureClosedConnectionDoesNotInterruptBackingThread() throws Exception { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you please point on the place where you simulate thread.interrupt()? is it missed or interruption is implicit somehow?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The interrupt comes from the future being cancelled due to the premature connection close set up in the httpServer.createContext
line - if you undo the fix (e.g. apply the following patch) then the test fails:
diff --git a/modules/repository-azure/src/main/java/org/elasticsearch/repositories/azure/executors/ReactorScheduledExecutorService.java b/modules/repository-azure/src/main/java/org/elasticsearch/repositories/azure/executors/ReactorScheduledExecutorService.java
index f621cfe3e979..4c2c378acb12 100644
--- a/modules/repository-azure/src/main/java/org/elasticsearch/repositories/azure/executors/ReactorScheduledExecutorService.java
+++ b/modules/repository-azure/src/main/java/org/elasticsearch/repositories/azure/executors/ReactorScheduledExecutorService.java
@@ -202,7 +202,7 @@ public class ReactorScheduledExecutorService extends AbstractExecutorService imp
@Override
public boolean cancel(boolean mayInterruptIfRunning) {
// Ensure that the thread is never interrupted
- return delegate.cancel(false);
+ return delegate.cancel(mayInterruptIfRunning);
}
@Override
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interrupting read thread when connection is dropped sounds as a good thing - why wait for it if we won't succeed anyway. Maybe I'm missing some context, could you please elaborate?
@@ -472,4 +474,32 @@ public void testRetryFromSecondaryLocationPolicies() throws Exception { | |||
assertThat(failedGetCalls.get(), equalTo(1)); | |||
} | |||
} | |||
|
|||
public void testPrematureClosedConnectionDoesNotInterruptBackingThread() throws Exception { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The interrupt comes from the future being cancelled due to the premature connection close set up in the httpServer.createContext
line - if you undo the fix (e.g. apply the following patch) then the test fails:
diff --git a/modules/repository-azure/src/main/java/org/elasticsearch/repositories/azure/executors/ReactorScheduledExecutorService.java b/modules/repository-azure/src/main/java/org/elasticsearch/repositories/azure/executors/ReactorScheduledExecutorService.java
index f621cfe3e979..4c2c378acb12 100644
--- a/modules/repository-azure/src/main/java/org/elasticsearch/repositories/azure/executors/ReactorScheduledExecutorService.java
+++ b/modules/repository-azure/src/main/java/org/elasticsearch/repositories/azure/executors/ReactorScheduledExecutorService.java
@@ -202,7 +202,7 @@ public class ReactorScheduledExecutorService extends AbstractExecutorService imp
@Override
public boolean cancel(boolean mayInterruptIfRunning) {
// Ensure that the thread is never interrupted
- return delegate.cancel(false);
+ return delegate.cancel(mayInterruptIfRunning);
}
@Override
protected <T> RunnableFuture<T> newTaskFor(Runnable runnable, T value) { | ||
return new UninterruptibleFuture<>(super.newTaskFor(runnable, value)); | ||
} | ||
|
||
@Override | ||
protected <T> RunnableFuture<T> newTaskFor(Callable<T> callable) { | ||
return new UninterruptibleFuture<>(super.newTaskFor(callable)); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these creating a new task for each read operation? Just thinking about how promptly a read might be cancelled without the interrupt.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, see org.elasticsearch.repositories.azure.AzureBlobStore#convertStreamToByteBuffer
. We're reading from the input stream in 64Kb chunks, and we emit those in order; meaning that after the pipeline has been cancelled at most we'll need to wait until we're able to read 64Kb from disk.
When |
Additionally, if the underlying FileChannel is closed due to the interruption, retries won't succeed since they use the same underlying file channel. |
Pinging @elastic/es-distributed-obsolete (Team:Distributed (Obsolete)) |
Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination) |
We subscribe into a different thread to read from the blocking input stream in order to avoid blocking the azure client event loop. When the connection is dropped, reactor interrupts the thread where the input stream is read to cancel the task promptly, this causes issues and adds confusing error messages to the exception chain, hiding important details.