This article analyzes a problem I encountered while developing CeresDB and discusses issues that can arise with Tokio task scheduling. My knowledge is limited, so please point out any inadequacies.
# Background
[CeresDB](https://github.com/CeresDB/ceresdb) is a high-performance, cloud-native time series database. Its storage engine uses an LSM-like architecture: data is first written to a memtable, and once a threshold is reached it is flushed to the underlying storage (e.g. S3). To keep the number of small files under control, a background thread also performs compaction.
In production, I found a strange problem. Whenever compaction requests increased for a table, the flush time would spike even though flush and compaction run in different thread pools and have no direct relationship. Why did they affect each other?
# Analysis
To investigate the root cause, we need to understand Tokio's task scheduling mechanism. Tokio is an event-driven, non-blocking I/O platform for writing asynchronous applications: users submit tasks via `spawn`, and Tokio's scheduler decides how to execute them, most of the time with the multi-threaded scheduler.
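Before going further, here is a minimal, self-contained example (unrelated to CeresDB) of submitting a task via `spawn`; it assumes Tokio's `macros`, `rt-multi-thread`, and `time` features are enabled:

```rs
use std::time::Duration;

#[tokio::main] // defaults to the multi-threaded scheduler
async fn main() {
    // Each call to `spawn` hands a task to the scheduler.
    let handle = tokio::spawn(async {
        tokio::time::sleep(Duration::from_millis(50)).await;
        42
    });

    // A `JoinHandle` resolves to the task's result.
    let value = handle.await.unwrap();
    println!("task returned {value}");
}
```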
The multi-threaded scheduler dispatches tasks to a fixed thread pool, and each worker thread keeps a local run queue of pending tasks. On startup, each worker thread enters a loop that sequentially fetches and executes tasks from its run queue. Without a balancing strategy this scheduling can easily become skewed, so Tokio uses work stealing: when a worker's run queue is empty, it tries to "steal" tasks from other workers' queues and execute them.
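As a small sketch (the worker count below is arbitrary), the size of that fixed pool can be set explicitly when building a runtime by hand:

```rs
use std::time::Duration;
use tokio::runtime::Builder;

fn main() {
    // A multi-threaded runtime with a fixed number of worker threads.
    // Each worker owns a local run queue; idle workers steal from the others.
    let runtime = Builder::new_multi_thread()
        .worker_threads(4)
        .enable_all()
        .build()
        .unwrap();

    runtime.block_on(async {
        for i in 0..8 {
            tokio::spawn(async move {
                println!("task {i} running on one of the worker threads");
            });
        }
        // Give the spawned tasks a moment to run before the runtime shuts down.
        tokio::time::sleep(Duration::from_millis(10)).await;
    });
}
```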
To see how a single task can occupy a worker thread, consider the following `async` block:

```rs
async move {
    fut_one.await;
    fut_two.await;
}
```
When executed, this async block is transformed into something like the following:
```rs
// Simplified: the compiler turns the async block into a state machine.
// The definitions of AsyncFuture, FutOne, FutTwo and State are omitted here;
// only the generated poll logic is shown.
impl Future for AsyncFuture {
    type Output = ();

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        loop {
            match self.state {
                // Drive fut_one; if it is not ready, yield back to the worker.
                State::AwaitingFutOne => match self.fut_one.poll(cx) {
                    Poll::Ready(()) => self.state = State::AwaitingFutTwo,
                    Poll::Pending => return Poll::Pending,
                },
                // Then drive fut_two the same way.
                State::AwaitingFutTwo => match self.fut_two.poll(cx) {
                    Poll::Ready(()) => self.state = State::Done,
                    Poll::Pending => return Poll::Pending,
                },
                State::Done => return Poll::Ready(()),
            }
        }
    }
}
```
When we call `AsyncFuture.await`, `AsyncFuture::poll` gets executed. Control flow returns to the worker thread only on state transitions (Pending or Ready). If `fut_one.poll()` calls a blocking API, the worker thread is stuck on that task, and the other tasks in that worker's run queue are likely not to be scheduled in time, despite work stealing. Let me explain this in more detail:
*(Figure: Task0–Task3 scheduled with CPU and IO tasks mixed on one thread, Figure 1, versus separated into two runtimes, Figure 2.)*
In the above figure, there are four tasks:
- Task0 and Task1 are hybrid tasks that contain both IO and CPU work
- Task2 and Task3 are purely CPU-bound tasks
Different execution strategies lead to different completion times for these tasks:
- In Figure 1, CPU and IO tasks are mixed on one thread; in the worst case, Task0 and Task1 take 35 ms.
- In Figure 2, CPU and IO tasks are separated and executed in two runtimes; Task0 and Task1 both take 20 ms.
Therefore, in Tokio it is generally recommended to run tasks that may execute for a long time with `spawn_blocking`, so that the worker thread regains control as quickly as possible.
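As a sketch of the difference (the 200 ms `expensive_cpu_task` below is a made-up stand-in for real CPU work, not CeresDB code):

```rs
use std::time::Duration;

// Stand-in for a CPU-heavy or otherwise blocking piece of work.
fn expensive_cpu_task() -> u64 {
    std::thread::sleep(Duration::from_millis(200));
    42
}

#[tokio::main]
async fn main() {
    // Bad: this blocks the current worker thread for 200 ms, so other tasks
    // queued on this worker cannot make progress in the meantime.
    let _x = expensive_cpu_task();

    // Better: hand the blocking work to Tokio's dedicated blocking thread pool,
    // so the async worker thread stays free to poll other tasks.
    let y = tokio::task::spawn_blocking(expensive_cpu_task).await.unwrap();
    println!("result: {y}");
}
```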
With the above knowledge in mind, let's analyze the question posed at the beginning of this article. The flush and compaction operations can be expressed in the following pseudo-code:
```rs
async fn flush() {
    let input = memtable.scan();
    // CPU-heavy work (sorting, encoding, ...) runs inline on the worker thread,
    let processed = expensive_cpu_task(input);
    // and only afterwards is the result written to S3.
    write_to_s3(processed).await;
}

async fn compact() {
    let input = read_from_s3().await;
    let processed = expensive_cpu_task(input);
    write_to_s3(processed).await;
}

// flush and compact are driven by two separate runtimes.
runtime1.block_on(flush());
runtime2.block_on(compact());
```
As we can see, both flush and compact have the issue described above: `expensive_cpu_task` can block the worker thread, which in turn affects S3 read/write times. The S3 client is [object_store](https://docs.rs/object_store/latest/object_store/), which uses [reqwest](https://docs.rs/reqwest/latest/reqwest/) for HTTP.
If flush and compact ran in the same runtime, no further explanation would be needed. But how do they affect each other when running in different runtimes? I wrote a [minimal program](https://github.com/jiacai2050/tokio-debug) to reproduce this.
The program has two runtimes, one for an IO scenario and one for a CPU scenario. Every request should take only about 50 ms, but the actual time is longer because a blocking API is used in the CPU scenario. The IO scenario does no blocking, so its tasks should also finish in around 50 ms, yet some of them, especially `io-5` and `io-6`, take roughly 1 s.
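To make the setup concrete, here is a rough sketch of the experiment's shape. It is not the actual program from the repository (the real one issues HTTP requests through reqwest); in this simplified form the blocking CPU tasks should not slow down the IO runtime at all, which is exactly what makes the observed interference surprising:

```rs
use std::time::{Duration, Instant};
use tokio::runtime::Builder;

fn main() {
    // One runtime for the "IO" scenario and one for the "CPU" scenario.
    let io_rt = Builder::new_multi_thread().worker_threads(2).enable_all().build().unwrap();
    let cpu_rt = Builder::new_multi_thread().worker_threads(2).enable_all().build().unwrap();

    // CPU scenario: tasks that block their worker thread.
    for i in 0..8 {
        cpu_rt.spawn(async move {
            std::thread::sleep(Duration::from_millis(500)); // blocking call inside an async task
            println!("cpu-{i} done");
        });
    }

    // IO scenario: tasks that should each take ~50 ms.
    let handles: Vec<_> = (0..8)
        .map(|i| {
            io_rt.spawn(async move {
                let start = Instant::now();
                tokio::time::sleep(Duration::from_millis(50)).await; // stand-in for an HTTP request
                println!("io-{i} took {:?}", start.elapsed());
            })
        })
        .collect();

    io_rt.block_on(async {
        for h in handles {
            h.await.unwrap();
        }
    });
}
```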
The program's log shows that `io-5` had already spent 192 ms before its HTTP request was even issued.
I filed an issue about this [here](https://github.com/seanmonstar/reqwest/discussions/1935) but haven't received any answers yet, so the root cause is still unclear. However, I gained a better understanding of how Tokio schedules tasks: much like in Node.js, we should never block the scheduler thread. In CeresDB we fixed the problem by adding a dedicated runtime to isolate CPU and IO tasks, instead of disabling the pool; see [here](https://github.com/CeresDB/ceresdb/pull/907/files) if you are curious.
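A minimal sketch of that kind of isolation is shown below; `Runtimes`, `merge_sst_files`, and the thread counts are hypothetical names and values for illustration, not the actual CeresDB code (see the linked PR for that):

```rs
use tokio::runtime::{Builder, Runtime};

// Hypothetical CPU-heavy step of a flush, used only for illustration.
fn merge_sst_files(input: Vec<u8>) -> Vec<u8> {
    // imagine expensive merging/encoding work here
    input
}

struct Runtimes {
    io_runtime: Runtime,  // drives S3 reads/writes and other async IO
    cpu_runtime: Runtime, // reserved for CPU-heavy work
}

impl Runtimes {
    fn new() -> Self {
        let build = || {
            Builder::new_multi_thread()
                .worker_threads(4)
                .enable_all()
                .build()
                .unwrap()
        };
        Self { io_runtime: build(), cpu_runtime: build() }
    }

    async fn flush(&self, input: Vec<u8>) -> Vec<u8> {
        // Ship the CPU-heavy part to the dedicated CPU runtime, so the IO
        // runtime's worker threads stay free to poll network futures.
        self.cpu_runtime
            .spawn(async move { merge_sst_files(input) })
            .await
            .unwrap()
        // The caller would then upload the result to S3 on the IO runtime.
    }
}
```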
# Summary
Through a CeresDB production issue, this post has explained Tokio scheduling in simple terms. Real situations are of course more complex, and users need to analyze carefully how their async code gets scheduled before settling on a solution. Tokio also applies many detailed optimizations to keep latency as low as possible; interested readers can learn more from the links below:
- [Making the Tokio scheduler 10x faster](https://tokio.rs/blog/2019-10-scheduler)