-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathprinciple.tex
618 lines (574 loc) · 33.3 KB
/
principle.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
\section{Safety in \projname{}}
\label{sec:principle}
% \jinghao{\S\ref{sec:principle}: How to make it safe}
%\begin{itemize}
% \item goal: expressive and safe (more expressive than and at least as safe
% as eBPF)
% \item Current problem of eBPF: the verifier
% \item Idea: remove the verifier and use a safe language
% \item the language should be Turing complete and at the same time low-level
% enough to provide safety features.
% \item We choose Rust as the implementation language.
% \item The use of Rust automatically provides eexpresssiveness
% \item We then show how a safety can be achieved for kernel extensions via
% language-based safety from Rust.
% \item how to ensure safety (high-level discussion, no implementation
% details)
% \begin{itemize}
% \item builtin memory/control/type safety
% \begin{itemize}
% \item generic and const-generic functions and slices to
% prevent OOB access
% \item Safe direct packet access for XDP programs
% \item Retired expressiveness kernel helpers (\jinghao{probably does not belong here})
% \end{itemize}
% \item RAII (e.g. lock, refcnt)
% \item exception handling and stack unwinding
% \begin{itemize}
% \item Needed because Rust itself has runtime checks (e.g.
% array OOB access)
% \item handle exceptional control flow
% \item clean up resources to achieve RAII under exceptional
% circumstances
% \end{itemize}
% \item runtime mechanism to support properties that are
% (fundamentally) hard to check at compile time
% \begin{itemize}
% \item program termination
% \item stack overflow protection
% \end{itemize}
% \end{itemize}
%\end{itemize}
% Some small things here and there
%
% - no clone impl for objects returned by RT crate to ensure uniqueness
% - difference between userspace Rust: there are things the compiler cannot see
% e.g., we cannot have something like Mutex::get_mut in bpf_spinlock to
% access a value w/o lock using the single-mutable-reference rule
The goal of \projname{} is to provide enhanced usability for kernel extensions,
while ensuring safety of the extension programs.
In particular, we aim to make \projname{} more usable than eBPF and as safe.
As discussed in \S\ref{sec:motivation}, the current problem of usability of
eBPF is the large \gap{}. % between the programer, who works on
% high-level languages, and the verifier, which operates on compiled bytecode.
% Our insight is that both the needed expressiveness and safety can be obtained
% from a safe programming language without the verifier.
Our insight is that by using a safe, high-level language that directly
implements the required safety properties of kernel extensions, the need
for the extra verification layer can be dropped, and thereby closing the
\gap{}.
\mvle{not completely true tho since we need the runtime part for termination}
% Specifically, the language should have the following properties:
% \begin{itemize}
% \item \textbf{Turing-complete}: This helps to satisfy the expressiveness
% requirement.
% \item \textbf{Safe}: The language should not be as permissive and
% error-prone as unsafe languages like C.
% \item \textbf{Low-level}: Being low-level helps the programs to better
% model low-level kernel semantics and simplifies infrastructure needed
% to support it.
% \end{itemize}
% \djw{background on Rust moves here}
% Need to talk about:
% - Language-based safety no undefined behavior in Rust.
% - Safe/unsafe Rust.
% - Popularity (also embraced by kernel)
\projname{} chooses Rust as the safe language for extension programs.
Rust is a system programming language that aims to provide
\emph{language-based} safety, which promises programs to be free of
\emph{undefined behaviors}.
The compiler both performs static analysis (e.g., lifetime and ownership
models) and inserts dynamic checks (e.g. array bounds checks) in
Rust programs to fulfill its safety promises.
To facilitate low-level code that may be hard to fit into its safety model (e.g.
FFI calls), the language also provides \emph{unsafe} Rust, which is more
permissive and provides a weaker safety guarantee, as opposed to
\emph{safe} Rust.
Utilization of the language-based safety of Rust under kernel contexts has been
explored by previous works~\cite{redleaf,theseus}, and is also incorporated
into the Linux kernel~\cite{rust-for-linux-lwn}.
% This is because Rust happens to provide the desired properties -- its
% language-based safety can be leveraged for safe kernel extensions without a
% verifier.
In \projname{}, Rust provides either the desired properties directly or
essential building blocks to ensure the important safety properties for
kernel extensions (\S\ref{sec:background}).
% Here, we list the important safety properties
% (\jinghao{may need to move to \S\ref{sec:background}})
% \begin{itemize}
% \item Memory safety
% \item Type safety
% \item Safe resource management
% \item Exception-based runtime safety
% \item Stack safety
% \item Program termination
% \end{itemize}
% \jinghao{The runtime exception handling is not necessarily an independent one,
% it actually covers some of the type safety (arrays and slices does runtime
% checks that may trigger exceptions)}
\begin{figure}
\includegraphics[width=1.0\linewidth]{figs/overview.pdf}
\centering
\vspace{-10pt}
\caption{Overview of \projname{}}
\label{fig:rex-overview}
\vspace{-10pt}
\end{figure}
We now discuss how the language-based safety from Rust can be applied to the
context of kernel extensions to provide a safe programming interface.
% While the use of Rust automatically provides expressiveness (as it is
% Turing-complete), it does not supply the safety out-of-box, and the use of
% Rust, especially unsafe Rust, can still exhibit undesired behaviors.
% We require all extension programs to be implemented \emph{only} in safe Rust.
% On top of that, we discuss how the safety properties from Rust can be leveraged
% and applied to the context of kernel extensions to provide a safe
% programming interface.
Figure~\ref{fig:rex-overview} provides an overview of the \projname{} framework.
The extension program is written strictly in \emph{safe} Rust.
It is compiled by the Rust compiler, which not only ensures safety properties
in Rust but also runs a \projname{}-specific pass for kernel stack safety.
The program interacts with the kernel through a program interface implemented
by the \projname{} \emph{kernel crate}.
The kernel crate contains a mixture of unsafe Rust code due to its role as a
bridge between the extension program, which is expected to be checked for
safety, and the unsafe but trusted kernel code.
It also provides a custom Rust panic handler to support panic-based runtime
safety checks in Rust.
The program links with the kernel crate at compile time and runs in a
light-weight runtime environment implemented in the kernel, which provides
program termination and stack unwinding support.
\subsection{Memory safety}
\label{principle:memsafety}
Generally, kernel extensions are not allowed to access kernel memory.
However, there are frequent cases that an extension program needs to work on a
shared buffer of data or a specific struct, and exchange data with the
kernel.
% These regions of memory are accessed by both the program via helper functions
% and the kernel.
% Rust provides powerful primitives that help to validate the memory is
% accessed correctly.
There are usually two common patterns on how the memory is accessed, differing
by the owner of the memory: 1) Memory owned by the extension program (e.g.,
an on-stack buffer) is sent to the kernel through the helper function
interface. 2) Memory owned by the kernel (e.g., an in-kernel struct) is
accessed from the extension program. % direct packet access, map data ptr
In the first case, the extension program may allocate some memory on the stack
and send it to the kernel for processing (e.g., asking kernel to fill a
stack buffer with some data).
An unsafe memory access, for example, an out-of-bounds write to the on-stack
buffer, could result in corruption of stack data and possibly trigger a
kernel crash if the return address is overwritten to a non-present page.
For eBPF, the verifier validates the memory region with the size to make sure
that both the kernel and the program do not make erroneous accesses.
For the memory buffers sent to the kernel through the helper function interface,
the corresponding size is also sent as an argument to the helper.
The verifier, at program load time, checks to ensure that the size is exactly
that of the memory buffer.
Then, at runtime, the kernel can operate on the buffer with the knowledge of
its bounds, thereby avoid unsafe memory accesses.
In \projname{}, the strict type system of Rust already protects the program
from making an unsafe access.
\projname{} leverages the generic programming feature of Rust to ensure the
size sent through the helper function interface is always valid.
% In order to make sure that the size sent to the kernel is always correct, which
% is vital for the correctness of memory access from the kernel, we leverage
% the generic programming support of Rust.
For kernel helper functions that take in a
program-supplied pointer and size, the \projname{} kernel crate creates an
adaptor interface that parametrizes the type of the pointer as a generic
type parameter.
The interface queries the size of the generic type from the compiler
and invokes the kernel interface with this size as the argument.
Since Rust employs monomorphization~\cite{rustc-monomorphize}, the concrete
type and its size is resolved at compile time, adding no runtime overhead.
In this way, the size is guaranteed to match the type statically and the
kernel will never make an out-of-bound access.
This can work for not only scalar types, but array types as well: Rust
also supports ``const generics'' that allows a constant to be used as a
generic parameter, which can be used to encode the length of arrays.
In the second case, the kernel may provide the user program a pointer to
kernel memory to perform direct access (e.g., access through eBPF map value
pointers and packet pointers).
The user programs must not make an out-of-bound memory access for this
kernel-owned region of memory, as doing so risks corrupting kernel data.
In eBPF, for accesses on pointers with a static size, e.g. map value pointers,
as maps store the size of its value types, the verifier can independently
verify the validity.
For pointers without a static size like packet data pointers, the verifier
enforces programs to explicitly check for memory
boundaries and to not make out-of-bound accesses.
In \projname{}, pointers with a static size are handled through the Rust type
system, as the
kernel map interface of \projname{} encodes the key and value types through
generics.
The type system therefore forces the pointer to be a safe Rust reference of
the particular type.
Pointers referring to a memory region without a static size are generally
dynamic-sized buffers.
The \projname{} kernel crate abstracts such pointers into a Rust \emph{slice}
with dynamic size.
Rust slice provides runtime bounds checks (\S\ref{principle:eh}), which allows
the check to happen automatically without explicit handling from the
extension program.
\subsection{Extended type safety}
Kernel extensions may contain certain paradigms that are beyond the safe
type system of Rust.
One challenge is to allow extension programs to safely reinterpret the stream
of bytes into useful data.
Such cases are notably found in networking use cases, where a program may need
to extract the Ethernet header from a byte buffer in the packet.
Safety of these transformations is hard to reason about because they
inevitably involve unsafe type casting.
% For example, a XDP program using direct packet access might need to extract
% the ethernet header information from the bytes in the packet.
Currently, eBPF allows the program to freely interpret the packet data into
other data types via pointer casting.
The verifier checks the casting and ensures that 1) the program does not make a
pointer from scalar values and 2) the new type fits within the memory
boundaries.
\projname{} agrees with the verifier that as long as the above two properties
hold, the reinterpreting cast (dubbed ``transmute'' in Rust) is safe to
perform.
It therefore extends the type safety of Rust to cover such cases.
% In the case of Rust, the reinterpret cast (dubbed ``transmute'' in Rust) is an
% unsafe operation, particularly because Rust does not prevent making
% pointers from scalar values.
To allow extension programs to still safely reinterpret the bytes into useful
data types, \projname{} defines a group of primitive scalar types that are
considered safe as targets for casting.
\projname{} requires the target type of casting to be of either the safe types
or a structure type in which all the members are of the safe types.
The safe types are specified by implementing the \texttt{Rex::SafeTransmute}
trait, which is only implementable by code within the kernel crate.
We use the \emph{proc-macro} feature of Rust to enforce this constraint at
compile time.
Like conventional C macros, proc-macros performs transformation on the program,
albeit on the abstract syntax tree level.
Our proc-macro, when applied on a structure type, generates code that tries
to treat each field of the structure as an instance of
\texttt{Rex::SafeTransmute} using Rust trait bounds, followed by the actual
transmute operation to perform the unsafe cast.
If one of the fields in the structure does not implement
\texttt{Rex::SafeTransmute}, the Rust compiler will issue a compile error.
At the same time, if the program tries to directly transmute the bytes into a
structure without using the proc-macro, the compiler will also emit an
error because transmute belongs to unsafe Rust, which is forbidden in
\projname{} programs.
Since proc-macro transformations happen after the linting of unsafe operations,
the transmute code generated by the proc-macro will not be rejected and,
therefore, allows programs to perform transmutes in a safe, and controlled
way.
\jinghao{Shall we remove this sentence, people may ask whether programs can
use proc-macros to get around the safe Rust requirement}
\subsection{Safe resource management}
%Executing in the kernel,
Extension programs must also acquire and release
resources properly under kernel context.
Some kernel helper functions return kernel resources that require
explicit release after use through corresponding helper function calls.
Failing to do so will result in leakage of kernel resources such as reference
counts and acquired spinlocks.
% In eBPF, some kernel helper functions returns kernel resources that requires
% explicit release after use (e.g., reference counts of kernel objects,
% locks).
In these situations, the eBPF verifier checks and ensures that the resource is
released on all possible code paths to prevent leaking kernel resources.
\projname{} uses the resource-acquisition-is-initialization
(RAII)~\cite{rust-raii} pattern of Rust: for each kernel resource that
extension programs need, \projname{} creates an RAII wrapper type that ties
the resource with the lifetime of the wrapper object.
For example, when the program obtains a spinlock from the kernel, the
\projname{} kernel crate constructs and returns a \emph{lock guard}.
The lock guard implements the RAII semantics through the
\texttt{drop}~\cite{rust-drop} trait in
Rust, which defines the operation to perform when an object is destroyed.
In the case of lock guard, its \texttt{drop} handler releases the lock.
The program does not need to explicitly release the lock or drop the lock
guard, instead, the compiler inserts an implicit \texttt{drop} at the end
of the current scope. This effectively releases the lock when the lock
guard goes out of scope.
\subsection{Runtime exception handling}
\label{principle:eh}
While the compiler contributes a lot to the safety of Rust, many safety checks
in Rust happen at runtime in the form of exceptions (i.e., Rust panics),
which is different from eBPF.
In userspace, Rust uses the Itanium exception handling ABI~\cite{itanium-abi}.
When an exception is triggered, the control flow is transferred to the Rust
panic handler in its standard library, which in turn calls into the unwind
library to perform stack unwinding and resource cleanups for each stack
frame.
However, the Itanium ABI is complex and not suitable for kernel extensions,
making exception handling a challenge in \projname{} extension programs:
\jinghao{Took these directly from the HotOS paper, shall we paraphrase?}
\begin{itemize}
% \item The Itanium ABI-based exception handling is too complicated, as the
% userspace unwind libraries are not direcly usable.
\item Failures during unwinding, which are permissible in userspace, cannot
be tolerated in kernel space, as incomplete cleanup means leaking
kernel resources.
% \item ABI-based unwinding typically requires dynamic allocation, which
% creates challenges for extensions in interrupt contexts, in which an
% allocator may not be available~\cite{bpf-mempool-lwn}. \jinghao{I'm
% thinking about removing this one since bpf now has an allocator.}
\item Unwinding generally executes destructors for all existing objects on
the stack, but executing untrusted, user-defined destructors (via the
\texttt{\small Drop} trait in Rust) is not safe.
\end{itemize}
\begin{figure}
\includegraphics[width=1.0\linewidth]{figs/exception_handling.pdf}
\centering
\vspace{-10pt}
\caption{Exception handling control flow in \projname{}}
\label{fig:eh-overview}
\vspace{-10pt}
\end{figure}
Our exception handling framework consists of two components: 1) graceful exit
upon exceptions that resets the current context and 2) resource cleanup to
ensure release of kernel resources like reference counts and locks.
\para{Graceful exit:}
To support a graceful exit from an exception, \projname{} implements a small
runtime (as shown in Figure~\ref{fig:eh-overview}) in the kernel that
contains a program dispatcher, a panic handler, and a landingpad function.
The dispatcher function takes the same duty of executing the extension program
as the existing eBPF dispatcher.
It saves the old stack and frame pointer of the current context into per-CPU
memory before switching to the dedicated program stack
(\S\ref{principle:stack}) and calling into the program.
If the program exits normally, without triggering an exception, it would
return to the dispatcher, which switches the stack back.
Under exceptional cases where a Rust panic is triggered, the panic handler will
release the kernel resources currently allocated by the program, and
transfer the control flow to the in-kernel landingpad function to print
debug information related to the exception (e.g., exception message) to the
kernel ring buffer.
Then, the landingpad function redirects the control flow to a pre-defined label
in the middle of the dispatcher function, where it restores the old value
of the stack and frame pointer from the per-CPU storage.
This effectively unwinds the stack and resets the context as if the extension
program returned successfully.
% We implement the program dispatcher function with panic handling in the kernel
% in 28 lines of x86 assembly code.
\para{Resource cleanup:}
Light-weight mechanisms can be effective for resource cleanup in \projname{}.
Our insight for resource cleanup is that the extension program can only obtain
resources by explicitly issuing helper function calls and therefore only
these resources need to be released.
For these helper function calls, we record the allocated kernel resources
during execution into a statically defined per-CPU buffer.
Our current implementation can support up to 64
instances of kernel resources during a single run.
Upon a panic, the panic handler will take the responsibility to correctly
handle the release of kernel resources, which involves traversing the
buffer and dropping the recorded resources.
% performing cleanup for each of the recorded resources.
% \jinghao{It seems that nesting should be put somewhere, due to the fact that
% these per-CPU tricks depend on no-nesting.}
We implement the cleanup code as part of the \projname{} kernel crate, as it
is the core component responsible for coordinating helper function calls
that obtain kernel resources.
% In this way, the resource cleanup is completely transparent to the extension
% programs, adding no additional programming burden like the Itanium ABI.
The other reason for implementing cleanup mechanism inside the kernel crate is
safety, because such code is called upon panic, it must not trigger yet
another Rust panic and cause panic handling to fail.
Therefore, \projname{} does not execute user-supplied \texttt{drop} handlers
upon panic, as they are not guarranteed to be safe under panic handling
context.
%and programs cannot allocate resources without helper function
% calls.
\subsection{Kernel stack safety}
\label{principle:stack}
One unique safety requirement that kernel extension programs face is the
usage of kernel stack.
Unlike userspace stacks, which grows on demand with a large maximum size,
the stack in kernel space is fix-sized (4 pages on x86-64) and limited.
Overflowing the kernel stack may result in memory corruptions or kernel panics.
The eBPF verifier handles stack safety by keeping track of the stack size
throughout its symbolic execution process.
However, the verification of the stack usage does not always yield the correct
result -- the verifier has been shown to fail to track stack
usage across indirect tail calls~\cite{ebpf-stackoverflow}.
Our observation is that the enforcement of stack safety can be easily performed
at compile time if the extension program does not have indirect or
recursive calls, and conversely, it is easier to perform such check at
runtime when the program does employ indirect or recursive calls.
\projname{}, therefore, takes a hybrid approach in ensuring stack safety that
utilizes static checks for programs without indirect or recursive calls,
and dynamic, runtime checks for programs make use of such function calls.
For each program being compiled, the \projname{}-specific pass in the compiler
will check whether it involves indirect or recursive calls.
When the extension program contains only direct, non-recursive function calls,
its total stack usage can be calculated by traversing its static callgraph
and sum up the size of each call frame.
If the total stack usage of the program exceeds total amount of stack
available, the \projname{} compiler pass will generate an error and reject
the program.
On the other hand, programs with indirect or recursive calls are hard to check
for stack usage statically, as is the case in the eBPF verifier.
In these cases, \projname{} performs runtime checks to limit the stack usage of
programs.
The \projname{} compiler pass first ensures each function in the program takes
less than one page (4K) of stack.
Then, before each function call in the extension program, the pass inserts a
call to the \texttt{rex\_check\_stack} function.
\texttt{rex\_check\_stack} is provided by the kernel crate and checks the
current stack depth of the program, if the stack usage exceeds the
threshold, it will trigger a Rust panic and terminate the
program safely (\S\ref{principle:eh}).
In order for the \texttt{rex\_check\_stack} function to have more control
on the stack usage of the program and also support programs with slightly
large stack usage, \projname{} implements a dedicated kernel stack for the
extension programs.
The dedicated kernel stacks are allocated per-CPU and virtually mapped during
kernel boot time with a size of 4 pages.
% the same size as the kernel IRQ and task stacks.
In the dispatcher function (Figure~\ref{fig:eh-overview}), before calling into
the \projname{} program, the stack and frame pointers of the current
context are saved.
The dispatcher function then sets the stack and frame pointer registers to the
top of the dedicated stack and invokes the program.
Upon program exit, the original stack and frame pointers are restored, no
matter whether a exception is triggered during program execution.
\projname{} defines the stack usage threshold to be 2 pages less than the
total stack available.
This design choice is based on two considerations.
First, kernel helper functions are not visible at program compile time but they
also account for stack usage during program execution.
Leaving extra spaces accomodates kernel helper functions and we believe two
pages of stack is a reasonable limit for kernel helper functions.
Second, since the stack usage of each function in the program is limited to
1 page of stack, in the worse case, the remaining stack space is at least
one page when \texttt{rex\_check\_stack} triggers a Rust panic.
This worse case guarantee leaves enough space for the panic handling and stack
unwinding routines.
Such a dynamic mechanism allows \projname{} to achieve better stack-safety than
eBPF, which only relies on static verification.
\subsection{Bounded program runtime}
\label{principle:termination}
% A bounded program runtime is important for kernel safety
% - a potentially long running program can hold the CPU and cause kernel
% lockups
% eBPF handles this by imposing verification limit (ref S3)
% (Side note:
% First, this is really ``bounded runtime'' rather than termination, because
% a long-running program that eventually terminate is almost equally bad
% Why would programs ever hit that limit if they are supposed to have an
% acceptable runtime? Is it because of these two reasons:
% 1. The verifier misses some key information that can reduce the search
% space drastically, as is the case of unbounded loops?
% 2. If the program is meant to execute that many instructions (current
% limit is 1M), is it because the current limitation is still too small
% or is it that the program is badly implemented and actually runs
% longer? (Also, how long would 1M eBPF instructions run with
% interpretation and JIT?)
% )
% - A program that goes over the limit gets rejected, no matter whether it will
% actually run long
% - creates usability issues
% - previous works has also demonstrated ways of creating long running programs
% without going over the limit (cite raj-lpc and jinghao-hotos)
%
% Given the ineffectiveness and usability issues associated with the static
% verification approach, Rex uses a dynamic mechanism that interrupts and
% terminates programs.
% When a program goes over its time limit, it will be terminated.
% The dynamic termination in Rex limits the runtime of the program through a
% timer.
% When the timer expires, it issues an IPI to the CPU the program is running.
% The IPI effectively suspends the program and saves its registers to the stack.
% Inside the IPI handler, the timeout handler for Rex is executed, which
% modifies the instruction pointer stored on the stack to the panic handler
% of rex programs.
% After returning from the IPI the program will be executing the panic handler
% to gracefully exit.
%
% Challenge: no termination in helpers/panic handlers
%
% We use a per-CPU flag to solve this
Ensuring a bounded runtime for extension programs is critical to the safety of
the kernel.
A long-running extension program could hold the processor for a long time and
cause kernel lockups.
eBPF addresses potentially long-running programs by imposing a static limit on
the number of instructions the verifier would go through.
A program that goes over the instruction limit will be rejected by the
verifier, no matter whether the program will actually run for a long time.
This techinque creates many false-positives and is one of the main sources of
the usability issues in eBPF, as discussed in
\S\ref{motivation:restructure}.
Previous works~\cite{ebpf-termination,untenableVerification} have also
demonstrated creation of long-running eBPF programs without going over the
verification limits, further weakening the runtime guarantee from the
verifier.
Given the usability issues and ineffectiveness associated with the static
verification approach, \projname{} employs a dynamic mechanism that
interrupts and terminates long-running programs at runtime.
% \projname{} supports dynamic program termination through an additional API in the kernel's
% BPF syscall interface.
The dynamic termination in \projname{} limits the runtime of the program
through a timer.
When the timer expires, an inter-processor interrupt (IPI) is sent to the CPU
the program is running on.
This IPI effectively suspends the program and pushes its registers onto the
stack.
Inside the IPI handler, the timeout handler of \projname{} is
executed, which sets the saved instruction pointer register to the
panic handler of the program.
After returning from the IPI, the program will be executing the panic handler,
which cleans up kernel resources allocated and gracefully exits the program.
% The termination logic uses the Linux kernel's Inter-Process Interrupt (IPI) mechanism
% to raise an interrupt on the target \projname{} program's CPU.
% Depending on the use-case, the IPI can be triggered by an operator or a timer within the system
% that is configured to notify when the \projname{} exceeds a certain runtime threshold.
% Note that we are assuming the availability of a free CPU for an operator to be able to invoke
% an IPI.
% In uniprocessor machines, where the only CPU will be busy running the extension, the operator
% won't be able to issue termination request and has to rely on installing timers.
% We find this limitation acceptable due to the current cloud infrastructure predominantly being SMP.
% To safely terminate a \projname{} extension, we need to ensure the following :
One challenge involved in terminating a \projname{} program safely is that a
program should never be terminated while executing a kernel helper
function or the panic handler, as doing so disrupts internal bookkeeping of
the kernel (e.g. acquired resources) and safe exception handling process of
\projname{}.
To address this challenge, \projname{} defers termination when the program is
running kernel helper functions, and does not attempt to terminate if the
program is already in the panic handler on the exit path.
% \begin{enumerate}
% \item A \projname{} program should never be terminated while executing a
% kernel helper function or the panic handler, as doing so disrupts
% internal bookkeeping of the kernel (e.g. acquired resources) and safe
% exception handling process of \projname{}.
% This is neccesary to ensure kernel objects acquired during a helper call
% are freed.
% \item Kernel resources allocated within the extension program cannot be
% left unreleased and cause resource leaks.
% \end{enumerate}
% The raised interrupt handler first detaches the \projname{} program from its hookpoint to prevent
% further invocation.
% The handler then changes the saved registers(in the interrupt stack) from the \projname{} context
% to point to the \projname{} panic handler(section \ref{principle:eh}).
% The CPU returns the execution to the panic handler which performs the cleanup of
% any kernel objects that were live at the time of termination.
% To address the case when the target \projname{} program could be inside a helper/panic handler,
% \begin{figure}
% \includegraphics[width=1.0\linewidth]{figs/BPF_termination_state_diagram.pdf}
% \centering
% %\vspace{-10pt}
% \caption{Termination state diagram for \projname{} extensions}
% \label{fig:rex-termination}
% %\vspace{-30pt}
% \end{figure}
\projname{} uses a per-CPU tristate flag to keep track the state that the
extension program is currently in. % (Figure~\ref{fig:rex-termination}).
The states are: 1) executing extension code, 2) executing kernel helpers or
panic handlers, and 3) termination requested.
Every helper invocation changes the state from state 1) to state 2).
When excuting the timeout handler, if the flag is at state 2),
which indicates the program is under a context that cannot be safely
terminated, the termination handler modifies it to state 3) without
setting the instruction pointer.
On a helper function return, the flag is checked and if it is set to state 3),
the panic handler is called to gracefully exit the program.
% This mechanism is used to defer a panic invocation until the end of a helper execution.
% \jinghao{Shall we have a state diagram?}