Lace 2.0 #10
base: master
Conversation
Interesting. I actually thought about this the first time I saw Lace, but I quickly realized it is by far not as bad with macros as it looked at first sight. One way or another, I would certainly welcome running the benchmarks.

So instead I will play the devil's advocate and propose to stay with macros 😉. The sole reasoning of slightly decreasing the use of macros feels kind of niche to me if there are no real issues with them in practice. I for myself would be more interested e.g. in power saving by spinning threads up and down at runtime, as discussed in #9 😉.

To support the macros side of the argument, feel free to take a look at https://github.com/Hirrolot/interface99 and https://github.com/Hirrolot/datatype99 for modern macro libraries for C.
Regarding #9, if no tasks are running, then you can already suspend the workers. The hard part is if we would want to remove workers on the fly while they are doing work, which means that work would need to be executed by another worker, and cleanly disposing of the current work will be a challenge; I don't know if there is an efficient way to do this. Also, I'm not sure if Lace is the right framework for this. Lace is based on busy-waiting workers, which is not suitable for mobile platforms; the intended application is large computations performed on multi-core machines.

Regarding the linked repositories, I like those READMEs and maybe I should improve the README.md of my own projects a bit.

Regarding the Lace 2.0 API, so far I find it cleaner.
Yes, the macros feel painful at first, but Lace is a pretty low-level library anyway and it will mostly be used by people who want to optimize for performance. I don't know whether changing the API helps that much: maybe just adding this to the documentation would help people understand what the "C-style" semantics of this macro are. The points made in #9 align quite well with the discussion we are having in the linked Storm thread.
@trolando Happy birthday, btw! :-)
The following should be taken with quite some reservations, considering I've only been reading Sylvan's source code, not actually writing anything using Lace. I would not say that the new API is drastically better or drastically better to read: slightly, but not a lot.

As Sebastian Junges also said: maybe one cannot make Lace much better at the level of abstraction it is at. A lot can probably be done merely by providing more documentation, for example expanding the fibonacci example into a larger Markdown example, with some drawings, introducing one concept at a time.

In many cases, the pattern is always "spawn n-1 recursive calls and run the n'th on the same thread". I wonder whether it is possible to provide a higher-level abstraction of this pattern, i.e. something of the form:

```
TASK_1(int, fib, int, n)
{
    int a, b = fib(n-1, n-2);
    return a+b;
}
```

Finally, you may have a better idea of whether this is a good idea after looking at how this affects Sylvan, i.e. both its code clarity and its performance?
Improving the documentation is a good idea. I've updated the README.md on the master branch. The original reason for working on a new API is that the macros can be confusing for novice developers working on Sylvan or LTSmin for their student projects.
The `par` construct that was just added to the Flix language might also be relevant for an intuitive design: https://github.com/flix/flix/blob/master/main/src/library/RedBlackTree.flix#L423
Lace 2.0 is a redesign of the API, work in progress. The problem to be solved is that in its current (pre-2.0) state, Lace is quite intrusive for a developer as it uses macros to replace normal function definitions. While some amount of intrusion is inevitable, one might say it is excessive in the current form.
Example:
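Roughly, a pre-2.0 task such as fibonacci is written entirely through macros; the sketch below follows the existing TASK_1/SPAWN/CALL/SYNC convention, though minor details may differ from the examples shipped with Lace.

```c
/* Pre-2.0 style: the whole task definition goes through macros. */
TASK_1(int, fib, int, n)        /* task "fib": returns int, takes one int       */
{
    if (n < 2) return n;
    SPAWN(fib, n - 1);          /* push a stealable subtask on the deque        */
    int b = CALL(fib, n - 2);   /* run the second subtask on this worker        */
    int a = SYNC(fib);          /* wait for (or steal back) the spawned subtask */
    return a + b;
}
```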
In a package such as Sylvan, this means that large parts of the source code consist of these kinds of macros. This is clearly difficult to read at first.
The reason for these macros is that they allow automatic injection of pointers to the Lace worker struct and the task queue head as parameters of Lace tasks. This shaves off a tiny bit of work, because in particular the queue head does not need to be stored in memory, except possibly on the program stack. This efficiency gain is most visible in very fine-grained parallelism with tiny tasks, such as fibonacci and n-queens.
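To make the hidden-parameter injection concrete, the following sketch shows the general shape of what the macros in the example above compile down to; the identifiers are placeholders, not the actual names generated by lace.h.

```c
/* Placeholder types and helpers standing in for what the TASK macros generate;
 * the real names and signatures in lace.h differ. */
typedef struct WorkerP WorkerP;                       /* per-worker state       */
typedef struct Task { unsigned char pad[64]; } Task;  /* one deque slot (dummy) */

void fib_spawn(WorkerP *w, Task *head, int n);
int  fib_call (WorkerP *w, Task *head, int n);
int  fib_sync (WorkerP *w, Task *head);

/* A TASK_1(int, fib, int, n) { ... } body is compiled into a function that
 * receives the worker and the deque head as hidden extra parameters, so the
 * head lives in a register or on the stack instead of being re-read from the
 * worker struct on every SPAWN/CALL/SYNC. */
int fib_work(WorkerP *worker, Task *dq_head, int n)
{
    if (n < 2) return n;
    fib_spawn(worker, dq_head++, n - 1);        /* SPAWN(fib, n-1): push, advance head */
    int b = fib_call(worker, dq_head, n - 2);   /* CALL(fib, n-2): run directly        */
    int a = fib_sync(worker, --dq_head);        /* SYNC(fib): retract head, sync       */
    return a + b;
}
```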
An easier API would be this:
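A sketch of what this could look like for fib follows; the concrete 2.0 syntax is still work in progress, so the spelling below is illustrative only, and the helper declarations are assumptions based on the description further down.

```c
/* Assumed generated helpers (see the TASK macro discussed further below). */
void fib_SPAWN(int n);
int  fib_SYNC(void);

/* Illustrative 2.0-style sketch, not final syntax: the task body is an
 * ordinary C function. */
int fib(int n)
{
    if (n < 2) return n;
    fib_SPAWN(n - 1);       /* looks up the thread-local worker struct       */
    int b = fib(n - 2);     /* plain recursive call; no CALL macro needed    */
    int a = fib_SYNC();     /* goes through the thread-local worker again    */
    return a + b;
}
```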
In this case, the Lace worker struct is a thread-local variable, which needs to be obtained each time `SPAWN` and `SYNC` are used, and the task queue head is stored inside the worker struct. So this requires a write operation to the worker struct each time `SPAWN` and `SYNC` are used, which was avoided before.

We still need some way of defining the helper functions for work stealing. This is now a `TASK` macro that simply defines a number of relatively small `static inline` functions, such as `fib_SPAWN`, `fib_SYNC`, and something like `fib_RUN` that ensures that the `fib` function is executed by a Lace worker. One can also just call `fib` directly, which required a `CALL` macro before.
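For concreteness, the helpers generated by such a `TASK` macro could have roughly the following shape; the names come from the description above, but the signatures are illustrative and not final.

```c
/* Illustrative declarations of the helpers a new-style TASK macro could
 * generate for fib. */
static inline void fib_SPAWN(int n); /* push a stealable fib(n) task via the
                                        thread-local worker                   */
static inline int  fib_SYNC(void);   /* wait for, or steal back, the most
                                        recently spawned fib task             */
static inline int  fib_RUN(int n);   /* ensure fib(n) is executed by a Lace
                                        worker; usable from non-worker code   */
```

Code outside the Lace workers would then call something like `fib_RUN(40)`, while code already running as a task can simply call `fib(40)` directly.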
What does this mean for performance?

I ran a number of benchmarks on my desktop computer running Linux (Manjaro). See the following.

From these benchmarks, it is obvious that the very fine-grained benchmarks fib and nqueens suffer a performance degradation of about 30%; these benchmarks consist of billions of nearly empty tasks. The same holds to a lesser degree for `integrate`, which is a program that does some arithmetic work without accessing memory. The other benchmarks are not significantly affected by this change; in particular, the challenging uts program (unbalanced tree search) shows no significant difference in performance or speedup.

Notice also that in the case of fib, the performance difference is equal to the difference of using another compiler. The reason the fibonacci benchmark is so fast with current Lace is a performance optimization that gcc does better than clang. A caveat here is that I compiled with `-O2` and not `-O3`; maybe I should rerun all benchmarks with `-O3`.

The question is now whether or not to go ahead with this new design, at the cost of performance for certain niche benchmark programs.