Faster INTT on AArch64 by more efficient reduction #773
base: main
Intel Xeon 4th gen (c7i)
Benchmark suite | Current: 9145809 | Previous: 10cccc5 | Ratio |
---|---|---|---|
ML-KEM-512 keypair | 9618 cycles | 9542 cycles | 1.01 |
ML-KEM-512 encaps | 11339 cycles | 11327 cycles | 1.00 |
ML-KEM-512 decaps | 15492 cycles | 15376 cycles | 1.01 |
ML-KEM-768 keypair | 16263 cycles | 16316 cycles | 1.00 |
ML-KEM-768 encaps | 17757 cycles | 17831 cycles | 1.00 |
ML-KEM-768 decaps | 23496 cycles | 23580 cycles | 1.00 |
ML-KEM-1024 keypair | 22112 cycles | 22152 cycles | 1.00 |
ML-KEM-1024 encaps | 24065 cycles | 24168 cycles | 1.00 |
ML-KEM-1024 decaps | 31776 cycles | 31703 cycles | 1.00 |
This comment was automatically generated by workflow using github-action-benchmark.
Arm Cortex-A76 (Raspberry Pi 5) benchmarks
Benchmark suite | Current: 9145809 | Previous: 10cccc5 | Ratio |
---|---|---|---|
ML-KEM-512 keypair | 29538 cycles | 29505 cycles | 1.00 |
ML-KEM-512 encaps | 35007 cycles | 35114 cycles | 1.00 |
ML-KEM-512 decaps | 45706 cycles | 45774 cycles | 1.00 |
ML-KEM-768 keypair | 50239 cycles | 50334 cycles | 1.00 |
ML-KEM-768 encaps | 55962 cycles | 55745 cycles | 1.00 |
ML-KEM-768 decaps | 70866 cycles | 70755 cycles | 1.00 |
ML-KEM-1024 keypair | 73398 cycles | 73356 cycles | 1.00 |
ML-KEM-1024 encaps | 82268 cycles | 82201 cycles | 1.00 |
ML-KEM-1024 decaps | 102336 cycles | 102476 cycles | 1.00 |
Intel Xeon 4th gen (c7i) (no-opt)
Benchmark suite | Current: 9145809 | Previous: 10cccc5 | Ratio |
---|---|---|---|
ML-KEM-512 keypair | 28633 cycles | 28653 cycles | 1.00 |
ML-KEM-512 encaps | 34208 cycles | 34255 cycles | 1.00 |
ML-KEM-512 decaps | 43478 cycles | 43584 cycles | 1.00 |
ML-KEM-768 keypair | 48302 cycles | 48250 cycles | 1.00 |
ML-KEM-768 encaps | 55703 cycles | 55738 cycles | 1.00 |
ML-KEM-768 decaps | 67180 cycles | 67077 cycles | 1.00 |
ML-KEM-1024 keypair | 71636 cycles | 71690 cycles | 1.00 |
ML-KEM-1024 encaps | 82712 cycles | 82407 cycles | 1.00 |
ML-KEM-1024 decaps | 98058 cycles | 97983 cycles | 1.00 |
Intel Xeon 3rd gen (c6i)
Benchmark suite | Current: 9145809 | Previous: 10cccc5 | Ratio |
---|---|---|---|
ML-KEM-512 keypair | 16128 cycles | 16124 cycles | 1.00 |
ML-KEM-512 encaps | 18394 cycles | 18402 cycles | 1.00 |
ML-KEM-512 decaps | 24939 cycles | 24944 cycles | 1.00 |
ML-KEM-768 keypair | 27813 cycles | 27769 cycles | 1.00 |
ML-KEM-768 encaps | 29511 cycles | 29530 cycles | 1.00 |
ML-KEM-768 decaps | 38916 cycles | 38914 cycles | 1.00 |
ML-KEM-1024 keypair | 37614 cycles | 38663 cycles | 0.97 |
ML-KEM-1024 encaps | 40627 cycles | 40654 cycles | 1.00 |
ML-KEM-1024 decaps | 53237 cycles | 53187 cycles | 1.00 |
AMD EPYC 3rd gen (c6a)
Benchmark suite | Current: 9145809 | Previous: 10cccc5 | Ratio |
---|---|---|---|
ML-KEM-512 keypair | 17248 cycles | 17280 cycles | 1.00 |
ML-KEM-512 encaps | 19052 cycles | 19237 cycles | 0.99 |
ML-KEM-512 decaps | 24553 cycles | 24621 cycles | 1.00 |
ML-KEM-768 keypair | 29322 cycles | 29442 cycles | 1.00 |
ML-KEM-768 encaps | 30467 cycles | 30600 cycles | 1.00 |
ML-KEM-768 decaps | 38613 cycles | 38302 cycles | 1.01 |
ML-KEM-1024 keypair | 43219 cycles | 43169 cycles | 1.00 |
ML-KEM-1024 encaps | 44840 cycles | 44787 cycles | 1.00 |
ML-KEM-1024 decaps | 55165 cycles | 55157 cycles | 1.00 |
AMD EPYC 4th gen (c7a)
Benchmark suite | Current: 9145809 | Previous: 10cccc5 | Ratio |
---|---|---|---|
ML-KEM-512 keypair | 11639 cycles | 11606 cycles | 1.00 |
ML-KEM-512 encaps | 13110 cycles | 13113 cycles | 1.00 |
ML-KEM-512 decaps | 18047 cycles | 18015 cycles | 1.00 |
ML-KEM-768 keypair | 20110 cycles | 20084 cycles | 1.00 |
ML-KEM-768 encaps | 21274 cycles | 22172 cycles | 0.96 |
ML-KEM-768 decaps | 28105 cycles | 28123 cycles | 1.00 |
ML-KEM-1024 keypair | 26912 cycles | 26752 cycles | 1.01 |
ML-KEM-1024 encaps | 28942 cycles | 29029 cycles | 1.00 |
ML-KEM-1024 decaps | 38524 cycles | 38635 cycles | 1.00 |
Graviton4
Benchmark suite | Current: 9145809 | Previous: 10cccc5 | Ratio |
---|---|---|---|
ML-KEM-512 keypair | 18045 cycles | 18000 cycles | 1.00 |
ML-KEM-512 encaps | 21432 cycles | 21444 cycles | 1.00 |
ML-KEM-512 decaps | 28159 cycles | 28166 cycles | 1.00 |
ML-KEM-768 keypair | 30999 cycles | 31057 cycles | 1.00 |
ML-KEM-768 encaps | 34120 cycles | 34016 cycles | 1.00 |
ML-KEM-768 decaps | 43793 cycles | 43900 cycles | 1.00 |
ML-KEM-1024 keypair | 44854 cycles | 44869 cycles | 1.00 |
ML-KEM-1024 encaps | 50236 cycles | 50304 cycles | 1.00 |
ML-KEM-1024 decaps | 63280 cycles | 63219 cycles | 1.00 |
Intel Xeon 3rd gen (c6i) (no-opt)
Benchmark suite | Current: 9145809 | Previous: 10cccc5 | Ratio |
---|---|---|---|
ML-KEM-512 keypair | 47206 cycles | 47215 cycles | 1.00 |
ML-KEM-512 encaps | 55741 cycles | 55769 cycles | 1.00 |
ML-KEM-512 decaps | 71369 cycles | 71385 cycles | 1.00 |
ML-KEM-768 keypair | 76599 cycles | 76671 cycles | 1.00 |
ML-KEM-768 encaps | 87355 cycles | 87457 cycles | 1.00 |
ML-KEM-768 decaps | 107988 cycles | 108062 cycles | 1.00 |
ML-KEM-1024 keypair | 112075 cycles | 112216 cycles | 1.00 |
ML-KEM-1024 encaps | 126310 cycles | 126357 cycles | 1.00 |
ML-KEM-1024 decaps | 152593 cycles | 152695 cycles | 1.00 |
AMD EPYC 3rd gen (c6a) (no-opt)
Benchmark suite | Current: 9145809 | Previous: 10cccc5 | Ratio |
---|---|---|---|
ML-KEM-512 keypair | 39826 cycles | 39831 cycles | 1.00 |
ML-KEM-512 encaps | 48288 cycles | 48310 cycles | 1.00 |
ML-KEM-512 decaps | 62486 cycles | 62593 cycles | 1.00 |
ML-KEM-768 keypair | 64685 cycles | 64743 cycles | 1.00 |
ML-KEM-768 encaps | 75528 cycles | 75852 cycles | 1.00 |
ML-KEM-768 decaps | 94582 cycles | 94857 cycles | 1.00 |
ML-KEM-1024 keypair | 96216 cycles | 96185 cycles | 1.00 |
ML-KEM-1024 encaps | 109711 cycles | 109744 cycles | 1.00 |
ML-KEM-1024 decaps | 133301 cycles | 133361 cycles | 1.00 |
AMD EPYC 4th gen (c7a) (no-opt)
Benchmark suite | Current: 9145809 | Previous: 10cccc5 | Ratio |
---|---|---|---|
ML-KEM-512 keypair | 36460 cycles | 36362 cycles | 1.00 |
ML-KEM-512 encaps | 42901 cycles | 42854 cycles | 1.00 |
ML-KEM-512 decaps | 56011 cycles | 55874 cycles | 1.00 |
ML-KEM-768 keypair | 59042 cycles | 58969 cycles | 1.00 |
ML-KEM-768 encaps | 67558 cycles | 67369 cycles | 1.00 |
ML-KEM-768 decaps | 84638 cycles | 84352 cycles | 1.00 |
ML-KEM-1024 keypair | 88274 cycles | 88273 cycles | 1.00 |
ML-KEM-1024 encaps | 98962 cycles | 98757 cycles | 1.00 |
ML-KEM-1024 decaps | 120516 cycles | 120450 cycles | 1.00 |
Graviton3
Benchmark suite | Current: 9145809 | Previous: 10cccc5 | Ratio |
---|---|---|---|
ML-KEM-512 keypair | 19167 cycles | 19153 cycles | 1.00 |
ML-KEM-512 encaps | 22854 cycles | 22843 cycles | 1.00 |
ML-KEM-512 decaps | 30189 cycles | 30135 cycles | 1.00 |
ML-KEM-768 keypair | 32820 cycles | 32889 cycles | 1.00 |
ML-KEM-768 encaps | 36544 cycles | 36454 cycles | 1.00 |
ML-KEM-768 decaps | 47018 cycles | 47149 cycles | 1.00 |
ML-KEM-1024 keypair | 47343 cycles | 47269 cycles | 1.00 |
ML-KEM-1024 encaps | 53229 cycles | 53300 cycles | 1.00 |
ML-KEM-1024 decaps | 67274 cycles | 67339 cycles | 1.00 |
Graviton4 (no-opt)
Benchmark suite | Current: 9145809 | Previous: 10cccc5 | Ratio |
---|---|---|---|
ML-KEM-512 keypair | 35827 cycles | 35799 cycles | 1.00 |
ML-KEM-512 encaps | 40772 cycles | 40763 cycles | 1.00 |
ML-KEM-512 decaps | 52120 cycles | 52119 cycles | 1.00 |
ML-KEM-768 keypair | 59147 cycles | 59152 cycles | 1.00 |
ML-KEM-768 encaps | 66656 cycles | 66727 cycles | 1.00 |
ML-KEM-768 decaps | 81285 cycles | 81291 cycles | 1.00 |
ML-KEM-1024 keypair | 89023 cycles | 88951 cycles | 1.00 |
ML-KEM-1024 encaps | 98816 cycles | 98864 cycles | 1.00 |
ML-KEM-1024 decaps | 117768 cycles | 117736 cycles | 1.00 |
Graviton2
Benchmark suite | Current: 9145809 | Previous: 10cccc5 | Ratio |
---|---|---|---|
ML-KEM-512 keypair | 29538 cycles | 29526 cycles | 1.00 |
ML-KEM-512 encaps | 35011 cycles | 35047 cycles | 1.00 |
ML-KEM-512 decaps | 45727 cycles | 45779 cycles | 1.00 |
ML-KEM-768 keypair | 50417 cycles | 50393 cycles | 1.00 |
ML-KEM-768 encaps | 56157 cycles | 55853 cycles | 1.01 |
ML-KEM-768 decaps | 71270 cycles | 70784 cycles | 1.01 |
ML-KEM-1024 keypair | 73395 cycles | 73370 cycles | 1.00 |
ML-KEM-1024 encaps | 82286 cycles | 82215 cycles | 1.00 |
ML-KEM-1024 decaps | 102373 cycles | 102518 cycles | 1.00 |
Graviton3 (no-opt)
Benchmark suite | Current: 9145809 | Previous: 10cccc5 | Ratio |
---|---|---|---|
ML-KEM-512 keypair | 39015 cycles | 39009 cycles | 1.00 |
ML-KEM-512 encaps | 44897 cycles | 44884 cycles | 1.00 |
ML-KEM-512 decaps | 56737 cycles | 56729 cycles | 1.00 |
ML-KEM-768 keypair | 64350 cycles | 64398 cycles | 1.00 |
ML-KEM-768 encaps | 71784 cycles | 71990 cycles | 1.00 |
ML-KEM-768 decaps | 87754 cycles | 87873 cycles | 1.00 |
ML-KEM-1024 keypair | 96092 cycles | 96101 cycles | 1.00 |
ML-KEM-1024 encaps | 106189 cycles | 106189 cycles | 1.00 |
ML-KEM-1024 decaps | 127092 cycles | 126808 cycles | 1.00 |
Graviton2 (no-opt)
Benchmark suite | Current: 9145809 | Previous: 10cccc5 | Ratio |
---|---|---|---|
ML-KEM-512 keypair | 59702 cycles | 59700 cycles | 1.00 |
ML-KEM-512 encaps | 68325 cycles | 68303 cycles | 1.00 |
ML-KEM-512 decaps | 87055 cycles | 87054 cycles | 1.00 |
ML-KEM-768 keypair | 99116 cycles | 99376 cycles | 1.00 |
ML-KEM-768 encaps | 110671 cycles | 110537 cycles | 1.00 |
ML-KEM-768 decaps | 135571 cycles | 135383 cycles | 1.00 |
ML-KEM-1024 keypair | 148896 cycles | 148972 cycles | 1.00 |
ML-KEM-1024 encaps | 164325 cycles | 164558 cycles | 1.00 |
ML-KEM-1024 decaps | 196240 cycles | 195742 cycles | 1.00 |
SpacemiT K1 8 (Banana Pi F3) benchmarks
Benchmark suite | Current: 9145809 | Previous: 10cccc5 | Ratio |
---|---|---|---|
ML-KEM-512 keypair | 226497 cycles | 226490 cycles | 1.00 |
ML-KEM-512 encaps | 271273 cycles | 271255 cycles | 1.00 |
ML-KEM-512 decaps | 345181 cycles | 345253 cycles | 1.00 |
ML-KEM-768 keypair | 374422 cycles | 374467 cycles | 1.00 |
ML-KEM-768 encaps | 432997 cycles | 433065 cycles | 1.00 |
ML-KEM-768 decaps | 530612 cycles | 530815 cycles | 1.00 |
ML-KEM-1024 keypair | 557579 cycles | 557749 cycles | 1.00 |
ML-KEM-1024 encaps | 633774 cycles | 633982 cycles | 1.00 |
ML-KEM-1024 decaps | 756847 cycles | 757425 cycles | 1.00 |
Arm Cortex-A72 (Raspberry Pi 4) benchmarks
Benchmark suite | Current: 9145809 | Previous: 10cccc5 | Ratio |
---|---|---|---|
ML-KEM-512 keypair | 54741 cycles | 52947 cycles | 1.03 |
ML-KEM-512 encaps | 62065 cycles | 61278 cycles | 1.01 |
ML-KEM-512 decaps | 78999 cycles | 79155 cycles | 1.00 |
ML-KEM-768 keypair | 90084 cycles | 90196 cycles | 1.00 |
ML-KEM-768 encaps | 98334 cycles | 98261 cycles | 1.00 |
ML-KEM-768 decaps | 122559 cycles | 122604 cycles | 1.00 |
ML-KEM-1024 keypair | 135169 cycles | 135659 cycles | 1.00 |
ML-KEM-1024 encaps | 149008 cycles | 147673 cycles | 1.01 |
ML-KEM-1024 decaps | 181956 cycles | 181195 cycles | 1.00 |
⚠️ Performance Alert ⚠️
Possible performance regression was detected for benchmark 'Arm Cortex-A72 (Raspberry Pi 4) benchmarks'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.03.
Benchmark suite | Current: 9145809 | Previous: 10cccc5 | Ratio |
---|---|---|---|
ML-KEM-512 keypair | 54741 cycles | 52947 cycles | 1.03 |
Arm Cortex-A55 (Snapdragon 888) benchmarks
Benchmark suite | Current: 9145809 | Previous: 10cccc5 | Ratio |
---|---|---|---|
ML-KEM-512 keypair | 59408 cycles | 59416 cycles | 1.00 |
ML-KEM-512 encaps | 67091 cycles | 67070 cycles | 1.00 |
ML-KEM-512 decaps | 86482 cycles | 86322 cycles | 1.00 |
ML-KEM-768 keypair | 101062 cycles | 100949 cycles | 1.00 |
ML-KEM-768 encaps | 112048 cycles | 112004 cycles | 1.00 |
ML-KEM-768 decaps | 139354 cycles | 139239 cycles | 1.00 |
ML-KEM-1024 keypair | 153486 cycles | 153412 cycles | 1.00 |
ML-KEM-1024 encaps | 170148 cycles | 170984 cycles | 1.00 |
ML-KEM-1024 decaps | 206992 cycles | 207582 cycles | 1.00 |
@rod-chapman Good catch, this may indeed be a small improvement. Is there more work to do in this PR?
Yes... is "slothy-cli" installed by the Nix environment? If not, what platform should I be running SLOTHY on?
cee0db3 to 918d5b2
624b800 to 64bd87e
Signed-off-by: Rod Chapman <rodchap@amazon.com>
64bd87e to 2f4c339
This PR introduces a small change in AArch64 intt_clean.S that reduces the number of intermediate
reduction steps from 4 to 3, through more thorough tracking of coefficient bounds.
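To illustrate why tighter bound tracking can save reductions, here is a toy Python model. It is a sketch under stated assumptions, not the schedule used in intt_clean.S, and it does not reproduce the 4-to-3 figure. Assumptions: coefficients live in signed 16-bit lanes, every input starts with absolute bound q = 3329, each Gentleman-Sande butterfly maps (a, b) to (a + b, zeta * (a - b)), and the multiplied output is again bounded by q. The names `track_intt_bounds` and `uniform_reductions` are hypothetical.

```python
Q = 3329          # ML-KEM modulus
LIMIT = 1 << 15   # assumed: signed 16-bit lanes, so we need |x| < 2^15


def track_intt_bounds(n=256, start_bound=Q):
    """Per-coefficient bound tracking over the 7 inverse-NTT layers.

    Returns (number of coefficient reductions inserted, largest final bound).
    A reduction is modelled as resetting a coefficient's bound to Q.
    """
    bound = [start_bound] * n
    reductions = 0
    length = 2
    while length <= n // 2:
        for block in range(0, n, 2 * length):
            for j in range(block, block + length):
                k = j + length
                # a + b and a - b share the same worst case: bound[j] + bound[k]
                if bound[j] + bound[k] >= LIMIT:
                    for idx in (j, k):
                        if bound[idx] > Q:
                            bound[idx] = Q
                            reductions += 1
                # the sum output grows; the multiplied output drops back to Q
                bound[j], bound[k] = bound[j] + bound[k], Q
        length *= 2
    return reductions, max(bound)


def uniform_reductions(n=256, start_bound=Q):
    """Coarse tracking: one worst-case bound shared by all coefficients."""
    bound, reductions, length = start_bound, 0, 2
    while length <= n // 2:
        if 2 * bound >= LIMIT:
            bound = Q          # reduce every coefficient
            reductions += n
        bound *= 2
        length *= 2
    return reductions


if __name__ == "__main__":
    print(track_intt_bounds())   # per-coefficient tracking
    print(uniform_reductions())  # single global bound
```

In this model the per-coefficient variant inserts 84 coefficient reductions and ends with a largest bound of 8q, while the single global bound forces 512. Only the coefficients that repeatedly land on the sum side of a butterfly grow, which is the kind of slack that finer bound tracking can exploit.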
TODO