Add generalized 2- and 3-layer merging of the forward NTT #784

Open
wants to merge 1 commit into
base: main

rod-chapman
Contributor

This PR adds functions in poly.c that perform generalized 2- and 3-layer merges for the forward NTT.
These complement the existing single-layer processing function, so that many different "merge strategies"
can be implemented and measured for performance on different platforms.

This commit initially implements a "3211" merging (a 3-layer merge, then a 2-layer merge, then two single layers) that has been found experimentally to be reasonably efficient
on AArch64 platforms using GCC 13 and 14.
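
To illustrate what a merged layer means here, the following is a minimal sketch of a 2-layer merge next to a single-layer pass. It is not the code from this PR: the names `ntt_layer`, `ntt_2_layers`, `butterfly` and `init_zetas` are hypothetical, and it uses plain `%` reduction with unscaled twiddle factors instead of the Montgomery arithmetic (`mlk_fqmul`) a real implementation would use. The point is only that four coefficients can pass through two layers of butterflies in one visit, and that the result matches two separate single-layer passes.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define N 256  /* coefficients per polynomial */
#define Q 3329 /* ML-KEM modulus */

static int16_t zetas[128]; /* hypothetical table, filled by init_zetas() */

static unsigned bitrev7(unsigned x)
{
  unsigned r = 0;
  for (int i = 0; i < 7; i++)
  {
    r = (r << 1) | ((x >> i) & 1u);
  }
  return r;
}

/* zetas[k] = 17^bitrev7(k) mod Q, stored unscaled (no Montgomery factor). */
static void init_zetas(void)
{
  for (unsigned k = 0; k < 128; k++)
  {
    int32_t z = 1;
    for (unsigned e = bitrev7(k); e > 0; e--)
    {
      z = (z * 17) % Q;
    }
    zetas[k] = (int16_t)z;
  }
}

/* Cooley-Tukey butterfly on values kept in [0, Q). */
static void butterfly(int16_t *a, int16_t *b, int16_t zeta)
{
  int32_t t = ((int32_t)zeta * *b) % Q;
  int32_t lo = *a;
  *a = (int16_t)((lo + t) % Q);
  *b = (int16_t)((lo - t + Q) % Q);
}

/* One layer (1..7): a full pass over all N coefficients. */
static void ntt_layer(int16_t r[N], unsigned layer)
{
  unsigned len = N >> layer;
  unsigned k = 1u << (layer - 1);
  for (unsigned start = 0; start < N; start += 2 * len, k++)
  {
    for (unsigned j = start; j < start + len; j++)
    {
      butterfly(&r[j], &r[j + len], zetas[k]);
    }
  }
}

/* Layers `layer` and `layer + 1` merged: one pass, four coefficients at a time. */
static void ntt_2_layers(int16_t r[N], unsigned layer)
{
  unsigned len = N >> layer, half = len / 2;
  unsigned k = 1u << (layer - 1);
  assert(layer >= 1 && layer <= 6);
  for (unsigned start = 0; start < N; start += 2 * len, k++)
  {
    for (unsigned j = start; j < start + half; j++)
    {
      int16_t *c0 = &r[j], *c1 = &r[j + half];
      int16_t *c2 = &r[j + len], *c3 = &r[j + len + half];
      butterfly(c0, c2, zetas[k]);         /* first of the two layers  */
      butterfly(c1, c3, zetas[k]);
      butterfly(c0, c1, zetas[2 * k]);     /* second of the two layers */
      butterfly(c2, c3, zetas[2 * k + 1]);
    }
  }
}

/* Returns 1 if the merged pass equals two single-layer passes. */
static int merged_matches_single(void)
{
  int16_t a[N], b[N];
  init_zetas();
  for (unsigned i = 0; i < N; i++)
  {
    a[i] = b[i] = (int16_t)((i * 7u + 3u) % Q);
  }
  ntt_layer(a, 1);
  ntt_layer(a, 2);
  ntt_2_layers(b, 1);
  return memcmp(a, b, sizeof a) == 0;
}
```

Calling `merged_matches_single()` from a test harness checks the equivalence on one sample input. Whether the merged form is actually faster depends on the compiler and target, which is what the benchmarks in this PR are for.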

CBMC proofs of these new functions are TBD.

Further benchmarks and experiments will determine the best layer merge for additional targets and compilers.

@rod-chapman rod-chapman added benchmark this PR should be benchmarked in CI aarch64 labels Feb 18, 2025
@rod-chapman rod-chapman requested a review from a team as a code owner February 18, 2025 15:22
@rod-chapman rod-chapman added benchmark this PR should be benchmarked in CI and removed benchmark this PR should be benchmarked in CI labels Feb 18, 2025
@oqs-bot oqs-bot left a comment

Intel Xeon 4th gen (c7i)

| Benchmark suite | Current: b347854 | Previous: 8b27230 | Ratio |
|---|---|---|---|
| ML-KEM-512 keypair | 9522 cycles | 9502 cycles | 1.00 |
| ML-KEM-512 encaps | 11254 cycles | 11305 cycles | 1.00 |
| ML-KEM-512 decaps | 15353 cycles | 15303 cycles | 1.00 |
| ML-KEM-768 keypair | 16322 cycles | 16793 cycles | 0.97 |
| ML-KEM-768 encaps | 18634 cycles | 18519 cycles | 1.01 |
| ML-KEM-768 decaps | 23559 cycles | 24419 cycles | 0.96 |
| ML-KEM-1024 keypair | 22088 cycles | 22134 cycles | 1.00 |
| ML-KEM-1024 encaps | 24053 cycles | 24138 cycles | 1.00 |
| ML-KEM-1024 decaps | 31797 cycles | 31717 cycles | 1.00 |

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

Arm Cortex-A76 (Raspberry Pi 5) benchmarks

| Benchmark suite | Current: b347854 | Previous: 8b27230 | Ratio |
|---|---|---|---|
| ML-KEM-512 keypair | 29538 cycles | 29507 cycles | 1.00 |
| ML-KEM-512 encaps | 35003 cycles | 35111 cycles | 1.00 |
| ML-KEM-512 decaps | 45689 cycles | 45732 cycles | 1.00 |
| ML-KEM-768 keypair | 50196 cycles | 50348 cycles | 1.00 |
| ML-KEM-768 encaps | 55914 cycles | 55796 cycles | 1.00 |
| ML-KEM-768 decaps | 70910 cycles | 70726 cycles | 1.00 |
| ML-KEM-1024 keypair | 73419 cycles | 73382 cycles | 1.00 |
| ML-KEM-1024 encaps | 82249 cycles | 82192 cycles | 1.00 |
| ML-KEM-1024 decaps | 102310 cycles | 102432 cycles | 1.00 |

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

Intel Xeon 4th gen (c7i) (no-opt)

| Benchmark suite | Current: b347854 | Previous: 8b27230 | Ratio |
|---|---|---|---|
| ML-KEM-512 keypair | 26995 cycles | 28613 cycles | 0.94 |
| ML-KEM-512 encaps | 34298 cycles | 34470 cycles | 1.00 |
| ML-KEM-512 decaps | 42158 cycles | 43677 cycles | 0.97 |
| ML-KEM-768 keypair | 44931 cycles | 48417 cycles | 0.93 |
| ML-KEM-768 encaps | 54395 cycles | 55797 cycles | 0.97 |
| ML-KEM-768 decaps | 64927 cycles | 66997 cycles | 0.97 |
| ML-KEM-1024 keypair | 66604 cycles | 71667 cycles | 0.93 |
| ML-KEM-1024 encaps | 80453 cycles | 82614 cycles | 0.97 |
| ML-KEM-1024 decaps | 93797 cycles | 98027 cycles | 0.96 |

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

AMD EPYC 3rd gen (c6a)

| Benchmark suite | Current: b347854 | Previous: 8b27230 | Ratio |
|---|---|---|---|
| ML-KEM-512 keypair | 17241 cycles | 17273 cycles | 1.00 |
| ML-KEM-512 encaps | 19090 cycles | 19072 cycles | 1.00 |
| ML-KEM-512 decaps | 24524 cycles | 24524 cycles | 1 |
| ML-KEM-768 keypair | 29741 cycles | 29736 cycles | 1.00 |
| ML-KEM-768 encaps | 30729 cycles | 30758 cycles | 1.00 |
| ML-KEM-768 decaps | 38381 cycles | 38358 cycles | 1.00 |
| ML-KEM-1024 keypair | 43096 cycles | 43117 cycles | 1.00 |
| ML-KEM-1024 encaps | 44776 cycles | 44798 cycles | 1.00 |
| ML-KEM-1024 decaps | 55106 cycles | 55115 cycles | 1.00 |

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

AMD EPYC 4th gen (c7a)

| Benchmark suite | Current: b347854 | Previous: 8b27230 | Ratio |
|---|---|---|---|
| ML-KEM-512 keypair | 11496 cycles | 11488 cycles | 1.00 |
| ML-KEM-512 encaps | 13142 cycles | 13158 cycles | 1.00 |
| ML-KEM-512 decaps | 18838 cycles | 17997 cycles | 1.05 |
| ML-KEM-768 keypair | 20119 cycles | 20040 cycles | 1.00 |
| ML-KEM-768 encaps | 22246 cycles | 21141 cycles | 1.05 |
| ML-KEM-768 decaps | 28105 cycles | 28106 cycles | 1.00 |
| ML-KEM-1024 keypair | 26900 cycles | 26680 cycles | 1.01 |
| ML-KEM-1024 encaps | 28820 cycles | 28828 cycles | 1.00 |
| ML-KEM-1024 decaps | 38242 cycles | 38388 cycles | 1.00 |

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

⚠️ Performance Alert ⚠️

A possible performance regression was detected for benchmark 'AMD EPYC 4th gen (c7a)'.
The benchmark result of this commit is worse than the previous benchmark result by more than the threshold of 1.03.

| Benchmark suite | Current: b347854 | Previous: 8b27230 | Ratio |
|---|---|---|---|
| ML-KEM-512 decaps | 18838 cycles | 17997 cycles | 1.05 |
| ML-KEM-768 encaps | 22246 cycles | 21141 cycles | 1.05 |

This comment was automatically generated by workflow using github-action-benchmark.
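
For readers checking the alert by hand, here is a small sketch of the comparison it implies. The helper names are hypothetical, and it assumes the alert compares the unrounded ratio (current cycles divided by previous cycles) strictly against the threshold; the source only states that the threshold is 1.03.

```c
/* Ratio as shown in the tables: current cycles / previous cycles. */
static double bench_ratio(unsigned long current, unsigned long previous)
{
  return (double)current / (double)previous;
}

/* Assumption: an alert fires only when the unrounded ratio is strictly
 * above the threshold. */
static int exceeds_threshold(unsigned long current, unsigned long previous,
                             double threshold)
{
  return bench_ratio(current, previous) > threshold;
}
```

Under that assumption, `exceeds_threshold(18838, 17997, 1.03)` and `exceeds_threshold(22246, 21141, 1.03)` are both true, matching the two rows flagged above. A result such as 39883 vs. 38889 cycles is displayed as 1.03 after rounding but stays just under the threshold.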

@oqs-bot oqs-bot left a comment

Intel Xeon 3rd gen (c6i)

| Benchmark suite | Current: b347854 | Previous: 8b27230 | Ratio |
|---|---|---|---|
| ML-KEM-512 keypair | 16120 cycles | 16117 cycles | 1.00 |
| ML-KEM-512 encaps | 18401 cycles | 18381 cycles | 1.00 |
| ML-KEM-512 decaps | 24914 cycles | 24908 cycles | 1.00 |
| ML-KEM-768 keypair | 27846 cycles | 27838 cycles | 1.00 |
| ML-KEM-768 encaps | 29482 cycles | 29485 cycles | 1.00 |
| ML-KEM-768 decaps | 39883 cycles | 38889 cycles | 1.03 |
| ML-KEM-1024 keypair | 37560 cycles | 37615 cycles | 1.00 |
| ML-KEM-1024 encaps | 40602 cycles | 40590 cycles | 1.00 |
| ML-KEM-1024 decaps | 53207 cycles | 53211 cycles | 1.00 |

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

AMD EPYC 3rd gen (c6a) (no-opt)

| Benchmark suite | Current: b347854 | Previous: 8b27230 | Ratio |
|---|---|---|---|
| ML-KEM-512 keypair | 33634 cycles | 39833 cycles | 0.84 |
| ML-KEM-512 encaps | 45211 cycles | 48324 cycles | 0.94 |
| ML-KEM-512 decaps | 56433 cycles | 62599 cycles | 0.90 |
| ML-KEM-768 keypair | 55707 cycles | 64737 cycles | 0.86 |
| ML-KEM-768 encaps | 71094 cycles | 75488 cycles | 0.94 |
| ML-KEM-768 decaps | 85181 cycles | 94677 cycles | 0.90 |
| ML-KEM-1024 keypair | 83910 cycles | 96107 cycles | 0.87 |
| ML-KEM-1024 encaps | 103465 cycles | 109679 cycles | 0.94 |
| ML-KEM-1024 decaps | 120754 cycles | 133405 cycles | 0.91 |

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

Graviton2

| Benchmark suite | Current: b347854 | Previous: 8b27230 | Ratio |
|---|---|---|---|
| ML-KEM-512 keypair | 29546 cycles | 29517 cycles | 1.00 |
| ML-KEM-512 encaps | 35013 cycles | 35125 cycles | 1.00 |
| ML-KEM-512 decaps | 45696 cycles | 45755 cycles | 1.00 |
| ML-KEM-768 keypair | 50212 cycles | 50366 cycles | 1.00 |
| ML-KEM-768 encaps | 55917 cycles | 55795 cycles | 1.00 |
| ML-KEM-768 decaps | 70847 cycles | 70709 cycles | 1.00 |
| ML-KEM-1024 keypair | 73446 cycles | 73389 cycles | 1.00 |
| ML-KEM-1024 encaps | 82269 cycles | 82219 cycles | 1.00 |
| ML-KEM-1024 decaps | 102337 cycles | 102480 cycles | 1.00 |

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

AMD EPYC 4th gen (c7a) (no-opt)

| Benchmark suite | Current: b347854 | Previous: 8b27230 | Ratio |
|---|---|---|---|
| ML-KEM-512 keypair | 27956 cycles | 36369 cycles | 0.77 |
| ML-KEM-512 encaps | 38694 cycles | 42834 cycles | 0.90 |
| ML-KEM-512 decaps | 47577 cycles | 55846 cycles | 0.85 |
| ML-KEM-768 keypair | 46768 cycles | 58943 cycles | 0.79 |
| ML-KEM-768 encaps | 61423 cycles | 67340 cycles | 0.91 |
| ML-KEM-768 decaps | 72363 cycles | 84327 cycles | 0.86 |
| ML-KEM-1024 keypair | 70756 cycles | 88248 cycles | 0.80 |
| ML-KEM-1024 encaps | 89874 cycles | 98775 cycles | 0.91 |
| ML-KEM-1024 decaps | 102937 cycles | 120462 cycles | 0.85 |

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

Intel Xeon 3rd gen (c6i) (no-opt)

| Benchmark suite | Current: b347854 | Previous: 8b27230 | Ratio |
|---|---|---|---|
| ML-KEM-512 keypair | 40356 cycles | 47211 cycles | 0.85 |
| ML-KEM-512 encaps | 52075 cycles | 55838 cycles | 0.93 |
| ML-KEM-512 decaps | 64587 cycles | 71382 cycles | 0.90 |
| ML-KEM-768 keypair | 67962 cycles | 76763 cycles | 0.89 |
| ML-KEM-768 encaps | 83200 cycles | 87504 cycles | 0.95 |
| ML-KEM-768 decaps | 99203 cycles | 108230 cycles | 0.92 |
| ML-KEM-1024 keypair | 99825 cycles | 112313 cycles | 0.89 |
| ML-KEM-1024 encaps | 120027 cycles | 126644 cycles | 0.95 |
| ML-KEM-1024 decaps | 140014 cycles | 152852 cycles | 0.92 |

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

Graviton3

| Benchmark suite | Current: b347854 | Previous: 8b27230 | Ratio |
|---|---|---|---|
| ML-KEM-512 keypair | 19165 cycles | 19133 cycles | 1.00 |
| ML-KEM-512 encaps | 22852 cycles | 22891 cycles | 1.00 |
| ML-KEM-512 decaps | 30177 cycles | 30193 cycles | 1.00 |
| ML-KEM-768 keypair | 32825 cycles | 32868 cycles | 1.00 |
| ML-KEM-768 encaps | 36537 cycles | 36553 cycles | 1.00 |
| ML-KEM-768 decaps | 46903 cycles | 46986 cycles | 1.00 |
| ML-KEM-1024 keypair | 47428 cycles | 47347 cycles | 1.00 |
| ML-KEM-1024 encaps | 53389 cycles | 53344 cycles | 1.00 |
| ML-KEM-1024 decaps | 67299 cycles | 67305 cycles | 1.00 |

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

Graviton2 (no-opt)

| Benchmark suite | Current: b347854 | Previous: 8b27230 | Ratio |
|---|---|---|---|
| ML-KEM-512 keypair | 52833 cycles | 59700 cycles | 0.88 |
| ML-KEM-512 encaps | 64872 cycles | 68295 cycles | 0.95 |
| ML-KEM-512 decaps | 80212 cycles | 87021 cycles | 0.92 |
| ML-KEM-768 keypair | 89059 cycles | 99395 cycles | 0.90 |
| ML-KEM-768 encaps | 105508 cycles | 110571 cycles | 0.95 |
| ML-KEM-768 decaps | 124966 cycles | 135300 cycles | 0.92 |
| ML-KEM-1024 keypair | 135075 cycles | 149084 cycles | 0.91 |
| ML-KEM-1024 encaps | 157260 cycles | 164648 cycles | 0.96 |
| ML-KEM-1024 decaps | 181932 cycles | 195981 cycles | 0.93 |

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

Graviton3 (no-opt)

| Benchmark suite | Current: b347854 | Previous: 8b27230 | Ratio |
|---|---|---|---|
| ML-KEM-512 keypair | 35267 cycles | 39012 cycles | 0.90 |
| ML-KEM-512 encaps | 42996 cycles | 44889 cycles | 0.96 |
| ML-KEM-512 decaps | 52925 cycles | 56724 cycles | 0.93 |
| ML-KEM-768 keypair | 58850 cycles | 64395 cycles | 0.91 |
| ML-KEM-768 encaps | 68965 cycles | 71973 cycles | 0.96 |
| ML-KEM-768 decaps | 82396 cycles | 87835 cycles | 0.94 |
| ML-KEM-1024 keypair | 88836 cycles | 96091 cycles | 0.92 |
| ML-KEM-1024 encaps | 102569 cycles | 106191 cycles | 0.97 |
| ML-KEM-1024 decaps | 119511 cycles | 127263 cycles | 0.94 |

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

Graviton4

| Benchmark suite | Current: b347854 | Previous: 8b27230 | Ratio |
|---|---|---|---|
| ML-KEM-512 keypair | 18051 cycles | 18009 cycles | 1.00 |
| ML-KEM-512 encaps | 21438 cycles | 21445 cycles | 1.00 |
| ML-KEM-512 decaps | 28139 cycles | 28149 cycles | 1.00 |
| ML-KEM-768 keypair | 30998 cycles | 31056 cycles | 1.00 |
| ML-KEM-768 encaps | 34112 cycles | 34013 cycles | 1.00 |
| ML-KEM-768 decaps | 43768 cycles | 43876 cycles | 1.00 |
| ML-KEM-1024 keypair | 44800 cycles | 44862 cycles | 1.00 |
| ML-KEM-1024 encaps | 50321 cycles | 50303 cycles | 1.00 |
| ML-KEM-1024 decaps | 63295 cycles | 63198 cycles | 1.00 |

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

Graviton4 (no-opt)

| Benchmark suite | Current: b347854 | Previous: 8b27230 | Ratio |
|---|---|---|---|
| ML-KEM-512 keypair | 32338 cycles | 35801 cycles | 0.90 |
| ML-KEM-512 encaps | 39020 cycles | 40765 cycles | 0.96 |
| ML-KEM-512 decaps | 48619 cycles | 52101 cycles | 0.93 |
| ML-KEM-768 keypair | 53880 cycles | 59153 cycles | 0.91 |
| ML-KEM-768 encaps | 64004 cycles | 66727 cycles | 0.96 |
| ML-KEM-768 decaps | 75983 cycles | 81269 cycles | 0.93 |
| ML-KEM-1024 keypair | 81936 cycles | 88949 cycles | 0.92 |
| ML-KEM-1024 encaps | 95299 cycles | 98868 cycles | 0.96 |
| ML-KEM-1024 decaps | 110717 cycles | 117718 cycles | 0.94 |

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

SpacemiT K1 8 (Banana Pi F3) benchmarks

| Benchmark suite | Current: b347854 | Previous: 8b27230 | Ratio |
|---|---|---|---|
| ML-KEM-512 keypair | 198776 cycles | 226405 cycles | 0.88 |
| ML-KEM-512 encaps | 257289 cycles | 271229 cycles | 0.95 |
| ML-KEM-512 decaps | 317440 cycles | 345085 cycles | 0.92 |
| ML-KEM-768 keypair | 332946 cycles | 374731 cycles | 0.89 |
| ML-KEM-768 encaps | 411833 cycles | 433325 cycles | 0.95 |
| ML-KEM-768 decaps | 488840 cycles | 531048 cycles | 0.92 |
| ML-KEM-1024 keypair | 502284 cycles | 557388 cycles | 0.90 |
| ML-KEM-1024 encaps | 606019 cycles | 632861 cycles | 0.96 |
| ML-KEM-1024 decaps | 700154 cycles | 755949 cycles | 0.93 |

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

Arm Cortex-A72 (Raspberry Pi 4) benchmarks

| Benchmark suite | Current: b347854 | Previous: 8b27230 | Ratio |
|---|---|---|---|
| ML-KEM-512 keypair | 54787 cycles | 53210 cycles | 1.03 |
| ML-KEM-512 encaps | 62222 cycles | 61267 cycles | 1.02 |
| ML-KEM-512 decaps | 79132 cycles | 77993 cycles | 1.01 |
| ML-KEM-768 keypair | 90084 cycles | 90263 cycles | 1.00 |
| ML-KEM-768 encaps | 98196 cycles | 98074 cycles | 1.00 |
| ML-KEM-768 decaps | 122400 cycles | 121894 cycles | 1.00 |
| ML-KEM-1024 keypair | 134683 cycles | 134857 cycles | 1.00 |
| ML-KEM-1024 encaps | 147356 cycles | 147787 cycles | 1.00 |
| ML-KEM-1024 decaps | 180749 cycles | 180642 cycles | 1.00 |

This comment was automatically generated by workflow using github-action-benchmark.

@oqs-bot oqs-bot left a comment

Arm Cortex-A55 (Snapdragon 888) benchmarks

| Benchmark suite | Current: b347854 | Previous: 8b27230 | Ratio |
|---|---|---|---|
| ML-KEM-512 keypair | 59451 cycles | 59424 cycles | 1.00 |
| ML-KEM-512 encaps | 67046 cycles | 67066 cycles | 1.00 |
| ML-KEM-512 decaps | 86242 cycles | 86211 cycles | 1.00 |
| ML-KEM-768 keypair | 101010 cycles | 100998 cycles | 1.00 |
| ML-KEM-768 encaps | 112058 cycles | 111961 cycles | 1.00 |
| ML-KEM-768 decaps | 139377 cycles | 139534 cycles | 1.00 |
| ML-KEM-1024 keypair | 153449 cycles | 153544 cycles | 1.00 |
| ML-KEM-1024 encaps | 170266 cycles | 172575 cycles | 0.99 |
| ML-KEM-1024 decaps | 207045 cycles | 207810 cycles | 1.00 |

This comment was automatically generated by workflow using github-action-benchmark.

@rod-chapman rod-chapman force-pushed the ntt_layers branch 5 times, most recently from b4f7315 to 54308a0 on February 21, 2025 13:53
NTT functions.

CBMC proofs of these new functions are all TBD.

For now, we call these in a "3,2,1,1" pattern to make sure
there are no unreferenced functions.
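
As a rough illustration of what calling the layers in a "3,2,1,1" pattern amounts to, the sketch below drives the seven forward-NTT layers from a list of merge widths and checks that the result equals seven single-layer passes. It is an assumption-laden stand-in, not the functions added by this commit: `ntt_merged` is a hypothetical generic routine (the commit uses dedicated 2- and 3-layer functions), it works in place rather than holding coefficients in locals, and it uses plain `%` reduction with unscaled twiddles instead of `mlk_fqmul()`.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define N 256  /* coefficients per polynomial */
#define Q 3329 /* ML-KEM modulus */

static int16_t zetas[128]; /* hypothetical table, filled by init_zetas() */

static unsigned bitrev7(unsigned x)
{
  unsigned r = 0;
  for (int i = 0; i < 7; i++)
  {
    r = (r << 1) | ((x >> i) & 1u);
  }
  return r;
}

/* zetas[k] = 17^bitrev7(k) mod Q, stored unscaled (no Montgomery factor). */
static void init_zetas(void)
{
  for (unsigned k = 0; k < 128; k++)
  {
    int32_t z = 1;
    for (unsigned e = bitrev7(k); e > 0; e--)
    {
      z = (z * 17) % Q;
    }
    zetas[k] = (int16_t)z;
  }
}

/* Cooley-Tukey butterfly on values kept in [0, Q). */
static void butterfly(int16_t *a, int16_t *b, int16_t zeta)
{
  int32_t t = ((int32_t)zeta * *b) % Q;
  int32_t lo = *a;
  *a = (int16_t)((lo + t) % Q);
  *b = (int16_t)((lo - t + Q) % Q);
}

/* Apply layers `layer` .. `layer + m - 1`, finishing each group of 2^m
 * coefficients before moving on to the next group. */
static void ntt_merged(int16_t r[N], unsigned layer, unsigned m)
{
  unsigned block = N >> (layer - 1); /* span of one block at `layer`  */
  unsigned stride = block >> m;      /* spacing inside one 2^m group  */
  assert(layer >= 1 && m >= 1 && layer + m - 1 <= 7);
  for (unsigned start = 0; start < N; start += block)
  {
    for (unsigned j = start; j < start + stride; j++)
    {
      for (unsigned s = 0; s < m; s++)
      {
        unsigned len = N >> (layer + s);
        unsigned base = 1u << (layer + s - 1);
        for (unsigned p = j; p < start + block; p += stride)
        {
          if ((p / len) % 2 == 0) /* p is the lower half of its pair */
          {
            butterfly(&r[p], &r[p + len], zetas[base + p / (2 * len)]);
          }
        }
      }
    }
  }
}

/* Run all 7 layers using the given merge widths, e.g. {3, 2, 1, 1}. */
static void ntt_schedule(int16_t r[N], const unsigned *merge, unsigned count)
{
  unsigned layer = 1;
  for (unsigned i = 0; i < count; i++)
  {
    ntt_merged(r, layer, merge[i]);
    layer += merge[i];
  }
  assert(layer == 8); /* the widths must add up to 7 layers */
}

/* Returns 1 if the 3,2,1,1 schedule equals seven single-layer passes. */
static int schedule_3211_matches_single_layers(void)
{
  static const unsigned s3211[] = {3, 2, 1, 1};
  static const unsigned s1[] = {1, 1, 1, 1, 1, 1, 1};
  int16_t a[N], b[N];
  init_zetas();
  for (unsigned i = 0; i < N; i++)
  {
    a[i] = b[i] = (int16_t)((i * 7u + 3u) % Q);
  }
  ntt_schedule(a, s3211, 4);
  ntt_schedule(b, s1, 7);
  return memcmp(a, b, sizeof a) == 0;
}
```

Other strategies can be tried by changing the width list (for example `{2, 2, 2, 1}`), as long as the widths sum to 7.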

Signed-off-by: Rod Chapman <rodchap@amazon.com>

Correct one call from fqmul() to mlk_fqmul()

Signed-off-by: Rod Chapman <rodchap@amazon.com>
Labels
aarch64 benchmark this PR should be benchmarked in CI