AVX2 core for OGR-NG #12
…th string postfixes.
OGRNG Alignment changed to 32
Core added for 64 bit builds and selected as default for CPUs supporting AVX2
Tested on Haswell:
|
I just went to change the code to force selection of SSE2 for Haswell, but it appears that's already in place. I also noticed that the code puts Skylake into the same architecture group as Haswell, so it will also use SSE2. I expect Skylake will perform similarly to Kaby Lake, as they are supposed to share the same microarchitecture and so should have similar instruction costs. If someone shows that the core selection for Skylake should be AVX2, we can change its grouping. |
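The grouping described above can be pictured as a two-step lookup: a detected microarchitecture maps to an architecture group, and the group decides whether the AVX2 core is overridden. The Python sketch below only models that decision. The names `ARCH_GROUP`, `SSE2_FORCED_GROUPS`, and `select_core` are hypothetical and not taken from the client source, and the sketch ignores the other cores (for example `cj-asm-sse2-lzcnt`) for simplicity.

```python
# Illustrative model of the core-selection rule discussed in this thread.
# All identifiers are hypothetical; the real client does this in its own
# core-selection code, which is not reproduced here.

# Microarchitecture -> architecture group. Per the comment above, Skylake
# currently shares a group with Haswell.
ARCH_GROUP = {
    "haswell": "haswell",
    "skylake": "haswell",
    "kaby lake": "kaby lake",
}

# Groups for which SSE2 is forced even though AVX2 is available.
SSE2_FORCED_GROUPS = {"haswell"}


def select_core(microarch: str, has_avx2: bool) -> str:
    """Return the name of the core this model would pick as default."""
    group = ARCH_GROUP.get(microarch.lower())
    # AVX2 is the default whenever the CPU supports it, unless the
    # architecture group has been explicitly pinned to SSE2.
    if has_avx2 and group not in SSE2_FORCED_GROUPS:
        return "cj-asm-avx2"
    return "cj-asm-sse2"


if __name__ == "__main__":
    for arch in ("Haswell", "Skylake", "Kaby Lake"):
        print(arch, "->", select_core(arch, has_avx2=True))
```

Under this model, moving Skylake to AVX2 would be a one-line change to the `ARCH_GROUP` entry, which mirrors the "change its grouping" remark above.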
Running on: Compiled with:
They are neck-and-neck on my chip. SSE2 wins some, AVX2 wins others.
A sample where avx2 wins:
Power sample for sse2:
Power sample for sse2-lzcnt
Power sample for avx2:
|
FWIW, Broadwell behaves like Haswell (no surprise there). Running on a nuc5i5 (i7-5557U CPU @ 3.10GHz)
|
I almost forgot, I had to make a minor change along the lines of yours to get this to compile on the above platform:
|
Thanks for the benchmark reports. It seems that the AVX2 core is only worthwhile on Kaby Lake and (probably) later. |
@craig-johnston Did you investigate
http://www.agner.org/optimize/instruction_tables.pdf |
I didn't. The code is the same generic function prologue used for most of the OGR-NG asm cores. It's not in the hot path, so the performance has very little impact when compared to the body, which will loop many thousands of times for each call to the function. |
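The argument that the prologue does not matter can be made concrete with a rough estimate of its share of the total cycles per call. The sketch below is only an illustration: the function name `prologue_fraction` is hypothetical, and the cycle counts are assumed round numbers, not measurements of the OGR-NG cores.

```python
def prologue_fraction(prologue_cycles: float,
                      body_cycles_per_iter: float,
                      iterations: int) -> float:
    """Fraction of one call's cycles spent in the prologue."""
    total = prologue_cycles + body_cycles_per_iter * iterations
    return prologue_cycles / total


if __name__ == "__main__":
    # Assumed, illustrative figures only: a 50-cycle prologue and a
    # 10-cycle loop body executed 5,000 times per call.
    share = prologue_fraction(50, 10, 5000)
    print(f"prologue share of the call: {share:.4%}")
```

With figures of this order the prologue accounts for about a tenth of a percent of the call, so even halving its cost would be lost in benchmark noise.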
I added an AVX2 implementation based on my SSE2 implementation for 64 bit clients. There is a minor, but reliable, improvement when running on a Kaby Lake processor. I expect that the AVX2 core will scale better than the SSE2 implementation as Intel releases new processors and improves the performance of AVX2.
I have only tested compilation using VS2015, but have updated the Linux Makefile (hopefully there will be no problem).
I have assumed that if AVX2 is available, it is preferable to use this core. It would be worth checking that assumption on some of the older architectures that support AVX2 (e.g. Haswell).
This includes the VS2015 and Kaby Lake branches I've sent pull requests for.
Let me know if there are changes you'd like me to make.
Craig.
Benchmark results on an Intel(R) Core(TM) i5-7600 CPU @ 3.50GHz:
[Apr 28 08:04:07 UTC] Automatic processor type detection found
an Intel Core iX-7xxx (Kaby Lake) processor.
[Apr 28 08:04:07 UTC] OGR-NG: using core #0 (FLEGE-64 2.0).
[Apr 28 08:04:27 UTC] OGR-NG: Benchmark for core #0 (FLEGE-64 2.0)
0.00:00:17.06 [57,077,774 nodes/sec]
[Apr 28 08:04:27 UTC] OGR-NG: using core #1 (cj-asm-generic).
[Apr 28 08:04:46 UTC] OGR-NG: Benchmark for core #1 (cj-asm-generic)
0.00:00:16.92 [63,158,500 nodes/sec]
[Apr 28 08:04:46 UTC] OGR-NG: using core #2 (cj-asm-sse2).
[Apr 28 08:05:06 UTC] OGR-NG: Benchmark for core #2 (cj-asm-sse2)
0.00:00:17.07 [82,854,657 nodes/sec]
[Apr 28 08:05:06 UTC] OGR-NG: using core #3 (cj-asm-sse2-lzcnt).
[Apr 28 08:05:25 UTC] OGR-NG: Benchmark for core #3 (cj-asm-sse2-lzcnt)
0.00:00:16.92 [81,372,634 nodes/sec]
[Apr 28 08:05:25 UTC] OGR-NG: using core #4 (cj-asm-avx2).
[Apr 28 08:05:44 UTC] OGR-NG: Benchmark for core #4 (cj-asm-avx2)
0.00:00:16.92 [83,726,597 nodes/sec]
[Apr 28 08:05:44 UTC] OGR-NG benchmark summary :
Default core : #4 (cj-asm-avx2) 83,726,597 nodes/sec
Fastest core : #4 (cj-asm-avx2) 83,726,597 nodes/sec
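
To compare cores across machines it can help to reduce a log like the one above to relative figures. The following Python sketch parses the benchmark lines and reports each core against `cj-asm-sse2`. The regular expression and the helper name `parse_rates` are my own, written against the log format shown here, and may need adjusting for other client versions.

```python
import re

# Benchmark lines copied from the log above.
LOG = """\
[Apr 28 08:04:27 UTC] OGR-NG: Benchmark for core #0 (FLEGE-64 2.0)
0.00:00:17.06 [57,077,774 nodes/sec]
[Apr 28 08:04:46 UTC] OGR-NG: Benchmark for core #1 (cj-asm-generic)
0.00:00:16.92 [63,158,500 nodes/sec]
[Apr 28 08:05:06 UTC] OGR-NG: Benchmark for core #2 (cj-asm-sse2)
0.00:00:17.07 [82,854,657 nodes/sec]
[Apr 28 08:05:25 UTC] OGR-NG: Benchmark for core #3 (cj-asm-sse2-lzcnt)
0.00:00:16.92 [81,372,634 nodes/sec]
[Apr 28 08:05:44 UTC] OGR-NG: Benchmark for core #4 (cj-asm-avx2)
0.00:00:16.92 [83,726,597 nodes/sec]
"""

# Core number, core name, then the nodes/sec figure on the following line.
PATTERN = re.compile(
    r"Benchmark for core #(\d+) \(([^)]+)\)\s+\S+ \[([\d,]+) nodes/sec\]"
)


def parse_rates(log: str) -> dict:
    """Map core name -> nodes/sec as an integer."""
    return {
        name: int(rate.replace(",", ""))
        for _num, name, rate in PATTERN.findall(log)
    }


rates = parse_rates(LOG)
baseline = rates["cj-asm-sse2"]

if __name__ == "__main__":
    for name, rate in rates.items():
        delta = 100.0 * (rate / baseline - 1.0)
        print(f"{name:20s} {rate:>12,d}  {delta:+6.2f}% vs cj-asm-sse2")
```

On these figures the AVX2 core comes out roughly 1% ahead of the SSE2 core, which matches the "minor" improvement described in the pull request text.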