
Commit a155765
Add vectorization sample.
This adds a new generic vectorization sample to replace hello-neon. Most importantly, it covers non-Neon options for SIMD. One of the useful things it shows, for example, is that there's actually no reason to write SIMD code the way that hello-neon does any more. This also solves a much simpler problem (small matrix multiplication), which makes it easier to see how to deal with the SIMD features rather than figuring out what a FIR filter is. Finally, this sample benchmarks each of the implementations so it's obvious what is and isn't worth doing. I was sort of surprised that auto-vectorization didn't do better, and was pleased to learn that there's no reason at all to write Neon intrinsics. I'll delete hello-neon after this merges and I've fixed up the doc links. #1011
1 parent a5f12fb commit a155765

36 files changed: +1351 -0 lines

gradle/libs.versions.toml

Lines changed: 15 additions & 0 deletions
```diff
@@ -14,9 +14,13 @@ material = "1.12.0"
 jetbrainsKotlinJvm = "1.7.21"
 oboe = "1.8.1"
 
+activityCompose = "1.9.0"
+composeBom = "2023.08.00"
+coreKtx = "1.13.1"
 curl = "7.79.1-beta-1"
 googletest = "1.11.0-beta-1"
 jsoncpp = "1.9.5-beta-1"
+lifecycleRuntimeKtx = "2.7.0"
 openssl = "1.1.1q-beta-1"
 
 [libraries]
@@ -40,6 +44,17 @@ openssl = { group = "com.android.ndk.thirdparty", name = "openssl", version.ref
 
 # build-logic dependencies
 android-gradlePlugin = { group = "com.android.tools.build", name = "gradle", version.ref = "agp" }
+androidx-activity-compose = { group = "androidx.activity", name = "activity-compose", version.ref = "activityCompose" }
+androidx-compose-bom = { group = "androidx.compose", name = "compose-bom", version.ref = "composeBom" }
+androidx-core-ktx = { group = "androidx.core", name = "core-ktx", version.ref = "coreKtx" }
+androidx-lifecycle-runtime-ktx = { group = "androidx.lifecycle", name = "lifecycle-runtime-ktx", version.ref = "lifecycleRuntimeKtx" }
+androidx-material3 = { group = "androidx.compose.material3", name = "material3" }
+androidx-ui = { group = "androidx.compose.ui", name = "ui" }
+androidx-ui-graphics = { group = "androidx.compose.ui", name = "ui-graphics" }
+androidx-ui-test-junit4 = { group = "androidx.compose.ui", name = "ui-test-junit4" }
+androidx-ui-test-manifest = { group = "androidx.compose.ui", name = "ui-test-manifest" }
+androidx-ui-tooling = { group = "androidx.compose.ui", name = "ui-tooling" }
+androidx-ui-tooling-preview = { group = "androidx.compose.ui", name = "ui-tooling-preview" }
 
 [plugins]
 android-application = { id = "com.android.application", version.ref = "agp" }
```

settings.gradle

Lines changed: 1 addition & 0 deletions
```diff
@@ -63,3 +63,4 @@ include(":teapots:image-decoder")
 include(":teapots:more-teapots")
 include(":teapots:textured-teapot")
 include(":unit-test:app")
+include(":vectorization")
```

vectorization/.gitignore

Lines changed: 1 addition & 0 deletions
```diff
@@ -0,0 +1 @@
+/build
```

vectorization/README.md

Lines changed: 182 additions & 0 deletions
# Vectorization

This sample shows how to implement matrix multiplication using various
vectorization approaches.

Note: You should not reuse this matrix library in your application. It was not
written to be useful beyond the scope of this demo. If you're looking for a
matrix library, you probably want [GLM] for graphics applications, or a linear
algebra library such as BLAS for compute applications.

The sample app will benchmark each implementation and display the average run
time over 1,000,000 runs. The goal of this sample is to illustrate the
trade-offs of each implementation in terms of flexibility, readability, and
performance.

Given the relatively small problem size used here (4x4 matrices and vec4s), the
best performing implementations in this sample are the ones that can best
improve over the naive implementation without large set up costs. You should not
take the results of this sample as authoritative: if performance is important to
you, you **must** benchmark your code with workloads realistic for your app.

If you're not familiar with it, [Godbolt] is an invaluable tool for examining
compiler optimizer behavior. You can also use `$NDK_BIN/clang -S -O2 -o -` from
the command line for a local workflow.
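
For example (illustrative: `$NDK_BIN` stands for the `bin` directory of the
NDK's clang toolchain, and `matrix.cpp` is whatever file you want to inspect):

```sh
# Print the optimized arm64 assembly to stdout so you can see what, if
# anything, was vectorized.
$NDK_BIN/clang++ --target=aarch64-linux-android24 -O2 -S -o - matrix.cpp
```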

## Implementations

This sample contains the following implementations. The trade-offs of each are
discussed briefly, but as mentioned above, you should not rely on the
performance results measured here to make a decision for your app.

### Auto-vectorization

See [auto_vectorization.h] for the implementation.

This implementation is written in generic C++ and contains no explicit SIMD. The
only vectorization that will be performed is Clang's auto-vectorization. This
makes for the most portable and readable code, but at the cost of performance.

See https://llvm.org/docs/Vectorizers.html for Clang's docs about
auto-vectorization.

### std::simd

This isn't actually available yet. It's an experimental part of the C++ standard
and is in development in libc++, but NDK r27 happened to catch it right in the
middle of a rewrite, so it's not currently usable.

See https://en.cppreference.com/w/cpp/experimental/simd/simd.
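
As a rough sketch of the Parallelism TS v2 API (this does not build with the
NDK today, and the names may change before standardization; `DotFour` is a
made-up function):

```cpp
#include <experimental/simd>

namespace stdx = std::experimental;

// Dot product of two 4-element vectors using the experimental simd API.
float DotFour(const float* x, const float* y) {
  stdx::fixed_size_simd<float, 4> a(x, stdx::element_aligned);
  stdx::fixed_size_simd<float, 4> b(y, stdx::element_aligned);
  return stdx::reduce(a * b);  // horizontal sum of the lane-wise products
}
```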

### Clang vectors

See [clang_vector.h] for the implementation.

This implementation uses Clang's generic vector types. This code is mostly as
portable as the auto-vectorization implementation, with the only caveat being
that it is limited by the width of the vector registers for the target hardware.
To deal with problems that don't fit in the target's vector registers, you would
need to either alter the algorithm to tile the operations, or use Scalable
Vector Extensions (AKA [SVE]).

The benefit of that portability trade-off, however, is that this does outperform
the auto-vectorization implementation.

See
https://clang.llvm.org/docs/LanguageExtensions.html#vectors-and-extended-vectors.
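
As a minimal sketch (not the sample's code; `MulAdd` is a made-up function),
the vector type and its operators look like this:

```cpp
// Clang's extended vector types: four floats, sized to fill a 128-bit SIMD
// register on most targets.
typedef float float4 __attribute__((ext_vector_type(4)));

// Lane-wise operators compile directly to SIMD instructions (Neon on arm64,
// SSE on x86-64); no intrinsics are required.
float4 MulAdd(float4 a, float4 b, float4 c) {
  return a * b + c;
}
```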

### Clang matrices

See [matrix.h] for the implementation. This is the default implementation for
`Matrix::operator*`, so unlike the others, that file contains the rest of the
`Matrix` class as well.

This implementation uses Clang's built-in matrix type. This is an experimental
feature in Clang, but it has the simplest code (because some kind Clang person
wrote the hard part) and performs the best by a wide margin. There are
implementation-defined limits on the size of the matrix, but within those limits
the code is as portable as the auto-vectorization implementation. The docs say
the feature is still under development and subject to change, so be wary of
using this in production, and definitely don't use these types as part of your
ABI.

See https://clang.llvm.org/docs/LanguageExtensions.html#matrix-types for more
details.
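
For a feel for the feature, here is a sketch (not the sample's code; `Mat4` and
`Multiply` are illustrative names) using the `-fenable-matrix` flag this
sample's build already passes:

```cpp
// Clang's experimental matrix type: a 4x4 matrix of floats.
typedef float Mat4 __attribute__((matrix_type(4, 4)));

Mat4 Multiply(Mat4 lhs, Mat4 rhs) {
  return lhs * rhs;  // the compiler emits the entire matrix multiply
}
```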

### OpenMP SIMD

See [omp_simd.h] for the implementation.

This implementation uses OpenMP's SIMD directive. For some reason this
under-performs even the auto-vectorized implementation. There are a lot of
additional specifiers that can be added to the simd directive that might
improve this implementation. Patches welcome :)

See https://www.openmp.org/spec-html/5.0/openmpsu42.html for more information.
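
The directive itself is just an annotation on the loop. A minimal sketch (not
the sample's code; `Scale` is a made-up function):

```cpp
// Requests vectorization of the loop. Requires -fopenmp (or at least
// -fopenmp-simd) when compiling.
void Scale(float* out, const float* in, float s, int n) {
#pragma omp simd
  for (int i = 0; i < n; i++) {
    out[i] = in[i] * s;
  }
}
```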

## Alternatives not shown here

There are other approaches that could be used that aren't shown here.

### Neon

A Neon implementation would be nearly identical to the one in [clang_vector.h].
The only difference is how the vector type is specified. A lot of older Neon
sample code looks substantially different because it uses the Neon intrinsics
defined in `arm_neon.h`, but if you look at how the intrinsics in that file are
defined, all they actually do (on a little-endian system, and Android does not
support big-endian, so we can ignore that caveat) is use the `*` operator and
leave the correct instruction selection up to Clang.

In other words, you should probably never use the Neon-specific approach. The
generated code should be identical to code written with Clang's arch-generic
vectors. If you rewrite the [clang_vector.h] implementation to use Neon's
`float32x4_t` instead of the Clang vector, the results are identical.
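
For example, here is the same multiply-add sketch from the Clang vectors
section written against the Neon type (illustrative, not the sample's code);
the generated code should match the `float4` version:

```cpp
#include <arm_neon.h>

// No vmulq_f32/vaddq_f32 intrinsics needed: Clang defines the operators for
// float32x4_t and selects the instructions itself.
float32x4_t MulAdd(float32x4_t a, float32x4_t b, float32x4_t c) {
  return a * b + c;
}
```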

### SVE

[SVE] scales SIMD to arbitrarily sized vectors, and its C extensions, while
less concise than the code needed for a fixed vector size like we have here,
handle windowing of the data to fit the hardware vector size for you. For
problems like the small matrix multiply we do here, it's overkill. For
portability across the various vector widths of the Arm CPUs that support SVE,
it can reduce the difficulty of writing SIMD code.
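
To give a flavor of that windowing, here is a sketch (not part of this sample;
assumes an SVE-enabled target and `MultiplyArrays` is a made-up function) of a
vector-length-agnostic element-wise multiply using the SVE C intrinsics:

```cpp
#include <arm_sve.h>

// Works on any hardware vector width: the predicate masks off the tail
// elements in the final iteration.
void MultiplyArrays(const float* a, const float* b, float* out, int64_t n) {
  for (int64_t i = 0; i < n; i += svcntw()) {
    svbool_t pg = svwhilelt_b32(i, n);
    svfloat32_t va = svld1(pg, &a[i]);
    svfloat32_t vb = svld1(pg, &b[i]);
    svst1(pg, &out[i], svmul_x(pg, va, vb));
  }
}
```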

### GPU acceleration

GPU acceleration is a better fit for large data sets. That approach isn't shown
here because it requires substantially more code to set up the GPU for this
computation, and our data size is so small that the cost of GPU initialization
and streaming the data to the GPU is likely to make it a net loss. If you want
to learn more about GPU compute, see https://vulkan-tutorial.com/Compute_Shader,
https://www.khronos.org/opengl/wiki/Compute_Shader, and
https://www.khronos.org/opencl/ (while OpenCL is not guaranteed to be available
for all Android devices, it is a very common OEM extension).

## Function multi-versioning

There are two compiler attributes that can be helpful for targeting specific
hardware features when optimizing hot code paths: [target] and [target_clones],
both of which may be referred to as "function multiversioning" or "FMV". Each
solves a slightly different but related problem.

The `target` attribute makes it easier to write multiple implementations of a
function that should be selected based on the runtime hardware. If benchmarking
shows that one implementation performs better on armv8.2 and a different
implementation performs better on armv8 (see the docs for more details on
specific targeting capabilities), you can write the function twice, annotate
each with the appropriate `__attribute__((target(...)))` tag, and the compiler
will auto-generate the code to select the best-fitting implementation at runtime
(it uses ifuncs under the hood, so the branch is resolved once at library load
time rather than on each call).

The `target_clones` attribute, on the other hand, allows you to write the
function once but instruct the compiler to generate multiple variants of the
function, one for each requested target. This means that, for example, if you've
requested both `default` and `armv8.2`, the compiler will generate a default
implementation compatible with all Android devices, as well as a second
implementation that uses instructions available in armv8.2 but not available in
the base armv8 ABI. As with the `target` attribute, Clang will automatically
select the best-fitting implementation at runtime. Using `target_clones` is
equivalent to using `target` with an identical function body for each target.
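
A sketch of the `target_clones` flavor (the function is hypothetical, and the
set of accepted feature names varies by Clang version; check the
[target_clones] docs for what your toolchain supports):

```cpp
#include <stddef.h>
#include <stdint.h>

// One source body, several compiled variants. The loader resolves the ifunc
// to the best match for the running CPU once, at library load time.
__attribute__((target_clones("default", "dotprod")))
int32_t SumBytes(const int8_t* v, size_t n) {
  int32_t sum = 0;
  for (size_t i = 0; i < n; i++) sum += v[i];
  return sum;
}
```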

Note that with both of these approaches, testing becomes more difficult because
you will need a greater variety of hardware to test each code path. If you're
already doing fine-grained targeting like this, that isn't a new problem, and
using one or both of these attributes may help you simplify your implementation.

Neither of these techniques is shown in this sample. We don't have access to
enough hardware to benchmark or verify multiple implementations, and (as of NDK
r27, at least) Clang doesn't support `target_clones` on templated functions.

[auto_vectorization.h]: src/main/cpp/auto_vectorization.h
[clang_vector.h]: src/main/cpp/clang_vector.h
[GLM]: https://github.com/g-truc/glm
[Godbolt]: https://godbolt.org/
[matrix.h]: src/main/cpp/matrix.h
[neon.h]: src/main/cpp/neon.h
[omp_simd.h]: src/main/cpp/omp_simd.h
[SVE]: https://developer.arm.com/Architectures/Scalable%20Vector%20Extensions
[target_clones]: https://clang.llvm.org/docs/AttributeReference.html#target-clones
[target]: https://clang.llvm.org/docs/AttributeReference.html#target

vectorization/build.gradle.kts

Lines changed: 45 additions & 0 deletions
```kotlin
plugins {
    id("ndksamples.android.application")
    id("ndksamples.android.kotlin")
}

android {
    namespace = "com.android.ndk.samples.vectorization"

    defaultConfig {
        applicationId = "com.android.ndk.samples.vectorization"

        vectorDrawables {
            useSupportLibrary = true
        }
    }

    externalNativeBuild {
        cmake {
            path = file("src/main/cpp/CMakeLists.txt")
        }
    }

    buildFeatures {
        compose = true
        prefab = true
    }

    composeOptions {
        kotlinCompilerExtensionVersion = "1.5.1"
    }
}

dependencies {
    implementation(project(":base"))
    implementation(libs.androidx.core.ktx)
    implementation(libs.androidx.lifecycle.runtime.ktx)
    implementation(libs.androidx.activity.compose)
    implementation(platform(libs.androidx.compose.bom))
    implementation(libs.androidx.ui)
    implementation(libs.androidx.ui.graphics)
    implementation(libs.androidx.ui.tooling.preview)
    implementation(libs.androidx.material3)
    debugImplementation(libs.androidx.ui.tooling)
    debugImplementation(libs.androidx.ui.test.manifest)
}
```
vectorization/src/main/AndroidManifest.xml

Lines changed: 24 additions & 0 deletions

```xml
<?xml version="1.0" encoding="utf-8"?>
<manifest xmlns:android="http://schemas.android.com/apk/res/android">

    <application
        android:allowBackup="true"
        android:icon="@mipmap/ic_launcher"
        android:label="@string/app_name"
        android:roundIcon="@mipmap/ic_launcher_round"
        android:supportsRtl="true"
        android:theme="@style/Theme.NDKSamples">
        <activity
            android:name=".VectorizationActivity"
            android:exported="true"
            android:label="@string/app_name"
            android:theme="@style/Theme.NDKSamples">
            <intent-filter>
                <action android:name="android.intent.action.MAIN" />

                <category android:name="android.intent.category.LAUNCHER" />
            </intent-filter>
        </activity>
    </application>

</manifest>
```
vectorization/src/main/cpp/CMakeLists.txt

Lines changed: 32 additions & 0 deletions

```cmake
cmake_minimum_required(VERSION 3.22.1)
project(Vectorization LANGUAGES CXX)

add_compile_options(-Wall -Wextra -Werror)

find_package(base REQUIRED CONFIG)

add_library(app
    SHARED
    benchmark.cpp
    jni.cpp
)

target_compile_features(app PUBLIC cxx_std_23)
target_compile_options(app PUBLIC -fenable-matrix -fopenmp)

target_link_libraries(app
    PRIVATE
    base::base
    log
)

target_link_options(app
    PRIVATE
    -flto
    -Wl,--version-script,${CMAKE_SOURCE_DIR}/libapp.map.txt
)

set_target_properties(app
    PROPERTIES
    LINK_DEPENDS ${CMAKE_SOURCE_DIR}/libapp.map.txt
)
```
vectorization/src/main/cpp/auto_vectorization.h

Lines changed: 68 additions & 0 deletions

```cpp
/*
 * Copyright (C) 2024 The Android Open Source Project
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *      http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

#pragma once

#include <stddef.h>
#include <stdint.h>

#include "matrix.h"

namespace samples::vectorization {

/**
 * Multiplies two compatible matrices and returns the result.
 *
 * @tparam T The type of each matrix cell.
 * @tparam M The number of rows in the left operand and the result.
 * @tparam N The number of columns in the left operand, and the rows in the
 * right operand.
 * @tparam P The number of columns in the right operand and the result.
 * @param lhs The left operand.
 * @param rhs The right operand.
 * @return The result of lhs * rhs.
 */
template <typename T, size_t M, size_t N, size_t P>
Matrix<M, P, T> MultiplyWithAutoVectorization(const Matrix<M, N, T>& lhs,
                                              const Matrix<N, P, T>& rhs) {
  // This may look like an unfair benchmark because this implementation uses a
  // less vector-friendly algorithm than the others; however, using the
  // vector-friendly algorithm here actually made performance worse.
  //
  // This is a good illustration of why it's important to benchmark your own
  // code and not rely on what someone else tells you about what works best: it
  // depends.
  //
  // It's probably also worth mentioning that if what you need is *consistent*
  // performance across compiler versions, the only real choice you have is
  // writing assembly. Even the instruction intrinsics (at least for Neon) are
  // subject to the compiler's instruction selection. That will be overkill for
  // most users, since it's substantially more difficult to write and maintain,
  // but it is how you'll see some code bases deal with this (codecs in
  // particular are willing to make that trade-off).
  Matrix<M, P, T> result;
  for (auto i = 0U; i < M; i++) {
    for (auto j = 0U; j < P; j++) {
      T sum = {};
      for (auto k = 0U; k < N; k++) {
        sum += lhs[i, k] * rhs[k, j];  // C++23 multidimensional subscript
      }
      result[i, j] = sum;
    }
  }
  return result;
}

}  // namespace samples::vectorization
```
