
Commit 288ffc8

Committed Feb 27, 2025
Update documentation
1 parent ba15126 commit 288ffc8

4 files changed

+246
-19
lines changed


Diff for: ‎docs/docs/benchmarking.md

+38
@@ -0,0 +1,38 @@
1+
# Benchmarking DFT
2+
3+
Accurate DFT benchmarking requires careful control of optimizations, CPU architecture, and compiler behavior. Follow these guidelines to ensure reliable performance measurements.
4+
5+
> [!note]
6+
> A robust FFT benchmark suite implementing all these techniques is published at https://github.com/kfrlib/fft-benchmark
7+
8+
- Ensure that the optimized version of each library is used. If the vendor provides prebuilt binaries, use them.
9+
- For KFR, the official binaries can be found at: https://github.com/kfrlib/kfr/releases.
10+
- To verify that KFR is optimized for maximum performance, call:
11+
12+
```c++
13+
library_version()
14+
```
15+
Example output:
16+
```
17+
KFR 6.1.1 optimized sse2 [sse2, sse41, avx, avx2, avx512] 64-bit (clang-msvc-19.1.0/windows) +in +ve
18+
```
19+
The output must include the `optimized` flag and must not contain the `debug` flag.
20+
21+
- For libraries that support dynamic CPU dispatch, ensure that the best available architecture for your CPU is selected at runtime. Refer to the library documentation to learn how to verify this.
22+
- For KFR, call:
23+
24+
```c++
25+
cpu_runtime()
26+
```
27+
This function returns the selected architecture, such as `avx2`, `avx512`, or `neon`/`neon64` (for ARM).
28+
29+
- Ensure that no emulation is involved. For example, use native `arm64` binaries for Apple M-series CPUs.
30+
31+
- Exclude plan creation from the time measurements.
32+
33+
- Ensure that the compiler does not optimize out the DFT code. Add code that utilizes the output data in some way to prevent the compiler from optimizing away the computation.
34+
35+
- Perform multiple invocations to obtain reliable results. A few seconds of execution time is the minimum requirement for accurate measurements.
36+
37+
- Use the median or minimum of all measured execution times rather than the simple mean, as this better protects against unexpected spikes and benchmarking noise.
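The last four guidelines (setup outside the timed region, consuming the output, repeated runs, median/minimum statistics) can be sketched as a small harness. This is a minimal illustration, not the code used by the fft-benchmark suite: `naive_dft`, `bench_result`, and `run_benchmark` are hypothetical names, and the naive O(N²) transform is only a stand-in for executing a pre-built plan of the library being measured.

```c++
#include <algorithm>
#include <chrono>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<double>;

// Hypothetical stand-in for the transform under test (naive O(N^2) DFT).
// Replace with a call that executes an already created plan.
void naive_dft(std::vector<cplx>& out, const std::vector<cplx>& in)
{
    const std::size_t n = in.size();
    const double pi     = std::acos(-1.0);
    for (std::size_t k = 0; k < n; ++k)
    {
        cplx sum(0.0, 0.0);
        for (std::size_t j = 0; j < n; ++j)
            sum += in[j] * std::polar(1.0, -2.0 * pi * double((k * j) % n) / double(n));
        out[k] = sum;
    }
}

struct bench_result
{
    double median_seconds;
    double min_seconds;
    double checksum; // derived from the output so the work cannot be discarded
};

// Times `runs` invocations (size >= 1, runs >= 1). All setup (buffers, and plan
// creation for a real library) happens before the timed region.
bench_result run_benchmark(std::size_t size, std::size_t runs)
{
    std::vector<cplx> in(size), out(size);
    for (std::size_t i = 0; i < size; ++i)
        in[i] = cplx(double(i % 7), double(i % 3));

    std::vector<double> times;
    times.reserve(runs);
    volatile double sink = 0.0; // volatile store prevents removal of the computation
    for (std::size_t r = 0; r < runs; ++r)
    {
        const auto t0 = std::chrono::steady_clock::now();
        naive_dft(out, in);
        const auto t1 = std::chrono::steady_clock::now();
        sink = sink + out[r % size].real(); // use the output outside the timed region
        times.push_back(std::chrono::duration<double>(t1 - t0).count());
    }
    std::sort(times.begin(), times.end());
    return { times[times.size() / 2], times.front(), sink };
}
```

In a real measurement, `runs` would be chosen so that the total execution time reaches at least a few seconds, as recommended above.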
38+

Diff for: ‎docs/docs/src.md

+52-1
@@ -1,4 +1,55 @@
11
# How to do Sample Rate Conversion
22

3+
## How to apply Sample Rate Conversion to a continuous signal?
34

4-
[See also a gallery with results of applying various SRC presets](src_gallery.md)
5+
For a continuous signal, the same instance of the `samplerate_converter` class should be used across all subsequent calls, rather than creating a new instance for each fragment. In the case of stereo audio, two instances (one per channel) are required.
6+
7+
The `samplerate_converter` class supports both `push` and `pull` methods for handling data flow.
8+
9+
- **`push`**: Input data of a fixed size is provided, and all available output data is received.
10+
**Example**: Processing audio from a microphone, where the sound device sends data in fixed-size chunks.
11+
12+
- **`pull`**: An output buffer of a fixed size is provided, and the necessary amount of input data is processed to generate the required output.
13+
**Example**: Streaming audio at a different sample rate to a sound device, where a specific number of output samples must be generated to fill the device buffer.
14+
15+
Let’s consider the case of resampling 44.1 kHz to 96 kHz with an output buffer of 512 samples (`pull`).
16+
The corresponding input size is 512 × 44100 / 96000 = 235.2 samples, which is not an integer.
17+
18+
The `samplerate_converter` class processes signals split into buffers of different sizes by maintaining an internal state.
19+
20+
To determine the required input buffer size for the next call to `process`, call `input_size_for_output` with the desired output buffer length. In the 44.1 kHz to 96 kHz scenario, it returns either 236 or 235 samples.
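The alternation between 235 and 236 follows from the rate ratio alone. The sketch below reproduces only that arithmetic. It is not KFR's implementation, `input_sizes` is a hypothetical helper, and the exact sequence reported by `input_size_for_output` depends on the converter's internal state.

```c++
#include <cstdint>
#include <vector>

// Illustrative arithmetic only; not KFR's implementation.
// For each output block of `output_block` samples, computes how many whole input
// samples must be consumed so that the running totals follow the exact rate ratio.
std::vector<std::int64_t> input_sizes(std::int64_t input_rate, std::int64_t output_rate,
                                      std::int64_t output_block, int blocks)
{
    std::vector<std::int64_t> sizes;
    std::int64_t consumed = 0; // input samples consumed so far
    for (int b = 1; b <= blocks; ++b)
    {
        // ceil(b * output_block * input_rate / output_rate) in integer arithmetic
        const std::int64_t needed =
            (b * output_block * input_rate + output_rate - 1) / output_rate;
        sizes.push_back(needed - consumed);
        consumed = needed;
    }
    return sizes;
}
```

For `input_sizes(44100, 96000, 512, 5)` this gives 236, 235, 235, 235, 235, which sums to 1176 = 5 × 235.2, so the fractional part is absorbed over five blocks.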
21+
22+
The `process` function accepts two parameters:
23+
- `output`: Output buffer, provided as a univector with the desired size (512).
24+
- `input`: Input buffer, provided as a univector of at least the size returned by `input_size_for_output`. The resampler consumes as many samples as it needs to generate 512 output samples and returns the number of input samples read. Skip that many samples in the input before the next call.
25+
26+
For the `push` method, call `output_size_for_input` with the size of your input buffer. This function returns the corresponding output buffer size required to receive all pending output.
27+
28+
### Example (pull)
29+
30+
```c++
31+
// Initialization
32+
auto src = samplerate_converter<double>(sample_rate_conversion_quality::high, output_samplerate, input_samplerate);
33+
34+
void process_chunk(univector_ref<double> output) {
35+
univector<double> input(src.input_size_for_output(output.size()));
36+
// fill `input` with input samples
37+
src.process(output, input);
38+
// `output` now contains resampled version of input
39+
}
40+
```
41+
42+
### Example (push)
43+
44+
```c++
45+
// Initialization
46+
auto src = samplerate_converter<double>(resample_quality::high, output_sr, input_sr);
47+
48+
void process_chunk(univector_ref<const double> input) {
49+
univector<double> output(src.output_size_for_input(input.size()));
50+
src.process(output, input);
51+
// `output` now contains resampled version of input
52+
}
53+
```
54+
55+
[See also a gallery with results of applying various SRC presets](src_gallery.md)

Diff for: ‎docs/mkdocs.yml

+1
@@ -74,6 +74,7 @@ nav:
7474
- expressions.md
7575
- capi.md
7676
- upgrade6.md
77+
- benchmarking.md
7778
- DSP:
7879
- fir.md
7980
- bq.md

Diff for: ‎include/kfr/dft/fft.hpp

+155-18
@@ -54,6 +54,7 @@ namespace kfr
5454
using cdirect_t = cfalse_t;
5555
using cinvert_t = ctrue_t;
5656

57+
/// @brief Internal structure representing a single DFT stage
5758
template <typename T>
5859
struct dft_stage
5960
{
@@ -106,16 +107,29 @@ enum class dft_type
106107
inverse
107108
};
108109

110+
/**
111+
* @brief Specifies the desired order for DFT output (and IDFT input)
112+
*
113+
* Currently ignored.
114+
*/
109115
enum class dft_order
110116
{
111-
normal,
117+
normal, // Normal order
112118
internal, // possibly bit/digit-reversed, implementation-defined, may be faster to compute
113119
};
114120

121+
/**
122+
* @brief Specifies the packing format for real DFT output data.
123+
* See https://www.kfr.dev/docs/latest/dft_format/ for details
124+
*/
115125
enum class dft_pack_format
116126
{
117-
Perm, // {X[0].r, X[N].r}, ... {X[i].r, X[i].i}, ... {X[N-1].r, X[N-1].i}
118-
CCs // {X[0].r, 0}, ... {X[i].r, X[i].i}, ... {X[N-1].r, X[N-1].i}, {X[N].r, 0}
127+
/// Packed format: {X[0].r, X[N].r}, ... {X[i].r, X[i].i}, ... {X[N-1].r, X[N-1].i}
128+
/// Number of complex samples is $\frac{N}{2}$ where N is the number of real samples
129+
Perm,
130+
/// Conjugate-symmetric format: {X[0].r, 0}, ... {X[i].r, X[i].i}, ... {X[N-1].r, X[N-1].i}, {X[N].r, 0}
131+
/// Number of complex samples is $\frac{N}{2}+1$ where N is the number of real samples
132+
CCs,
119133
};
120134

121135
template <typename T>
@@ -124,9 +138,6 @@ struct dft_plan;
124138
template <typename T>
125139
struct dft_plan_real;
126140

127-
template <typename T>
128-
struct dft_stage;
129-
130141
template <typename T>
131142
using dft_stage_ptr = std::unique_ptr<dft_stage<T>>;
132143

@@ -146,23 +157,66 @@ void dft_initialize_transpose(fn_transpose<T>& transpose);
146157

147158
} // namespace internal_generic
148159

149-
/// @brief 1D DFT/FFT
160+
/**
161+
* @brief Class for performing 1D DFT/FFT.
162+
*
163+
* The same plan is used for both direct DFT and inverse DFT. The type is default-constructible and movable
164+
* but non-copyable. It is advisable to create an instance of the `dft_plan` with a specific size
165+
* beforehand and reuse this instance in all subsequent DFT operations.
166+
*
167+
* @tparam T Template parameter specifying the floating-point type. Must be either `float` or `double`;
168+
* other types are not supported.
169+
*/
150170
template <typename T>
151171
struct dft_plan
152172
{
173+
/// The size of the DFT as passed to the constructor.
153174
size_t size;
175+
176+
/// The temporary (scratch) buffer size for the DFT plan.
177+
/// @note Preallocating a byte buffer of this size and passing its pointer to the
178+
/// `execute` function may improve performance.
154179
size_t temp_size;
155180

181+
/**
182+
* @brief Constructs an empty DFT plan.
183+
*
184+
* This default constructor ensures the type is default-constructible.
185+
*/
156186
dft_plan()
157187
: size(0), temp_size(0), data_size(0), arblen(false), disposition_inplace{}, disposition_outofplace{}
158188
{
159189
}
160190

161-
dft_plan(const dft_plan&) = delete;
162-
dft_plan(dft_plan&&) = default;
191+
/**
192+
* @brief Copy constructor (deleted).
193+
*
194+
* Copying of `dft_plan` instances is not allowed.
195+
*/
196+
dft_plan(const dft_plan&) = delete;
197+
198+
/**
199+
* @brief Copy assignment operator (deleted).
200+
*
201+
* Copy assignment of `dft_plan` instances is not allowed.
202+
*/
163203
dft_plan& operator=(const dft_plan&) = delete;
164-
dft_plan& operator=(dft_plan&&) = default;
165204

205+
/**
206+
* @brief Move constructor.
207+
*/
208+
dft_plan(dft_plan&&) = default;
209+
210+
/**
211+
* @brief Move assignment operator.
212+
*/
213+
dft_plan& operator=(dft_plan&&) = default;
214+
215+
/**
216+
* @brief Checks whether the plan is non-empty.
217+
*
218+
* @return `true` if the plan was constructed with a specific DFT size, `false` otherwise.
219+
*/
166220
bool is_initialized() const { return size != 0; }
167221

168222
[[deprecated("cpu parameter is deprecated. Runtime dispatch is used if built with "
@@ -172,14 +226,36 @@ struct dft_plan
172226
{
173227
(void)cpu;
174228
}
229+
230+
/**
231+
* @brief Constructs a DFT plan with the specified size and order.
232+
*
233+
* @param size The size of the DFT.
234+
* @param order The order of the DFT samples. See `dft_order`.
235+
*/
175236
explicit dft_plan(size_t size, dft_order order = dft_order::normal)
176237
: size(size), temp_size(0), data_size(0), arblen(false)
177238
{
178239
internal_generic::dft_initialize(*this);
179240
}
180241

242+
/**
243+
* @brief Dumps details of the DFT plan to stdout for inspection.
244+
*
245+
* May be used to determine the selected architecture at runtime and the chosen DFT algorithms.
246+
*/
181247
void dump() const;
182248

249+
/**
250+
* @brief Execute the complex DFT on `in` and write the result to `out`.
251+
* @param out Pointer to the output data.
252+
* @param in Pointer to the input data.
253+
* @param temp Temporary (scratch) buffer. If `NULL`, scratch buffer of size
254+
* `plan->temp_size` will be allocated on stack or heap.
255+
* @param inverse If true, apply the inverse DFT.
256+
* @note No scaling is applied. This function reads $N$ complex values from `in` and writes $N$ complex
257+
* values to `out`, where $N$ is the size passed to the constructor.
258+
*/
183259
KFR_MEM_INTRINSIC void execute(complex<T>* out, const complex<T>* in, u8* temp,
184260
bool inverse = false) const
185261
{
@@ -188,14 +264,41 @@ struct dft_plan
188264
else
189265
execute_dft(cfalse, out, in, temp);
190266
}
267+
268+
/**
269+
* @brief Destructor.
270+
*
271+
* Deallocates internal data.
272+
*/
191273
~dft_plan() {}
274+
275+
/**
276+
* @brief Execute the complex DFT on `in` and write the result to `out`.
277+
* @param out Pointer to the output data.
278+
* @param in Pointer to the input data.
279+
* @param temp Temporary (scratch) buffer. If `NULL`, scratch buffer of size
280+
* `plan->temp_size` will be allocated on stack or heap.
281+
* @tparam inverse If true, apply the inverse DFT.
282+
* @note No scaling is applied. This function reads $N$ complex values from `in` and writes $N$ complex
283+
* values to `out`, where $N$ is the size passed to the constructor.
284+
*/
192285
template <bool inverse>
193286
KFR_MEM_INTRINSIC void execute(complex<T>* out, const complex<T>* in, u8* temp,
194287
cbool_t<inverse> inv) const
195288
{
196289
execute_dft(inv, out, in, temp);
197290
}
198291

292+
/**
293+
* @brief Execute the complex DFT on `in` and write the result to `out`.
294+
* @param out Output data (univector).
295+
* @param in Input data (univector).
296+
* @param temp Temporary (scratch) buffer. If `NULL`, scratch buffer of size
297+
* `plan->temp_size` will be allocated on stack or heap.
298+
* @param inverse If true, apply the inverse DFT.
299+
* @note No scaling is applied. This function reads $N$ complex values from `in` and writes $N$ complex
300+
* values to `out`, where $N$ is the size passed to the constructor.
301+
*/
199302
template <univector_tag Tag1, univector_tag Tag2, univector_tag Tag3>
200303
KFR_MEM_INTRINSIC void execute(univector<complex<T>, Tag1>& out, const univector<complex<T>, Tag2>& in,
201304
univector<u8, Tag3>& temp, bool inverse = false) const
@@ -205,13 +308,34 @@ struct dft_plan
205308
else
206309
execute_dft(cfalse, out.data(), in.data(), temp.data());
207310
}
311+
312+
/**
313+
* @brief Execute the complex DFT on `in` and write the result to `out`.
314+
* @param out Output data (univector).
315+
* @param in Input data (univector).
316+
* @param temp Temporary (scratch) buffer. If `NULL`, scratch buffer of size
317+
* `plan->temp_size` will be allocated on stack or heap.
318+
* @tparam inverse If true, apply the inverse DFT.
319+
* @note No scaling is applied. This function reads $N$ complex values from `in` and writes $N$ complex
320+
* values to `out`, where $N$ is the size passed to the constructor.
321+
*/
208322
template <bool inverse, univector_tag Tag1, univector_tag Tag2, univector_tag Tag3>
209323
KFR_MEM_INTRINSIC void execute(univector<complex<T>, Tag1>& out, const univector<complex<T>, Tag2>& in,
210324
univector<u8, Tag3>& temp, cbool_t<inverse> inv) const
211325
{
212326
execute_dft(inv, out.data(), in.data(), temp.data());
213327
}
214328

329+
/**
330+
* @brief Execute the complex DFT on `in` and write the result to `out`.
331+
* @param out Output data (univector).
332+
* @param in Input data (univector).
333+
* @param temp Temporary (scratch) buffer. If `NULL`, scratch buffer of size
334+
* `plan->temp_size` will be allocated on stack or heap.
335+
* @param inverse If true, apply the inverse DFT.
336+
* @note No scaling is applied. This function reads $N$ complex values from `in` and writes $N$ complex
337+
* values to `out`, where $N$ is the size passed to the constructor.
338+
*/
215339
template <univector_tag Tag1, univector_tag Tag2>
216340
KFR_MEM_INTRINSIC void execute(univector<complex<T>, Tag1>& out, const univector<complex<T>, Tag2>& in,
217341
u8* temp, bool inverse = false) const
@@ -221,25 +345,38 @@ struct dft_plan
221345
else
222346
execute_dft(cfalse, out.data(), in.data(), temp);
223347
}
348+
349+
/**
350+
* @brief Execute the complex DFT on `in` and write the result to `out`.
351+
* @param out Output data (univector).
352+
* @param in Input data (univector).
353+
* @param temp Temporary (scratch) buffer. If `NULL`, scratch buffer of size
354+
* `plan->temp_size` will be allocated on stack or heap.
355+
* @tparam inverse If true, apply the inverse DFT.
356+
* @note No scaling is applied. This function reads $N$ complex values from `in` and writes $N$ complex
357+
* values to `out`, where $N$ is the size passed to the constructor.
358+
*/
224359
template <bool inverse, univector_tag Tag1, univector_tag Tag2>
225360
KFR_MEM_INTRINSIC void execute(univector<complex<T>, Tag1>& out, const univector<complex<T>, Tag2>& in,
226361
u8* temp, cbool_t<inverse> inv) const
227362
{
228363
execute_dft(inv, out.data(), in.data(), temp);
229364
}
230365

231-
autofree<u8> data;
232-
size_t data_size;
366+
autofree<u8> data; /**< Internal data. */
367+
size_t data_size; /**< Internal data size. */
233368

234-
std::vector<dft_stage_ptr<T>> all_stages;
235-
std::array<std::vector<dft_stage<T>*>, 2> stages;
236-
bool arblen;
237-
using bitset = std::bitset<DFT_MAX_STAGES>;
238-
std::array<bitset, 2> disposition_inplace;
239-
std::array<bitset, 2> disposition_outofplace;
369+
std::vector<dft_stage_ptr<T>> all_stages; /**< Internal data. */
370+
std::array<std::vector<dft_stage<T>*>, 2> stages; /**< Internal data. */
371+
bool arblen; /**< True if Bluestein's FFT algorithm is selected. */
372+
using bitset = std::bitset<DFT_MAX_STAGES>; /**< Internal typedef. */
373+
std::array<bitset, 2> disposition_inplace; /**< Internal data. */
374+
std::array<bitset, 2> disposition_outofplace; /**< Internal data. */
240375

376+
/// Internal function
241377
void calc_disposition();
242378

379+
/// Internal function
243380
static bitset precompute_disposition(int num_stages, bitset can_inplace_per_stage,
244381
bool inplace_requested);
245382
