Update documentation

dancazarin · dancazarin · commit 288ffc81d6da · 2025-02-27T16:13:32.000+01:00
diff --git a/docs/docs/benchmarking.md b/docs/docs/benchmarking.md
@@ -0,0 +1,38 @@
+# Benchmarking DFT
+
+Accurate DFT benchmarking requires careful control of optimizations, CPU architecture, and compiler behavior. Follow these guidelines to ensure reliable performance measurements.
+
+> [!note]
+> A robust FFT benchmark suite implementing all these techniques is published at https://github.com/kfrlib/fft-benchmark
+
+- Ensure that the optimized version of each library is used. If the vendor provides prebuilt binaries, use them.
+  - For KFR, the official binaries can be found at: https://github.com/kfrlib/kfr/releases.
+  - To verify that KFR is optimized for maximum performance, call:
+
+  ```c++
+  library_version()
+  ```
+  Example output:
+  ```
+  KFR 6.1.1 optimized sse2 [sse2, sse41, avx, avx2, avx512] 64-bit (clang-msvc-19.1.0/windows) +in +ve
+  ```
+  The output must include the `optimized` flag and must not contain the `debug` flag.
+
+- For libraries that support dynamic CPU dispatch, ensure that the best available architecture for your CPU is selected at runtime. Refer to the library documentation to learn how to verify this.
+  - For KFR, call:
+
+  ```c++
+  cpu_runtime()
+  ```
+  This function returns the selected architecture, such as `avx2`, `avx512`, or `neon`/`neon64` (for ARM).
+
+- Ensure that no emulation is involved. For example, use native `arm64` binaries for Apple M-series CPUs.
+
+- Exclude plan creation from the time measurements.
+
+- Ensure that the compiler does not optimize out the DFT code. Add code that utilizes the output data in some way to prevent the compiler from optimizing away the computation.
+
+- Perform multiple invocations to obtain reliable results. A few seconds of execution time is the minimum requirement for accurate measurements.
+
+- Use the median or minimum of all measured execution times rather than the simple mean, as this better protects against unexpected spikes and benchmarking noise.
+
diff --git a/docs/docs/src.md b/docs/docs/src.md
@@ -1,4 +1,55 @@
 # How to do Sample Rate Conversion
 
+## How to apply a Sample Rate Conversion to a contiguous signal?
 
-[See also a gallery with results of applying various SRC presets](src_gallery.md)
+For a continuous signal, the same instance of the `samplerate_converter` class should be used across all subsequent calls, rather than creating a new instance for each fragment. In the case of stereo audio, two instances (one per channel) are required.
+
+The `samplerate_converter` class supports both `push` and `pull` methods for handling data flow.  
+
+- **`push`**: Input data of a fixed size is provided, and all available output data is received.  
+  **Example**: Processing audio from a microphone, where the sound device sends data in fixed-size chunks.  
+
+- **`pull`**: An output buffer of a fixed size is provided, and the necessary amount of input data is processed to generate the required output.  
+  **Example**: Streaming audio at a different sample rate to a sound device, where a specific number of output samples must be generated to fill the device buffer.
+
+Let’s consider the case of resampling 44.1 kHz to 96 kHz with an output buffer of 512 samples (`pull`).  
+The corresponding input size should be 235.2, which is not an integer.
+
+The `samplerate_converter` class processes signals split into buffers of different sizes by maintaining an internal state.
+
+To determine the required input buffer size for the next call to `process`, `input_size_for_output` can be used by passing the desired output buffer length. This will return either 236 or 235 samples in the 44.1khz to 96khz scenario.
+
+The `process` function accepts two parameters:  
+- `output`: Output buffer, provided as a univector with the desired size (512).  
+- `input`: Input buffer, provided as a univector of at least the size returned by `input_size_for_output`. The resampler consumes the necessary number of samples to generate 512 output samples and returns the number of input samples read. The input should be adjusted accordingly to skip these samples.
+
+For the `push` method, call `output_size_for_input` with the size of your input buffer. This function returns the corresponding output buffer size required to receive all pending output.
+
+### Example (pull)
+
+```c++
+// Initialization
+auto src = samplerate_converter<double>(sample_rate_conversion_quality::high, output_samplerate, input_samplerate);
+
+void process_chunk(univector_ref<double> output) {
+    univector<double> input(src.input_size_for_output(output.size()));
+    // fill `input` with input samples
+    src.process(output, input);
+    // `output` now contains resampled version of input
+}
+```
+
+### Example (push)
+
+```c++
+// Initialization
+auto src = samplerate_converter<double>(resample_quality::high, output_sr, input_sr);
+
+void process_chunk(univector_ref<const double> input) {
+    univector<double> output(src.output_size_for_input(input.size()));
+    src.process(output, input);
+    // `output` now contains resampled version of input
+}
+```
+
+[See also a gallery with results of applying various SRC presets](src_gallery.md)
diff --git a/docs/mkdocs.yml b/docs/mkdocs.yml
@@ -74,6 +74,7 @@ nav:
       - expressions.md
       - capi.md
       - upgrade6.md
+      - benchmarking.md
   - DSP:
       - fir.md
       - bq.md
diff --git a/include/kfr/dft/fft.hpp b/include/kfr/dft/fft.hpp
@@ -54,6 +54,7 @@ namespace kfr
 using cdirect_t = cfalse_t;
 using cinvert_t = ctrue_t;
 
+/// @brief Internal structure representing a single DFT stage
 template <typename T>
 struct dft_stage
 {
@@ -106,16 +107,29 @@ enum class dft_type
     inverse
 };
 
+/**
+ * @brief Specifies the desired order for DFT output (and IDFT input)
+ *
+ * Currenly ignored.
+ */
 enum class dft_order
 {
-    normal,
+    normal, // Normal order
     internal, // possibly bit/digit-reversed, implementation-defined, may be faster to compute
 };
 
+/**
+ * @brief Specifies the packing format for real DFT output data.
+ * See https://www.kfr.dev/docs/latest/dft_format/ for details
+ */
 enum class dft_pack_format
 {
-    Perm, // {X[0].r, X[N].r}, ... {X[i].r, X[i].i}, ... {X[N-1].r, X[N-1].i}
-    CCs // {X[0].r, 0}, ... {X[i].r, X[i].i}, ... {X[N-1].r, X[N-1].i},  {X[N].r, 0}
+    /// Packed format: {X[0].r, X[N].r}, ... {X[i].r, X[i].i}, ... {X[N-1].r, X[N-1].i}
+    /// Number of complex samples is $\frac{N}{2}$ where N is the number of real samples
+    Perm,
+    /// Conjugate-symmetric format: {X[0].r, 0}, ... {X[i].r, X[i].i}, ... {X[N-1].r, X[N-1].i}, {X[N].r, 0}
+    /// Number of complex samples is $\frac{N}{2}+1$ where N is the number of real samples
+    CCs,
 };
 
 template <typename T>
@@ -124,9 +138,6 @@ struct dft_plan;
 template <typename T>
 struct dft_plan_real;
 
-template <typename T>
-struct dft_stage;
-
 template <typename T>
 using dft_stage_ptr = std::unique_ptr<dft_stage<T>>;
 
@@ -146,23 +157,66 @@ void dft_initialize_transpose(fn_transpose<T>& transpose);
 
 } // namespace internal_generic
 
-/// @brief 1D DFT/FFT
+/**
+ * @brief Class for performing 1D DFT/FFT.
+ *
+ * The same plan is used for both direct DFT and inverse DFT. The type is default-constructible and movable
+ * but non-copyable. It is advisable to create an instance of the `dft_plan` with a specific size
+ * beforehand and reuse this instance in all subsequent DFT operations.
+ *
+ * @tparam T Template parameter specifying the floating-point type. Must be either `float` or `double`;
+ *           other types are not supported.
+ */
 template <typename T>
 struct dft_plan
 {
+    /// The size of the DFT as passed to the contructor.
     size_t size;
+
+    /// The temporary (scratch) buffer size for the DFT plan.
+    /// @note Preallocating a byte buffer of this size and passing its pointer to the
+    /// `execute` function may improve performance.
     size_t temp_size;
 
+    /**
+     * @brief Constructs an empty DFT plan.
+     *
+     * This default constructor ensures the type is default-constructible.
+     */
     dft_plan()
         : size(0), temp_size(0), data_size(0), arblen(false), disposition_inplace{}, disposition_outofplace{}
     {
     }
 
-    dft_plan(const dft_plan&)            = delete;
-    dft_plan(dft_plan&&)                 = default;
+    /**
+     * @brief Copy constructor (deleted).
+     *
+     * Copying of `dft_plan` instances is not allowed.
+     */
+    dft_plan(const dft_plan&) = delete;
+
+    /**
+     * @brief Copy assignment operator (deleted).
+     *
+     * Copy assignment of `dft_plan` instances is not allowed.
+     */
     dft_plan& operator=(const dft_plan&) = delete;
-    dft_plan& operator=(dft_plan&&)      = default;
 
+    /**
+     * @brief Move constructor.
+     */
+    dft_plan(dft_plan&&) = default;
+
+    /**
+     * @brief Move assignment operator.
+     */
+    dft_plan& operator=(dft_plan&&) = default;
+
+    /**
+     * @brief Checks whether the plan is non-empty.
+     *
+     * @return `true` if the plan was constructed with a specific DFT size, `false` otherwise.
+     */
     bool is_initialized() const { return size != 0; }
 
     [[deprecated("cpu parameter is deprecated. Runtime dispatch is used if built with "
@@ -172,14 +226,36 @@ struct dft_plan
     {
         (void)cpu;
     }
+
+    /**
+     * @brief Constructs a DFT plan with the specified size and order.
+     *
+     * @param size The size of the DFT.
+     * @param order The order of the DFT samples. See `dft_order`.
+     */
     explicit dft_plan(size_t size, dft_order order = dft_order::normal)
         : size(size), temp_size(0), data_size(0), arblen(false)
     {
         internal_generic::dft_initialize(*this);
     }
 
+    /**
+     * @brief Dumps details of the DFT plan to stdout for inspection.
+     *
+     * May be used to determine the selected architecture at runtime and the chosen DFT algorithms.
+     */
     void dump() const;
 
+    /**
+     * @brief Execute the complex DFT on `in` and write the result to `out`.
+     * @param out Pointer to the output data.
+     * @param in Pointer to the input data.
+     * @param temp Temporary (scratch) buffer. If `NULL`, scratch buffer of size
+     * `plan->temp_size` will be allocated on stack or heap.
+     * @param inverse If true, apply the inverse DFT.
+     * @note No scaling is applied. This function reads $N$ complex values from `in` and writes $N$ complex
+     * values to `out`, where $N$ is the size passed to the constructor.
+     */
     KFR_MEM_INTRINSIC void execute(complex<T>* out, const complex<T>* in, u8* temp,
                                    bool inverse = false) const
     {
@@ -188,14 +264,41 @@ struct dft_plan
         else
             execute_dft(cfalse, out, in, temp);
     }
+
+    /**
+     * @brief Destructor.
+     *
+     * Deallocates internal data.
+     */
     ~dft_plan() {}
+
+    /**
+     * @brief Execute the complex DFT on `in` and write the result to `out`.
+     * @param out Pointer to the output data.
+     * @param in Pointer to the input data.
+     * @param temp Temporary (scratch) buffer. If `NULL`, scratch buffer of size
+     * `plan->temp_size` will be allocated on stack or heap.
+     * @tparam inverse If true, apply the inverse DFT.
+     * @note No scaling is applied. This function reads $N$ complex values from `in` and writes $N$ complex
+     * values to `out`, where $N$ is the size passed to the constructor.
+     */
     template <bool inverse>
     KFR_MEM_INTRINSIC void execute(complex<T>* out, const complex<T>* in, u8* temp,
                                    cbool_t<inverse> inv) const
     {
         execute_dft(inv, out, in, temp);
     }
 
+    /**
+     * @brief Execute the complex DFT on `in` and write the result to `out`.
+     * @param out Pointer to the output data.
+     * @param in Pointer to the input data.
+     * @param temp Temporary (scratch) buffer. If `NULL`, scratch buffer of size
+     * `plan->temp_size` will be allocated on stack or heap.
+     * @param inverse If true, apply the inverse DFT.
+     * @note No scaling is applied. This function reads $N$ complex values from `in` and writes $N$ complex
+     * values to `out`, where $N$ is the size passed to the constructor.
+     */
     template <univector_tag Tag1, univector_tag Tag2, univector_tag Tag3>
     KFR_MEM_INTRINSIC void execute(univector<complex<T>, Tag1>& out, const univector<complex<T>, Tag2>& in,
                                    univector<u8, Tag3>& temp, bool inverse = false) const
@@ -205,13 +308,34 @@ struct dft_plan
         else
             execute_dft(cfalse, out.data(), in.data(), temp.data());
     }
+
+    /**
+     * @brief Execute the complex DFT on `in` and write the result to `out`.
+     * @param out Pointer to the output data.
+     * @param in Pointer to the input data.
+     * @param temp Temporary (scratch) buffer. If `NULL`, scratch buffer of size
+     * `plan->temp_size` will be allocated on stack or heap.
+     * @tparam inverse If true, apply the inverse DFT.
+     * @note No scaling is applied. This function reads $N$ complex values from `in` and writes $N$ complex
+     * values to `out`, where $N$ is the size passed to the constructor.
+     */
     template <bool inverse, univector_tag Tag1, univector_tag Tag2, univector_tag Tag3>
     KFR_MEM_INTRINSIC void execute(univector<complex<T>, Tag1>& out, const univector<complex<T>, Tag2>& in,
                                    univector<u8, Tag3>& temp, cbool_t<inverse> inv) const
     {
         execute_dft(inv, out.data(), in.data(), temp.data());
     }
 
+    /**
+     * @brief Execute the complex DFT on `in` and write the result to `out`.
+     * @param out Pointer to the output data.
+     * @param in Pointer to the input data.
+     * @param temp Temporary (scratch) buffer. If `NULL`, scratch buffer of size
+     * `plan->temp_size` will be allocated on stack or heap.
+     * @param inverse If true, apply the inverse DFT.
+     * @note No scaling is applied. This function reads $N$ complex values from `in` and writes $N$ complex
+     * values to `out`, where $N$ is the size passed to the constructor.
+     */
     template <univector_tag Tag1, univector_tag Tag2>
     KFR_MEM_INTRINSIC void execute(univector<complex<T>, Tag1>& out, const univector<complex<T>, Tag2>& in,
                                    u8* temp, bool inverse = false) const
@@ -221,25 +345,38 @@ struct dft_plan
         else
             execute_dft(cfalse, out.data(), in.data(), temp);
     }
+
+    /**
+     * @brief Execute the complex DFT on `in` and write the result to `out`.
+     * @param out Pointer to the output data.
+     * @param in Pointer to the input data.
+     * @param temp Temporary (scratch) buffer. If `NULL`, scratch buffer of size
+     * `plan->temp_size` will be allocated on stack or heap.
+     * @tparam inverse If true, apply the inverse DFT.
+     * @note No scaling is applied. This function reads $N$ complex values from `in` and writes $N$ complex
+     * values to `out`, where $N$ is the size passed to the constructor.
+     */
     template <bool inverse, univector_tag Tag1, univector_tag Tag2>
     KFR_MEM_INTRINSIC void execute(univector<complex<T>, Tag1>& out, const univector<complex<T>, Tag2>& in,
                                    u8* temp, cbool_t<inverse> inv) const
     {
         execute_dft(inv, out.data(), in.data(), temp);
     }
 
-    autofree<u8> data;
-    size_t data_size;
+    autofree<u8> data; /**< Internal data. */
+    size_t data_size; /**< Internal data size. */
 
-    std::vector<dft_stage_ptr<T>> all_stages;
-    std::array<std::vector<dft_stage<T>*>, 2> stages;
-    bool arblen;
-    using bitset = std::bitset<DFT_MAX_STAGES>;
-    std::array<bitset, 2> disposition_inplace;
-    std::array<bitset, 2> disposition_outofplace;
+    std::vector<dft_stage_ptr<T>> all_stages; /**< Internal data. */
+    std::array<std::vector<dft_stage<T>*>, 2> stages; /**< Internal data. */
+    bool arblen; /**< True if Bluestein's FFT algorithm is selected. */
+    using bitset = std::bitset<DFT_MAX_STAGES>; /**< Internal typedef. */
+    std::array<bitset, 2> disposition_inplace; /**< Internal data. */
+    std::array<bitset, 2> disposition_outofplace; /**< Internal data. */
 
+    /// Internal function
     void calc_disposition();
 
+    /// Internal function
     static bitset precompute_disposition(int num_stages, bitset can_inplace_per_stage,
                                          bool inplace_requested);