Version: 1.0

reduction

Implements GPU reduction algorithms for parallel data aggregation.

Functions

  • block_reduce: Performs a block-level reduction of a single SIMD value across all threads in a GPU thread block using warp-level primitives and shared memory.
  • reduce_kernel: GPU kernel that reduces rows along a given axis. Each block reduces one row at a time using row_reduce and writes the result via output_fn. Uses a grid-stride loop to handle more rows than blocks.
  • reduce_launch: Selects and launches the appropriate GPU reduction kernel based on the tensor shape, axis, and device saturation level.
  • row_reduce: Reduces a single row along the given axis using block-level cooperative reduction. Delegates to the multi-reduction row_reduce overload with num_reductions=1.
  • saturated_reduce_kernel: GPU kernel for reductions when there are enough rows to saturate the device. Each thread independently reduces an entire row using SIMD packing, avoiding shared-memory synchronization entirely. Used when reducing along a non-contiguous axis.
  • small_reduce_kernel: GPU kernel optimized for rows smaller than the warp size. Each warp reduces an entire row independently, allowing multiple rows to be reduced per block without shared-memory synchronization.
  • twophase_reduce_kernel: GPU kernel for reductions when there are too few rows to saturate the device at one block per row. Assigns multiple blocks per row and uses a two-phase approach: each block reduces a chunk via cooperative block-level reduction, then the last block to finish (detected via a per-row atomic counter) reduces all partial results for its row.
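The tree reduction that block_reduce performs with warp-level shuffle primitives can be illustrated with a CPU simulation. This is a hedged sketch in Python, not the library's implementation: `warp_reduce` and `op` are hypothetical names, and the loop models a shuffle-down tree where, at each offset, lane i combines its value with lane i + offset.

```python
def warp_reduce(lanes, op):
    """CPU simulation of a shuffle-down tree reduction across one warp.

    `lanes` holds one value per thread; assumes a power-of-two lane count,
    as a real warp does. After log2(n) halving steps, lane 0 holds the
    warp-wide result.
    """
    vals = list(lanes)
    offset = len(vals) // 2
    while offset > 0:
        # Each lane i < offset combines with its partner at i + offset,
        # mimicking a shuffle-down by `offset` lanes.
        for i in range(offset):
            vals[i] = op(vals[i], vals[i + offset])
        offset //= 2
    return vals[0]

# Example: summing the lane IDs of a 32-lane warp.
total = warp_reduce(range(32), lambda a, b: a + b)  # 0 + 1 + ... + 31 = 496
```

On real hardware each step is a single shuffle instruction executed by all lanes in parallel, so the whole warp reduces in log2(32) = 5 steps with no shared memory.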
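The grid-stride loop that reduce_kernel uses to handle more rows than blocks can also be sketched on the CPU. In this hedged illustration (the names `grid_stride_rows`, `block_idx`, and `grid_size` are assumptions, not the library's API), each block starts at its own index and strides forward by the grid size until the rows run out.

```python
def grid_stride_rows(block_idx, grid_size, num_rows):
    """Yield the row indices one block covers under a grid-stride loop.

    Block b handles rows b, b + grid_size, b + 2 * grid_size, ..., so a
    fixed-size grid covers any number of rows with balanced work.
    """
    row = block_idx
    while row < num_rows:
        yield row
        row += grid_size

# Example: with 4 blocks and 10 rows, block 1 reduces rows 1, 5, and 9.
rows_for_block_1 = list(grid_stride_rows(1, 4, 10))
```

Across all blocks the yielded indices partition the row range exactly once, which is what lets the kernel launch a device-sized grid regardless of the tensor's row count.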
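The two-phase scheme in twophase_reduce_kernel can likewise be modeled sequentially. This sketch is an assumption-laden illustration (names like `twophase_reduce` and the contiguous chunking are hypothetical): each "block" reduces its chunk of the row to a partial result, a counter stands in for the per-row atomic, and whichever block brings the counter to the block count performs the second phase over the partials.

```python
from functools import reduce

def twophase_reduce(row, num_blocks, op):
    """Sequential model of a two-phase GPU reduction over one row.

    Phase 1: each block reduces a contiguous chunk to a partial result.
    Phase 2: the last block to finish (counter == num_blocks) combines
    the partials into the final value.
    """
    chunk = -(-len(row) // num_blocks)      # ceil division: chunk size per block
    partials = [None] * num_blocks
    finished = 0                            # stands in for the per-row atomic counter
    result = None
    for b in range(num_blocks):             # on a GPU these run concurrently
        part = row[b * chunk : (b + 1) * chunk]
        partials[b] = reduce(op, part) if part else None
        finished += 1                       # atomic increment-and-check
        if finished == num_blocks:          # this block won the race: do phase 2
            result = reduce(op, (p for p in partials if p is not None))
    return result

# Example: 3 blocks summing a 10-element row in chunks of 4, 4, and 2.
total = twophase_reduce(list(range(10)), 3, lambda a, b: a + b)
```

The atomic counter avoids a second kernel launch: the final combine happens on-device as soon as the last partial is written, at the cost of one atomic per block per row.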