# reduction

Implements GPU reduction algorithms for parallel data aggregation.

## Functions
- `block_reduce`: Performs a block-level reduction of a single SIMD value across all threads in a GPU thread block using warp-level primitives and shared memory.
- `reduce_kernel`: GPU kernel that reduces rows along a given axis. Each block reduces one row at a time using `row_reduce` and writes the result via `output_fn`. Uses a grid-stride loop to handle more rows than blocks.
- `reduce_launch`: Selects and launches the appropriate GPU reduction kernel based on the tensor shape, axis, and device saturation level.
- `row_reduce`: Reduces a single row along the given axis using block-level cooperative reduction. Delegates to the multi-reduction `row_reduce` overload with `num_reductions=1`.
- `saturated_reduce_kernel`: GPU kernel for reductions when the device is saturated with enough rows. Each thread independently reduces an entire row using SIMD packing, avoiding shared-memory synchronization entirely. Used when reducing along a non-contiguous axis.
- `small_reduce_kernel`: GPU kernel optimized for rows smaller than the warp size. Each warp reduces an entire row independently, allowing multiple rows to be reduced per block without shared-memory synchronization.
- `twophase_reduce_kernel`: GPU kernel for reductions when there are too few rows to saturate the device at one block per row. Assigns multiple blocks per row and uses a two-phase approach: each block reduces a chunk via cooperative block-level reduction, then the last block to finish (detected via a per-row atomic counter) reduces all partial results for its row.
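The warp-level tree reduction underlying `block_reduce` can be sketched on the CPU. This is a Python model, not the module's implementation: a list stands in for one warp's lanes, and the loop mirrors the shuffle-down halving pattern a GPU would use (`WARP_SIZE` and the helper name are illustrative).

```python
# CPU sketch of a warp-level tree reduction, assuming a warp width of 32.
# On a GPU each step would be a shuffle-down primitive; here a plain list
# stands in for the lanes so the halving pattern is visible.

WARP_SIZE = 32

def warp_reduce(lanes, op):
    """Reduce one value per lane in log2(WARP_SIZE) halving steps."""
    lanes = list(lanes)
    offset = WARP_SIZE // 2
    while offset > 0:
        # Each active lane combines its value with the lane `offset`
        # positions away, like shfl_down on a GPU.
        for i in range(offset):
            lanes[i] = op(lanes[i], lanes[i + offset])
        offset //= 2
    return lanes[0]  # lane 0 ends up holding the warp's result

print(warp_reduce(range(32), lambda a, b: a + b))  # → 496 (sum of 0..31)
```

In the block-level version each warp would write its lane-0 result to shared memory, and the first warp would then run the same halving pattern over those per-warp partials.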
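The grid-stride loop that `reduce_kernel` is described as using can be modeled sequentially: with fewer blocks than rows, block *b* handles rows *b*, *b* + grid size, *b* + 2 × grid size, and so on, so every row is reduced exactly once. The function and parameter names below are illustrative stand-ins, not the module's API.

```python
# Sketch of a grid-stride loop over rows. On a GPU the outer loop over
# block_idx runs in parallel, one iteration per block; the inner while
# loop is the grid-stride loop each block executes.

def launch_grid_stride(rows, num_blocks, row_reduce):
    out = [None] * len(rows)
    for block_idx in range(num_blocks):      # "blocks" (parallel on a GPU)
        row = block_idx
        while row < len(rows):               # grid-stride loop
            out[row] = row_reduce(rows[row])
            row += num_blocks                # stride by the grid size
    return out

rows = [[1, 2, 3], [4, 5], [6], [7, 8, 9, 10]]
print(launch_grid_stride(rows, 2, sum))  # → [6, 9, 6, 34]
```

With 2 blocks and 4 rows, block 0 reduces rows 0 and 2 while block 1 reduces rows 1 and 3, which is why a fixed-size grid can cover an arbitrary number of rows.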
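The saturated strategy, where one thread reduces a whole row with SIMD packing, can be sketched as a vectorized accumulator followed by a horizontal reduction. The width of 4 and all names here are assumptions for illustration; real SIMD code would use hardware vectors rather than a Python list.

```python
# Sketch of one thread reducing an entire row with "SIMD packing":
# accumulate into a small vector, then horizontally reduce it, then
# handle the scalar tail. No shared memory or synchronization needed.

SIMD_WIDTH = 4  # illustrative vector width

def thread_reduce_row(row, op, init):
    acc = [init] * SIMD_WIDTH                # vector accumulator
    n = len(row) - len(row) % SIMD_WIDTH
    for i in range(0, n, SIMD_WIDTH):        # vectorized main loop
        for lane in range(SIMD_WIDTH):
            acc[lane] = op(acc[lane], row[i + lane])
    result = acc[0]                          # horizontal reduction of the vector
    for lane in range(1, SIMD_WIDTH):
        result = op(result, acc[lane])
    for x in row[n:]:                        # scalar tail
        result = op(result, x)
    return result

print(thread_reduce_row(list(range(10)), lambda a, b: a + b, 0))  # → 45
```

The same per-thread structure explains why this kernel suits non-contiguous axes: each thread's loads stride through memory independently, and no cross-thread combining step is required.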
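The two-phase scheme can be modeled for a single row with Python threads standing in for blocks: each "block" reduces its chunk to a partial, and the last block to bump a shared counter (the per-row atomic in the real kernel) combines the partials. This is a behavioral sketch under those assumptions, not the kernel's code.

```python
import threading

# Sketch of two-phase reduction of one row across several "blocks".
# Phase 1: each block reduces its own chunk to a partial result.
# Phase 2: the last block to increment the counter reduces the partials.

def twophase_reduce(row, num_blocks, op):
    chunk = (len(row) + num_blocks - 1) // num_blocks
    partials = [None] * num_blocks
    counter = 0
    lock = threading.Lock()                  # models the atomic counter
    result = []

    def block(b):
        nonlocal counter
        part = row[b * chunk:(b + 1) * chunk]
        acc = part[0]
        for x in part[1:]:                   # phase 1: reduce own chunk
            acc = op(acc, x)
        partials[b] = acc
        with lock:                           # atomic increment of the counter
            counter += 1
            last = counter == num_blocks
        if last:                             # phase 2: last block finishes the row
            total = partials[0]
            for p in partials[1:]:
                total = op(total, p)
            result.append(total)

    threads = [threading.Thread(target=block, args=(b,)) for b in range(num_blocks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result[0]

print(twophase_reduce(list(range(100)), 4, lambda a, b: a + b))  # → 4950
```

A real GPU kernel additionally needs a memory fence between writing a partial and incrementing the counter so the last block observes every other block's result; the lock plays that role here.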