# reduction

Implements GPU reduction algorithms for parallel data aggregation.

## Functions
- `block_reduce`: Performs a block-level reduction of a single SIMD value across all threads in a GPU thread block using warp-level primitives and shared memory.
- `reduce_kernel`: GPU kernel that reduces rows along a given axis. Each block reduces one row at a time using `row_reduce` and writes the result via `output_fn`. Uses a grid-stride loop to handle more rows than blocks.
- `reduce_launch`: Selects and launches the appropriate GPU reduction kernel based on the tensor shape, axis, and device saturation level.
- `row_reduce`: Reduces a single row along the given axis using block-level cooperative reduction. Delegates to the multi-reduction `row_reduce` overload with `num_reductions=1`.
- `saturated_reduce_kernel`: GPU kernel for reductions when the device is saturated with enough rows. Each thread independently reduces an entire row using SIMD packing, avoiding shared-memory synchronization entirely. Used when reducing along a non-contiguous axis.
- `small_reduce_kernel`: GPU kernel optimized for rows smaller than the warp size. Each warp reduces an entire row independently, allowing multiple rows to be reduced per block without shared-memory synchronization.
- `twophase_reduce_kernel`: GPU kernel for reductions when there are too few rows to saturate the device at one block per row. Assigns multiple blocks per row and uses a two-phase approach: each block reduces a chunk via cooperative block-level reduction, then the last block to finish (detected via a per-row atomic counter) reduces all partial results for its row.
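The warp-level tree reduction underlying `block_reduce` can be sketched on the CPU. This is a Python model, not the module's implementation: a list stands in for one warp's lanes, and the loop mirrors the shuffle-down halving pattern a GPU would use (`WARP_SIZE` and the helper name are illustrative).

```python
# CPU sketch of a warp-level tree reduction, assuming a warp width of 32.
# On a GPU each step would be a shuffle-down primitive; here a plain list
# stands in for the lanes so the halving pattern is visible.

WARP_SIZE = 32

def warp_reduce(lanes, op):
    """Reduce one value per lane in log2(WARP_SIZE) halving steps."""
    lanes = list(lanes)
    offset = WARP_SIZE // 2
    while offset > 0:
        # Each active lane combines its value with the lane `offset`
        # positions away, like shfl_down on a GPU.
        for i in range(offset):
            lanes[i] = op(lanes[i], lanes[i + offset])
        offset //= 2
    return lanes[0]  # lane 0 ends up holding the warp's result

print(warp_reduce(range(32), lambda a, b: a + b))  # → 496 (sum of 0..31)
```

In the block-level version each warp would write its lane-0 result to shared memory, and the first warp would then run the same halving pattern over those per-warp partials.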
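The grid-stride loop that `reduce_kernel` is described as using can be modeled sequentially: with fewer blocks than rows, block *b* handles rows *b*, *b* + grid size, *b* + 2 × grid size, and so on, so every row is reduced exactly once. The function and parameter names below are illustrative stand-ins, not the module's API.

```python
# Sketch of a grid-stride loop over rows. On a GPU the outer loop over
# block_idx runs in parallel, one iteration per block; the inner while
# loop is the grid-stride loop each block executes.

def launch_grid_stride(rows, num_blocks, row_reduce):
    out = [None] * len(rows)
    for block_idx in range(num_blocks):      # "blocks" (parallel on a GPU)
        row = block_idx
        while row < len(rows):               # grid-stride loop
            out[row] = row_reduce(rows[row])
            row += num_blocks                # stride by the grid size
    return out

rows = [[1, 2, 3], [4, 5], [6], [7, 8, 9, 10]]
print(launch_grid_stride(rows, 2, sum))  # → [6, 9, 6, 34]
```

With 2 blocks and 4 rows, block 0 reduces rows 0 and 2 while block 1 reduces rows 1 and 3, which is why a fixed-size grid can cover an arbitrary number of rows.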
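The saturated strategy, where one thread reduces a whole row with SIMD packing, can be sketched as a vectorized accumulator followed by a horizontal reduction. The width of 4 and all names here are assumptions for illustration; real SIMD code would use hardware vectors rather than a Python list.

```python
# Sketch of one thread reducing an entire row with "SIMD packing":
# accumulate into a small vector, then horizontally reduce it, then
# handle the scalar tail. No shared memory or synchronization needed.

SIMD_WIDTH = 4  # illustrative vector width

def thread_reduce_row(row, op, init):
    acc = [init] * SIMD_WIDTH                # vector accumulator
    n = len(row) - len(row) % SIMD_WIDTH
    for i in range(0, n, SIMD_WIDTH):        # vectorized main loop
        for lane in range(SIMD_WIDTH):
            acc[lane] = op(acc[lane], row[i + lane])
    result = acc[0]                          # horizontal reduction of the vector
    for lane in range(1, SIMD_WIDTH):
        result = op(result, acc[lane])
    for x in row[n:]:                        # scalar tail
        result = op(result, x)
    return result

print(thread_reduce_row(list(range(10)), lambda a, b: a + b, 0))  # → 45
```

The same per-thread structure explains why this kernel suits non-contiguous axes: each thread's loads stride through memory independently, and no cross-thread combining step is required.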
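The two-phase scheme can be modeled for a single row with Python threads standing in for blocks: each "block" reduces its chunk to a partial, and the last block to bump a shared counter (the per-row atomic in the real kernel) combines the partials. This is a behavioral sketch under those assumptions, not the kernel's code.

```python
import threading

# Sketch of two-phase reduction of one row across several "blocks".
# Phase 1: each block reduces its own chunk to a partial result.
# Phase 2: the last block to increment the counter reduces the partials.

def twophase_reduce(row, num_blocks, op):
    chunk = (len(row) + num_blocks - 1) // num_blocks
    partials = [None] * num_blocks
    counter = 0
    lock = threading.Lock()                  # models the atomic counter
    result = []

    def block(b):
        nonlocal counter
        part = row[b * chunk:(b + 1) * chunk]
        acc = part[0]
        for x in part[1:]:                   # phase 1: reduce own chunk
            acc = op(acc, x)
        partials[b] = acc
        with lock:                           # atomic increment of the counter
            counter += 1
            last = counter == num_blocks
        if last:                             # phase 2: last block finishes the row
            total = partials[0]
            for p in partials[1:]:
                total = op(total, p)
            result.append(total)

    threads = [threading.Thread(target=block, args=(b,)) for b in range(num_blocks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result[0]

print(twophase_reduce(list(range(100)), 4, lambda a, b: a + b))  # → 4950
```

A real GPU kernel additionally needs a memory fence between writing a partial and incrementing the counter so the last block observes every other block's result; the lock plays that role here.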