Version: Nightly

reduce_launch

reduce_launch[num_reductions: Int, input_fn: def[dtype: DType, width: Int, rank: Int](IndexList[rank]) capturing -> SIMD[dtype, width], output_fn: def[dtype: DType, width: Int, rank: Int](IndexList[rank], StaticTuple[SIMD[dtype, width], num_reductions]) capturing -> None, reduce_fn: def[ty: DType, width: Int, reduction_idx: Int](SIMD[ty, width], SIMD[ty, width]) capturing -> SIMD[ty, width], rank: Int, dtype: DType](shape: IndexList[rank], axis: Int, init: StaticTuple[Scalar[dtype], num_reductions], ctx: DeviceContext)

Selects and launches the appropriate GPU reduction kernel based on the tensor shape, axis, and device saturation level.

Three-tier dispatch:

  1. Thread-saturated (many rows, non-contiguous axis): one row per thread via saturated_reduce_kernel.
  2. Block-saturated (enough rows to fill SMs at one block per row): reduce_kernel or small_reduce_kernel.
  3. Under-saturated (too few rows to fill the device): multiple blocks per row via twophase_reduce_kernel with a two-phase atomic finish.
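The tier selection above can be modeled as a simple heuristic. The sketch below is an illustrative Python model only: the threshold values, the `threads_per_block` default, and the `axis_is_innermost` contiguity check are assumptions for exposition, not the actual cutoffs used by the Mojo implementation.

```python
def choose_reduction_kernel(num_rows: int, axis_is_innermost: bool,
                            num_sms: int, threads_per_block: int = 256) -> str:
    """Pick a reduction strategy by how well the rows saturate the device.

    Hypothetical model of reduce_launch's three-tier dispatch; the real
    thresholds are internal to the kernel library.
    """
    # Tier 1: enough rows to give every thread its own row, and the reduced
    # axis is not the contiguous (innermost) one, so per-thread strided
    # loads are acceptable -> saturated_reduce_kernel.
    if not axis_is_innermost and num_rows >= num_sms * threads_per_block:
        return "saturated_reduce_kernel"
    # Tier 2: enough rows to fill the SMs at one block per row
    # -> reduce_kernel (or small_reduce_kernel for short rows).
    if num_rows >= num_sms:
        return "reduce_kernel"
    # Tier 3: too few rows; split each row across multiple blocks and
    # combine partials with a two-phase atomic finish.
    return "twophase_reduce_kernel"
```

For example, a tensor with millions of rows reduced along a non-contiguous axis lands in tier 1, while reducing a handful of long rows falls through to the two-phase kernel.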

Parameters:

  • num_reductions (Int): The number of fused reductions to perform.
  • input_fn (def[dtype: DType, width: Int, rank: Int](IndexList[rank]) capturing -> SIMD[dtype, width]): The lambda to load input elements.
  • output_fn (def[dtype: DType, width: Int, rank: Int](IndexList[rank], StaticTuple[SIMD[dtype, width], num_reductions]) capturing -> None): The lambda to store output elements.
  • reduce_fn (def[ty: DType, width: Int, reduction_idx: Int](SIMD[ty, width], SIMD[ty, width]) capturing -> SIMD[ty, width]): The binary reduction function.
  • rank (Int): The tensor rank.
  • dtype (DType): The data type of the elements.

Args:

  • shape (IndexList): The shape of the input tensor.
  • axis (Int): The axis along which to reduce.
  • init (StaticTuple): The identity values for each reduction.
  • ctx (DeviceContext): The device context for GPU execution.

Raises:

An error if the GPU kernel launch fails.