IMPORTANT: To view this page as Markdown, append `.md` to the URL (e.g. /docs/manual/basics.md). For the complete Mojo documentation index, see llms.txt.
Version: 1.0.0b1

cp_async_bulk_reduce_global_shared_cta

cp_async_bulk_reduce_global_shared_cta[dtype: DType, /, *, reduction_kind: ReduceOp, eviction_policy: CacheEviction = CacheEviction.EVICT_NORMAL](dst_mem: UnsafePointer[Scalar[dtype], address_space=dst_mem.address_space], src_mem: UnsafePointer[Scalar[dtype], address_space=AddressSpace.SHARED], size: Int32)

Initiates an asynchronous bulk reduction from shared CTA memory into global memory.

Performs a non-blocking element-wise reduction of size bytes of shared memory into the matching locations in global memory, using the PTX cp.reduce.async.bulk instruction with the .bulk_group completion mechanism. Use cp_async_bulk_commit_group and cp_async_bulk_wait_group from std.gpu.sync to synchronize.

Both dst_mem and src_mem must be 16-byte aligned, and size must be a multiple of 16. Requires sm_100 or higher.
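
The constraints above can be checked on the host before launching a kernel. The following is a minimal Python sketch, not Mojo and not part of the Mojo API: the helper names `bulk_reduce_element_count` and `model_reduce_add` are hypothetical, the element widths are the standard sizes of the dtypes listed on this page, and rejecting negative sizes is an assumption. It shows how a byte count maps to an element count and what an element-wise ADD reduction computes.

```python
# Standard element widths in bytes for the floating-point dtypes listed on this page.
ELEMENT_BYTES = {"float16": 2, "bfloat16": 2, "float32": 4, "float64": 8}

ALIGNMENT = 16  # bytes; documented requirement for both pointers and for size


def bulk_reduce_element_count(dst_addr: int, src_addr: int, size: int, dtype: str) -> int:
    """Check the documented preconditions; return how many elements `size` bytes covers.

    Hypothetical helper for illustration only.
    """
    if dtype not in ELEMENT_BYTES:
        raise ValueError(f"dtype not covered by this sketch: {dtype}")
    if dst_addr % ALIGNMENT != 0 or src_addr % ALIGNMENT != 0:
        raise ValueError("dst_mem and src_mem must be 16-byte aligned")
    # Assumption: a negative byte count is treated as invalid.
    if size < 0 or size % ALIGNMENT != 0:
        raise ValueError("size must be a multiple of 16 bytes")
    return size // ELEMENT_BYTES[dtype]


def model_reduce_add(dst: list, src: list, count: int) -> None:
    """Host-side model of the ADD reduction: dst[i] += src[i] for the first `count` elements."""
    for i in range(count):
        dst[i] += src[i]


# Usage: 64 bytes of float32 cover 16 elements.
count = bulk_reduce_element_count(0x1000, 0x2000, 64, "float32")
global_buf = [1.0] * count   # stands in for global memory
shared_buf = [0.5] * count   # stands in for shared CTA memory
model_reduce_add(global_buf, shared_buf, count)
```

Note that `size` is a byte count, not an element count, so the same 64-byte transfer covers 32 float16 elements but only 8 float64 elements.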

Parameters:

  • dtype (DType): Element data type of the reduction. Supported floating-point types are float16, bfloat16, float32, and float64.
  • reduction_kind (ReduceOp): The reduction operation to apply. Currently only ADD is supported.
  • eviction_policy (CacheEviction): Cache eviction policy for the L2 cache. Defaults to EVICT_NORMAL.

Args: