cp_async_bulk_reduce_global_shared_cta
cp_async_bulk_reduce_global_shared_cta[dtype: DType, /, *, reduction_kind: ReduceOp, eviction_policy: CacheEviction = CacheEviction.EVICT_NORMAL](dst_mem: UnsafePointer[Scalar[dtype], address_space=dst_mem.address_space], src_mem: UnsafePointer[Scalar[dtype], address_space=AddressSpace.SHARED], size: Int32)
Initiates an asynchronous bulk reduction from shared CTA memory into global memory.
Performs a non-blocking element-wise reduction of size bytes of shared
memory into the matching locations in global memory, using the PTX
cp.reduce.async.bulk instruction with the .bulk_group completion
mechanism. Use cp_async_bulk_commit_group and cp_async_bulk_wait_group
from std.gpu.sync to synchronize.
Both dst_mem and src_mem must be 16-byte aligned, and size must be a
multiple of 16. Requires sm_100 or higher.
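The issue, commit, wait sequence described above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation. The import path `std.gpu.memory`, the helper name `add_shared_tile_to_global`, and the `[0]` parameter passed to `cp_async_bulk_wait_group` are assumptions. Pointer types mirror the signature above and may need adjusting for your Mojo version. The sketch also assumes that a single thread calls the helper after all shared-memory writes have been synchronized and made visible to the asynchronous operation.

```mojo
# Assumed import paths: only std.gpu.sync is named in the text above.
from std.gpu.memory import (
    AddressSpace,
    ReduceOp,
    cp_async_bulk_reduce_global_shared_cta,
)
from std.gpu.sync import cp_async_bulk_commit_group, cp_async_bulk_wait_group


# Hypothetical helper: adds a float32 tile held in shared memory into global memory.
fn add_shared_tile_to_global(
    dst: UnsafePointer[Scalar[DType.float32]],
    src: UnsafePointer[
        Scalar[DType.float32], address_space = AddressSpace.SHARED
    ],
    num_bytes: Int32,  # byte count, must be a multiple of 16
):
    # Issue the non-blocking reduction: dst[i] += src[i] over num_bytes bytes.
    cp_async_bulk_reduce_global_shared_cta[
        DType.float32, reduction_kind = ReduceOp.ADD
    ](dst, src, num_bytes)

    # Close the current bulk group so it can be waited on.
    cp_async_bulk_commit_group()

    # Assumed form: wait until no committed bulk groups remain pending.
    cp_async_bulk_wait_group[0]()
```

Waiting before the shared buffer is reused keeps the source data intact while the reduction is still reading it.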
Parameters:
- dtype (DType): Element data type of the reduction. Supported floating-point types are float16, bfloat16, float32, and float64.
- reduction_kind (ReduceOp): The reduction operation to apply. Currently only ADD is supported.
- eviction_policy (CacheEviction): Cache eviction policy for the L2 cache. Defaults to EVICT_NORMAL.
Args:
- dst_mem (UnsafePointer[Scalar[dtype], address_space=dst_mem.address_space]): Destination pointer in global or generic memory (16-byte aligned).
- src_mem (UnsafePointer[Scalar[dtype], address_space=AddressSpace.SHARED]): Source pointer in shared CTA memory (16-byte aligned).
- size (Int32): Number of bytes to reduce (must be a multiple of 16).
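Because size is a byte count rather than an element count, it can help to derive it from the element count and check the 16-byte rule before launching. The helper below is a hypothetical sketch, not part of the library. The name `bulk_reduce_num_bytes` and its parameters are illustrative only.

```mojo
# Hypothetical helper: converts an element count into a valid byte count.
fn bulk_reduce_num_bytes(
    num_elements: Int, bytes_per_element: Int
) raises -> Int32:
    var num_bytes = num_elements * bytes_per_element
    # The size argument must be a multiple of 16 bytes.
    if num_bytes % 16 != 0:
        raise Error("bulk reduction size must be a multiple of 16 bytes")
    return Int32(num_bytes)


fn main() raises:
    # 256 float32 elements at 4 bytes each give 1024 bytes.
    print(bulk_reduce_num_bytes(256, 4))
```

For example, 6 float16 elements occupy 12 bytes, so the helper would raise instead of returning a size the instruction cannot accept.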