IMPORTANT: To view this page as Markdown, append `.md` to the URL (e.g. /docs/manual/basics.md). For the complete Mojo documentation index, see llms.txt.
Skip to main content
Version: 1.0.0b1
For the complete Mojo documentation index, see llms.txt. Markdown versions of all pages are available by appending .md to any URL (e.g. /docs/manual/basics.md).

cp_async_bulk_tensor_2d_gather4

cp_async_bulk_tensor_2d_gather4[dst_type: AnyType, mbr_type: AnyType, /, *, cta_group: Int = 1, eviction_policy: CacheEviction = CacheEviction.EVICT_NORMAL](dst_mem: UnsafePointer[dst_type, address_space=AddressSpace.SHARED], tma_descriptor: UnsafePointer[NoneType], mem_bar: UnsafePointer[mbr_type, address_space=AddressSpace.SHARED], col_idx: Int32, row0: Int32, row1: Int32, row2: Int32, row3: Int32)

Initiates an asynchronous gather4 copy of 4 non-contiguous rows from a 2D tensor in global memory into shared memory using TMA.

This is a Blackwell (SM100+) instruction that loads 4 rows at arbitrary row indices from a 2D tensor map, placing them contiguously in shared memory. The tensor map must be created with box dim1=1 (one row per tile). Each row is a full tile along the column dimension.

PTX: cp.async.bulk.tensor.2d.shared::cta.global.tile::gather4 .mbarrier::complete_tx::bytes

Constraints:

Requires SM100 (Blackwell) or newer GPU architecture.

Parameters:

  • dst_type (AnyType): The data type of the destination memory.
  • mbr_type (AnyType): The data type of the memory barrier.
  • cta_group (Int): The CTA group for the copy operation. Must be 1 or 2.
  • eviction_policy (CacheEviction): Cache eviction policy. Defaults to EVICT_NORMAL.

Args: