Version: Nightly

For the complete Mojo documentation index, see llms.txt. Markdown versions of all pages are available by appending .md to any URL (e.g. /docs/manual/basics.md).

GenericToSharedAsyncTileCopier

struct GenericToSharedAsyncTileCopier[thread_layout: Layout[thread_layout.shape_types, thread_layout.stride_types], *, swizzle: Optional[Swizzle] = None, masked: Bool = False, eviction_policy: CacheEviction = CacheEviction.EVICT_NORMAL, num_threads: Int = thread_layout.size(), thread_scope: ThreadScope = ThreadScope.BLOCK]

An AsyncTileCopier that asynchronously moves a tile from generic memory into shared memory using NVIDIA's cp.async instruction.

On NVIDIA GPUs (compute capability 8.0+), the copy issues cp.async instructions, allowing the transfer to overlap with subsequent compute. On AMD and Apple GPUs the underlying async_copy intrinsic falls back to synchronous loads and stores.

The copy is asynchronous on NVIDIA: callers must commit it via async_copy_commit_group() and synchronize via async_copy_wait_all() or async_copy_wait_group() before reading the destination tile.

The vector size in bytes (size_of[dtype]() * element_size) must be 4, 8, or 16.

Parameters

thread_layout (Layout[thread_layout.shape_types, thread_layout.stride_types]): Layout describing how threads are organized over the copy.
swizzle (Optional[Swizzle]): Optional swizzle applied to the shared-memory destination for bank-conflict mitigation. None produces a straight copy. Subsequent readers of the tile must use the same swizzle.
masked (Bool): When True, performs per-vector bounds-checking against src.dim[0]() * row_stride. Vectors that fall past the bound issue zero-byte cp.async operations with fill=0, which the hardware fulfills by zeroing the destination bytes. Intended for source tiles whose row count is dynamic (e.g. attention prefill loading the tail of a sequence).
eviction_policy (CacheEviction): Cache eviction policy for the source data.
num_threads (Int): Total number of threads in the thread block. Threads beyond thread_layout.size() do not participate.
thread_scope (ThreadScope): Scope at which thread operations are performed (BLOCK or WARP). Defaults to ThreadScope.BLOCK.

Implemented traits

AnyType, AsyncTileCopier, Copyable, ImplicitlyCopyable, ImplicitlyDeletable, Movable

`comptime` members

`dst_address_space`

comptime dst_address_space = AddressSpace.SHARED

Destination AddressSpace this copier writes to.

`src_address_space`

comptime src_address_space = AddressSpace.GENERIC

Source AddressSpace this copier reads from.

Methods

`copy`

def copy(self, dst: TileTensor[Storage=dst.Storage, address_space=Self.dst_address_space, linear_idx_type=dst.linear_idx_type], src: TileTensor[Storage=src.Storage, linear_idx_type=src.linear_idx_type])

Asynchronously copies src in generic memory into dst in shared memory.

The copy is issued via cp.async on NVIDIA. Callers must commit and wait on the copy before using the destination tile.

This satisfies the AsyncTileCopier trait; the masked bound is derived from src.dim[0](). For an explicit-bound copy (a src whose row dim is static), call copy_bounded directly.

Args:

dst (TileTensor[Storage=dst.Storage, address_space=Self.dst_address_space, linear_idx_type=dst.linear_idx_type]): Destination tile in shared memory.
src (TileTensor[Storage=src.Storage, linear_idx_type=src.linear_idx_type]): Source tile in generic memory.

`copy_bounded`

def copy_bounded(self, dst: TileTensor[Storage=dst.Storage, address_space=Self.dst_address_space, linear_idx_type=dst.linear_idx_type], src: TileTensor[Storage=src.Storage, linear_idx_type=src.linear_idx_type], src_num_valid_rows: OptionalReg[Int])

Asynchronously copies src into dst with an optional explicit masked-bound override.

Identical to copy except for the masked-bound source. This is NOT a trait method (the AsyncTileCopier trait fixes the copy signature); it is the explicit-bound entry point.

Args:

dst (TileTensor[Storage=dst.Storage, address_space=Self.dst_address_space, linear_idx_type=dst.linear_idx_type]): Destination tile in shared memory.
src (TileTensor[Storage=src.Storage, linear_idx_type=src.linear_idx_type]): Source tile in generic memory.
src_num_valid_rows (OptionalReg[Int]): Explicit valid-row count for the masked bound. When None, the masked bound is derived from src.dim[0]() (byte-identical to the legacy behavior). When provided, it overrides src.dim[0]() in the src_idx_bound = rows * row_stride - src_frag_offset computation; everything else is unchanged. This lets callers whose src carries a static row dim (e.g. a TileTensor.tile sub-view, which does not runtime-clip dim0) still drive a correct partial-tile zero-fill by passing the runtime clip directly. Only consulted when masked is True.

Parameters​

Implemented traits​

comptime members​

dst_address_space​

src_address_space​

Methods​

copy​

copy_bounded​