For the complete Mojo documentation index, see llms.txt. Markdown versions of all pages are available by appending .md to any URL (e.g. /docs/manual/basics.md).
GenericToSharedAsyncTileCopier
struct GenericToSharedAsyncTileCopier[thread_layout: Layout[thread_layout.shape_types, thread_layout.stride_types], *, swizzle: Optional[Swizzle] = None, masked: Bool = False, eviction_policy: CacheEviction = CacheEviction.EVICT_NORMAL, num_threads: Int = thread_layout.size(), thread_scope: ThreadScope = ThreadScope.BLOCK]
An AsyncTileCopier that asynchronously moves a tile from generic memory into shared memory using NVIDIA's cp.async instruction.
On NVIDIA GPUs (compute capability 8.0+), the copy issues cp.async
instructions, allowing the transfer to overlap with subsequent compute.
On AMD and Apple GPUs the underlying async_copy intrinsic falls back
to synchronous loads and stores.
The copy is asynchronous on NVIDIA: callers must commit it via
async_copy_commit_group() and synchronize via async_copy_wait_all()
or async_copy_wait_group() before reading the destination tile.
The vector size in bytes (size_of[dtype]() * element_size) must be
4, 8, or 16.
Parameters
- thread_layout (
Layout[thread_layout.shape_types, thread_layout.stride_types]): Layout describing how threads are organized over the copy. - swizzle (
Optional[Swizzle]): Optional swizzle applied to the shared-memory destination for bank-conflict mitigation.Noneproduces a straight copy. Subsequent readers of the tile must use the same swizzle. - masked (
Bool): WhenTrue, performs per-vector bounds-checking againstsrc.dim[0]() * row_stride. Vectors that fall past the bound issue zero-bytecp.asyncoperations withfill=0, which the hardware fulfills by zeroing the destination bytes. Intended for source tiles whose row count is dynamic (e.g. attention prefill loading the tail of a sequence). - eviction_policy (
CacheEviction): Cache eviction policy for the source data. - num_threads (
Int): Total number of threads in the thread block. Threads beyondthread_layout.size()do not participate. - thread_scope (
ThreadScope): Scope at which thread operations are performed (BLOCKorWARP). Defaults toThreadScope.BLOCK.
Implemented traits
AnyType,
AsyncTileCopier,
Copyable,
ImplicitlyCopyable,
ImplicitlyDeletable,
Movable
comptime members
dst_address_space
comptime dst_address_space = AddressSpace.SHARED
Destination AddressSpace this copier writes to.
src_address_space
comptime src_address_space = AddressSpace.GENERIC
Source AddressSpace this copier reads from.
Methods
copy
def copy[element_size: Int](self, dst: TileTensor[Storage=dst.Storage, address_space=Self.dst_address_space, linear_idx_type=dst.linear_idx_type, element_size=element_size], src: TileTensor[Storage=src.Storage, linear_idx_type=src.linear_idx_type, element_size=element_size])
Asynchronously copies src in generic memory into dst in shared memory.
The copy is issued via cp.async on NVIDIA. Callers must commit
and wait on the copy before using the destination tile.
This satisfies the AsyncTileCopier trait; the masked bound is
derived from src.dim[0](). For an explicit-bound copy (a src
whose row dim is static), call copy_bounded directly.
Parameters:
- element_size (
Int): Number of scalar elements per logical element.
Args:
- dst (
TileTensor[Storage=dst.Storage, address_space=Self.dst_address_space, linear_idx_type=dst.linear_idx_type, element_size=element_size]): Destination tile in shared memory. - src (
TileTensor[Storage=src.Storage, linear_idx_type=src.linear_idx_type, element_size=element_size]): Source tile in generic memory.
copy_bounded
def copy_bounded[element_size: Int](self, dst: TileTensor[Storage=dst.Storage, address_space=Self.dst_address_space, linear_idx_type=dst.linear_idx_type, element_size=element_size], src: TileTensor[Storage=src.Storage, linear_idx_type=src.linear_idx_type, element_size=element_size], src_num_valid_rows: OptionalReg[Int])
Asynchronously copies src into dst with an optional explicit masked-bound override.
Identical to copy except for the masked-bound source. This is NOT a
trait method (the AsyncTileCopier trait fixes the copy signature);
it is the explicit-bound entry point.
Parameters:
- element_size (
Int): Number of scalar elements per logical element.
Args:
- dst (
TileTensor[Storage=dst.Storage, address_space=Self.dst_address_space, linear_idx_type=dst.linear_idx_type, element_size=element_size]): Destination tile in shared memory. - src (
TileTensor[Storage=src.Storage, linear_idx_type=src.linear_idx_type, element_size=element_size]): Source tile in generic memory. - src_num_valid_rows (
OptionalReg[Int]): Explicit valid-row count for the masked bound. WhenNone, the masked bound is derived fromsrc.dim[0]()(byte-identical to the legacy behavior). When provided, it overridessrc.dim[0]()in thesrc_idx_bound = rows * row_stride - src_frag_offsetcomputation; everything else is unchanged. This lets callers whosesrccarries a static row dim (e.g. aTileTensor.tilesub-view, which does not runtime-clip dim0) still drive a correct partial-tile zero-fill by passing the runtime clip directly. Only consulted whenmaskedisTrue.