Version: 1.0.0b2

copy_dram_to_sram_async

def copy_dram_to_sram_async[thread_layout: Layout[thread_layout.shape_types, thread_layout.stride_types], *, swizzle: Optional[Swizzle] = None, masked: Bool = False, eviction_policy: CacheEviction = CacheEviction.EVICT_NORMAL, num_threads: Int = thread_layout.size(), thread_scope: ThreadScope = ThreadScope.BLOCK, element_size: Int](dst: TileTensor[address_space=AddressSpace.SHARED, linear_idx_type=dst.linear_idx_type, element_size=element_size], src: TileTensor[linear_idx_type=src.linear_idx_type, element_size=element_size], src_num_valid_rows: OptionalReg[Int] = None)

Asynchronously copies a tile from DRAM (generic memory) to SRAM (shared memory).

Delegates to GenericToSharedAsyncTileCopier, which issues NVIDIA cp.async instructions (falling back to synchronous loads/stores on AMD and Apple GPUs). The copy is asynchronous: callers must commit it via async_copy_commit_group() and synchronize via async_copy_wait_all() or async_copy_wait_group() before reading the destination tile.
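The commit-then-wait ordering described above can be illustrated with a small pure-Python model. This is only a sketch of cp.async-style group bookkeeping, not the Mojo API: the class `AsyncCopyModel` and its methods are hypothetical names, and the model assumes the Mojo wrappers mirror the PTX semantics (a wait on group count `n` returns once at most `n` committed groups are still pending, and wait-all behaves like a commit followed by a wait on 0).

```python
from collections import deque


class AsyncCopyModel:
    """Toy model of cp.async-style group tracking (hypothetical, not the Mojo API)."""

    def __init__(self):
        self.uncommitted = []   # copies issued since the last commit
        self.pending = deque()  # committed groups, oldest first
        self.shared = {}        # stands in for the shared-memory destination

    def copy_async(self, key, value):
        # Issue a copy; the destination is NOT guaranteed valid yet.
        self.uncommitted.append((key, value))

    def commit_group(self):
        # Bundle all copies issued so far into one group.
        self.pending.append(self.uncommitted)
        self.uncommitted = []

    def wait_group(self, n):
        # Complete the oldest groups until at most n remain pending.
        while len(self.pending) > n:
            for key, value in self.pending.popleft():
                self.shared[key] = value

    def wait_all(self):
        # Assumed PTX-like behavior: commit anything outstanding, then wait on 0.
        self.commit_group()
        self.wait_group(0)


model = AsyncCopyModel()
model.copy_async("tile0", [1, 2])
model.commit_group()
model.copy_async("tile1", [3, 4])
model.commit_group()

model.wait_group(1)  # tile0 is now safe to read; tile1 may still be in flight
tile0_ready = "tile0" in model.shared
tile1_ready = "tile1" in model.shared

model.wait_all()     # everything is now safe to read
```

The model is deliberately conservative: on real hardware the data may land earlier, but it is only guaranteed to be visible after the wait, which is why the destination tile must not be read before then.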

Unlike the legacy free function, whose swizzle: Bool parameter auto-derived an ldmatrix swizzle, this wrapper takes an Optional[Swizzle] directly and does not replicate that auto-derivation. Pass an explicit swizzle (for example, one built with make_swizzle[..., access_size=element_size]) when one is required.

Parameters:

thread_layout (Layout[thread_layout.shape_types, thread_layout.stride_types]): Layout describing how threads are organized over the copy.
swizzle (Optional[Swizzle]): Optional swizzle applied to the shared-memory destination for bank-conflict mitigation. None produces a straight copy. Subsequent readers of the tile must use the same swizzle.
masked (Bool): When True, performs per-vector bounds-checking; vectors past the bound issue zero-filling cp.async operations.
eviction_policy (CacheEviction): Cache eviction policy for the source data.
num_threads (Int): Total number of threads in the thread block. Threads beyond thread_layout.size() do not participate.
thread_scope (ThreadScope): Scope at which thread operations are performed.
element_size (Int): Number of scalar elements per logical element; inferred from the source and destination tiles.
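The requirement that readers use the same swizzle as the copy can be seen in a minimal Python sketch. The function `xor_swizzle` below is an illustrative XOR-style swizzle with assumed parameters (`bits`, `base`, `shift`); it is not claimed to match the Swizzle type's exact definition, and `store_tile` and `load_tile` are hypothetical helpers standing in for the copy and a later read.

```python
def xor_swizzle(offset, bits=2, base=0, shift=3):
    """Illustrative XOR swizzle: fold `bits` higher bits into lower bits of the offset."""
    mask = ((1 << bits) - 1) << (base + shift)
    return offset ^ ((offset & mask) >> shift)


def store_tile(src, swizzle=None):
    """Write src into a flat 'shared' buffer, optionally permuting offsets."""
    shared = [None] * len(src)
    for i, value in enumerate(src):
        shared[swizzle(i) if swizzle else i] = value
    return shared


def load_tile(shared, swizzle=None):
    """Read logical element i from the offset the given swizzle maps it to."""
    return [shared[swizzle(i) if swizzle else i] for i in range(len(shared))]


src = list(range(32))
shared = store_tile(src, xor_swizzle)

matched = load_tile(shared, xor_swizzle)  # same swizzle: logical order recovered
mismatched = load_tile(shared, None)      # no swizzle: elements come back permuted
straight = store_tile(src)                # swizzle=None is a straight copy
```

Reading with the same swizzle recovers the logical tile, while reading without it returns the same values in a permuted order, which is the failure mode the swizzle parameter description warns about.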

Args:

dst (TileTensor[address_space=AddressSpace.SHARED, linear_idx_type=dst.linear_idx_type, element_size=element_size]): Destination tile in shared memory.
src (TileTensor[linear_idx_type=src.linear_idx_type, element_size=element_size]): Source tile in generic memory.
src_num_valid_rows (OptionalReg[Int]): Explicit valid-row count for the masked bound. When None (the default), the bound is derived from src.dim[0](), which is byte-identical to the legacy behavior. When provided, it overrides src.dim[0]() so that a src with a static row dimension (e.g. a TileTensor.tile sub-view) can still drive a correct partial-tile zero-fill. Only consulted when masked is True.
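The effect of src_num_valid_rows on the masked bound can be sketched in plain Python. This is a simplified model under stated assumptions, not the library's implementation: `masked_tile_copy` is a hypothetical helper, bounds are checked per row rather than per vector, and out-of-bounds source data is represented by the sentinel value -1.

```python
def masked_tile_copy(src, masked=True, src_num_valid_rows=None):
    """Model a masked tile copy: rows at or past the bound are zero-filled.

    src is a list of rows whose length plays the role of the static row dimension.
    """
    if not masked:
        # src_num_valid_rows is only consulted when masked is True.
        return [list(row) for row in src]
    # Default bound comes from the source's own row count; an explicit count overrides it.
    bound = len(src) if src_num_valid_rows is None else src_num_valid_rows
    return [
        list(row) if r < bound else [0] * len(row)
        for r, row in enumerate(src)
    ]


# A 4x2 sub-view with a static row count of 4, where only 3 rows lie inside
# the parent tensor; -1 marks data that must not reach shared memory.
view = [[1, 2], [3, 4], [5, 6], [-1, -1]]

default_copy = masked_tile_copy(view)                        # bound = 4, garbage row copied
bounded_copy = masked_tile_copy(view, src_num_valid_rows=3)  # bound = 3, last row zero-filled
```

With the default bound, the static row dimension hides the fact that the last row is invalid; supplying the explicit count restores the correct partial-tile zero-fill.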