copy_local_to_local
copy_local_to_local(dst: LayoutTensor[dst.dtype, dst.layout, dst.origin, address_space=dst.address_space, element_layout=dst.element_layout, layout_int_type=dst.layout_int_type, linear_idx_type=dst.linear_idx_type, masked=dst.masked, alignment=dst.alignment], src: LayoutTensor[src.dtype, src.layout, src.origin, address_space=src.address_space, element_layout=src.element_layout, layout_int_type=src.layout_int_type, linear_idx_type=src.linear_idx_type, masked=src.masked, alignment=src.alignment])
Synchronously copy data between local memory (register) tensors with type conversion.
This function performs a synchronous copy operation between register tensors in a GPU context, with support for converting from float32 to half-precision formats (bfloat16/float16). It's particularly optimized for specific tensor layouts commonly used in matrix multiplication operations.
Example:
from layout import LayoutTensor, Layout
from layout.layout_tensor import copy_local_to_local
# Assumed import path for AddressSpace; it may differ between Mojo releases.
from gpu.memory import AddressSpace

def kernel():
    ...
    # Source fragment: float32 values held in registers.
    var src_reg = LayoutTensor[DType.float32,
        Layout.row_major(16, 8),
        MutAnyOrigin,
        address_space = AddressSpace.LOCAL,
    ].stack_allocation().fill(1)
    # Destination fragment: same layout, bfloat16 element type.
    var dst_reg = LayoutTensor[DType.bfloat16,
        Layout.row_major(16, 8),
        MutAnyOrigin,
        address_space = AddressSpace.LOCAL,
    ].stack_allocation()

    # Process data in float32 registers
    # ...

    # Convert and copy to bfloat16 registers
    copy_local_to_local(dst_reg, src_reg)
Performance:
- Optimized for specific 2D tensor layouts with contiguous inner dimensions.
- Special fast path for 2D tensors with specific layouts used in matrix multiplication.
- For MMA (Matrix Multiply-Accumulate) operations, efficiently handles the conversion between output fragments and input fragments with different layouts.
- Falls back to element-wise copy for general cases.
Notes:
- Both source and destination tensors must be in LOCAL address space (registers).
- This function currently only supports copying from float32 to half-precision formats.
- For 2D tensors with stride[1] == 1, a specialized fast path is used that's optimized for matrix multiplication patterns.
- This function is particularly useful in GPU kernels for converting between different precision formats while keeping data in registers.
Constraints:
- Destination tensor must be in LOCAL address space.
- Source tensor must be in LOCAL address space.
- Destination tensor must have a half-precision floating-point data type.
- Source tensor must have float32 data type.
- Both tensors must have the same total size.
Args:
- dst (LayoutTensor): The destination tensor, which must be in local memory (registers) and have a half-precision floating-point data type (bfloat16 or float16).
- src (LayoutTensor): The source tensor, which must be in local memory (registers) and have float32 data type.
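The arguments and constraints above also allow a float16 destination. The following is a minimal sketch of that case, not a definitive usage pattern. It reuses the same calls as the example above and is meant to run inside a GPU kernel. The function name convert_to_float16 is hypothetical, and the import path for AddressSpace is an assumption that may differ between Mojo releases.

```mojo
from layout import LayoutTensor, Layout
from layout.layout_tensor import copy_local_to_local
# Assumed import path for AddressSpace; adjust for your Mojo release.
from gpu.memory import AddressSpace

# Hypothetical kernel body, for illustration only.
def convert_to_float16():
    # Source: float32 register tensor, as required for src.
    var src_reg = LayoutTensor[DType.float32,
        Layout.row_major(16, 8),
        MutAnyOrigin,
        address_space = AddressSpace.LOCAL,
    ].stack_allocation().fill(1)

    # Destination: float16 register tensor with the same total size
    # (16 * 8 elements) and the same row-major layout.
    var dst_reg = LayoutTensor[DType.float16,
        Layout.row_major(16, 8),
        MutAnyOrigin,
        address_space = AddressSpace.LOCAL,
    ].stack_allocation()

    # Convert float32 -> float16 while keeping the data in registers.
    copy_local_to_local(dst_reg, src_reg)
```

Both tensors here use a row-major layout, so the inner stride is 1, which is the condition the notes give for the specialized fast path. The sketch keeps the layouts identical on both sides to stay within what the constraints clearly permit.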