mma
mma[block_size: Int = 1](mut d: SIMD, a: SIMD, b: SIMD, c: SIMD)
Performs a warp-synchronous, Tensor Core-based matrix multiply-accumulate (MMA) operation.
This function executes a matrix multiply-accumulate operation using GPU Tensor Cores, synchronizing across the warp. It dispatches to architecture-specific implementations for NVIDIA and AMD GPUs.
The operation performed is: d = (a * b) + c
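To make the arithmetic concrete, here is a minimal CPU-side sketch of the same multiply-accumulate in plain Python. It is not the Mojo API: `mma_reference` is a hypothetical helper introduced only for illustration, and it works on whole matrices stored as nested lists, whereas the GPU operation takes SIMD operands and is executed jointly by the threads of a warp. Here `a * b` is read as a matrix product.

```python
def mma_reference(a, b, c):
    """Return d = (a @ b) + c for row-major nested lists (hypothetical helper)."""
    m, k = len(a), len(a[0])
    n = len(b[0])
    # Shapes must be compatible: a is m x k, b is k x n, c is m x n.
    assert len(b) == k
    assert len(c) == m and all(len(row) == n for row in c)

    # Start from the accumulator c, then add each product term.
    d = [[c[i][j] for j in range(n)] for i in range(m)]
    for i in range(m):
        for j in range(n):
            for p in range(k):
                d[i][j] += a[i][p] * b[p][j]
    return d


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
c = [[1.0, 1.0], [1.0, 1.0]]
print(mma_reference(a, b, c))  # [[20.0, 23.0], [44.0, 51.0]]
```

A reference like this can be useful for checking the output of a GPU kernel on small inputs, keeping in mind that reduced-precision formats may round differently from Python floats.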
Supported configurations depend on the GPU architecture:
- NVIDIA: Various combinations of FP32, FP16, BF16, and FP8 formats
- AMD: Limited subset of FP32 and FP16 operations
Note:
- All threads in a warp must execute this operation together
- Input matrices must be properly loaded and formatted for Tensor Core operations
- Matrix dimensions and data types must match hardware requirements
Parameters:
- block_size (Int): The size of the block of the MMA operation (e.g., 4x4x4_16B). Applies to AMD GPUs only.
Args: