# Mojo v1.0.0b1

## ✨ Highlights

### Documentation
- The compilation targets doc explains how to inspect your current platform, select a target configuration, and generate code for that target. Use it to build for your own system or to target other CPUs, operating systems, and accelerators.
- The Mojo language reference covers lexical elements, expressions, statements, numeric types, struct declarations, and trait declarations.
- The functions reference page improves discoverability of new function features.
- Split the operators manual into separate pages; refreshed its coverage and added a tutorial, operator tests, and a new reference page.
- Added negative examples and error messages to reference pages to highlight sharp edges of the language.
- The MLIR reference page introduces developers to writing inline MLIR in Mojo code.
- Added docs for non-nullable pointers, with sample code showing how to use `Optional` with `UnsafePointer`.
- Separated the Mojo layout library docs from the MAX kernels library docs, reflecting that the layout library ships with the `mojo` package while the rest of the kernels library ships with the `max` package.
- Added a new guide for building Mojo packages, currently covering the `rattler-build` workflow.
- Restructured the Mojo and MAX system requirements and GPU compatibility docs. Replaced the three-tier GPU support model with a two-level "Continuously tested" / "Known compatible" taxonomy, added a dedicated Mojo GPU compatibility page with per-vendor hardware tables and driver requirements, and simplified the main requirements pages.
- Added a Mojo manual reference page for the `@doc_hidden` decorator.
## Language enhancements
- Added type refinement based on compile-time assumptions, enabling Mojo to narrow types from `where` clauses, `comptime if` statements, and `comptime assert` statements. Refinements in a scope are driven by `conforms_to()` expressions.

  Before:

  ```mojo
  def __contains__(self, value: Self.T) -> Bool where conforms_to(Self.T, Equatable):
      for item in self:
          if trait_downcast[Equatable](item) == trait_downcast[Equatable](value):
              return True
      return False
  ```

  After:

  ```mojo
  def __contains__(self, value: Self.T) -> Bool where conforms_to(Self.T, Equatable):
      for item in self:
          if item == value:
              return True
      return False
  ```

- Improved the clarity and UX of diagnostics for the parser errors that newcomers most often hit.
- Updated signature error diagnostics and added related tests.
- Mojo now uses `NoneType` instead of an empty tuple to mark constructors using literals.

- The ternary `if`/`else` expression now coerces each element to its contextual type when it is obvious. For example, this works instead of producing an error about incompatible metatypes:

  ```mojo
  comptime some_type: Movable = Int if cond else String
  ```

- Unified closure improvements. This release continues the closure unification work begun in earlier releases: stateless closures auto-lift, the `ref` capture convention is supported, default capture conventions can be combined with explicit capture lists, and a new `thin` function effect distinguishes function pointer types from closure traits.

  ```mojo
  def main() raises:
      var a, b, c, d = 1, 2, 3, 4
      var x = "hello"

      # Legacy closure: no capture list. Cannot capture variables.
      def hello():
          print("hi")

      # Unified closure with no captures (stateless). Stateless closures
      # lift to top-level functions and can be passed as FFI callbacks.
      def add_one(n: Int) {} -> Int:
          return n + 1

      # Unified closure with explicit captures and a default capturing
      # convention:
      def my_fn() {mut a, b, c^, read}:
          # capture:
          # `a` by mut reference
          # `b` by immut reference
          # `c` by moving
          # `d` by immut reference (the default `read` convention)
          use(a, b, c, d)

      # Unified closure that captures `x` by ref (carries an
      # origin-mutability parameter):
      def show_x() {ref x}:
          print(x)

      # Function effects come before the capture list. The calling context
      # must handle errors raised from a `raises` closure.
      def fallible() raises {}:
          raise Error("nope")

      # Closures are invoked like ordinary functions:
      hello()
      print(add_one(41))
      my_fn()
      show_x()
      try:
          fallible()
      except e:
          print(e)

      # The `thin` function effect declares a function pointer type
      # (distinct from a closure trait). Stateless closures and top-level
      # functions satisfy `thin`:
      var fn_ptr: def(Int) thin -> Int = add_one
      print(fn_ptr(99))
  ```

- Added `abi("C")` as a function effect for declaring the C calling convention on function definitions and function pointer types. Functions marked with `abi("C")` use the platform C ABI (System V x86-64 / ARM64 AAPCS) for struct arguments and return values, enabling safe interop with C libraries:

  ```mojo
  # C-ABI function definition (safe as a callback into C code)
  def add(a: Int32, b: Int32) abi("C") -> Int32:
      return a + b

  # C-ABI function pointer type (safe for use with DLHandle.get_function)
  var f = handle.get_function[def(Float64) abi("C") -> Float64]("sqrt")
  ```

  `DLHandle.get_function[]` now enforces that the type parameter carries `abi("C")`, preventing silent ABI mismatches when loading C symbols.

- String literals now support `\uXXXX` and `\UXXXXXXXX` Unicode escape sequences, matching Python. The resulting code point is stored as UTF-8. Invalid code points and surrogates are rejected at parse time.

- Added support for conditional `RegisterPassable` conformance.

- Variadic lists and packs can be forwarded through runtime calls with `*pack` when the callee takes a compatible variadic list/pack.

  ```mojo
  def callee[*Ts: Writable](*args: *Ts):
      comptime for i in range(args.__len__()):
          print(args[i])

  def forwarder[*Ts: Writable](*args: *Ts):
      callee(*args)

  forwarder(1, "hello", 3.14)  # prints each value on a separate line
  ```

- Heterogeneous variadic packs can now be specified with a `SomeType` helper function. These two are equivalent:

  ```mojo
  def foo[*arg_types: Copyable](*args: *arg_types) -> Int: ...
  def foo(*args: *SomeTypeList[Copyable]) -> Int: ...
  ```

- T-strings can now be used in `comptime assert` messages:

  ```mojo
  def foo[i: Int]():
      comptime assert i > 5, t"expected i > 5, got {i}"
  ```

- Added `__mlir_deferred_type[...]` for declaring parameter-dependent return types and operation result types in inline MLIR. The parser combines the bracketed pieces into a type string, and the elaborator builds the concrete MLIR type once parameters are substituted.
## Language changes
- Variadic parameter lists are now passed as `ParameterList`, and type lists as `TypeList`, instead of `!kgen.param_list`. This makes it much more ergonomic to work with these types, e.g. simple logic just works:

  ```mojo
  def callee[*values: Int]():
      var v = 0
      for i in range(len(values)):
          v += values[i]
      for elt in values:
          v += elt
  ```

  Similarly, the `ParameterList`/`TypeList` structs have other methods for transforming the value list. As such, a variety of values from the `Variadic` struct have started moving over to being members of these types.

- All Mojo functions now have a unique "function literal type". In practice, this means that:

  ```mojo
  # type_of(foo) != type_of(bar)
  def foo(): pass
  def bar(): pass
  ```

- The `fn` keyword for function declarations is deprecated. Mojo now emits a compiler warning on uses of `fn`; this will become a compilation error in the next release. Use `def` instead.

- The `unified` keyword has been removed; specify unified-closure semantics with an explicit capture list `{...}` after the function signature. An empty capture list `{}` denotes a unified closure with no captures. Closures without any capture list are legacy closures. Mojo also now warns when a function pointer type omits the `thin` effect; specify `thin` explicitly to silence the warning.

- Removed support for comparing tuples of differing lengths or types. Such comparisons (for example `(1, 2) != (4, 5, 6)`) are now rejected statically by the type system instead of silently returning not-equal.

- `A if comptime(C) else B` now skips elaboration of the dead branch, treating the ternary expression as a compile-time evaluation contract analogous to `comptime if C: A else: B`.

- `@explicit_destroy` is now rejected at parse time when paired with an unconditional `ImplicitlyDestructible` conformance; it remains valid only on conditional (where-clause-constrained) conformances.

- Import statements of the form `from pkg import ...` no longer make `pkg` available to the module.
## Library changes
- Removed explicit `trait_downcast`/`trait_downcast_var` across the standard library sources, now that Mojo applies type refinement from comptime assumptions. Public APIs are unchanged. Updated files:
  - `stdlib/std/builtin/`: `_stubs.mojo`, `bool.mojo`
  - `stdlib/std/collections/`: `deque.mojo`, `dict.mojo`, `inline_array.mojo`, `linked_list.mojo`, `list.mojo`, `optional.mojo`, `set.mojo`
  - `stdlib/std/iter/__init__.mojo`, `stdlib/std/itertools/itertools.mojo`
  - `stdlib/std/memory/`: `arc_pointer.mojo`, `owned_pointer.mojo`, `span.mojo`
- Consolidated the reflection APIs in `std.reflection` behind a unified entry point `reflect[T]()` returning a `Reflected[T]` handle. `reflect` is auto-imported via the prelude, so it is available without an explicit import. Methods on the handle replace the family of `struct_field_*` free functions (dropping the `struct_` prefix, since only structs have fields) and the `get_type_name`/`get_base_type_name` free functions:

  ```mojo
  struct Point:
      var x: Int
      var y: Float64

  def main():
      comptime r = reflect[Point]()
      print(r.name())            # "Point"
      print(r.field_count())     # 2
      print(r.field_names()[0])  # x
      comptime y_type = r.field_type["y"]()  # Reflected[Float64]
      print(y_type.name())       # "SIMD[DType.float64, 1]"
      print(reflect[List[Int]]().base_name())  # "List"
      var v: y_type.T = 3.14
  ```

  Methods on `Reflected[T]`: `name[qualified_builtins=]`, `base_name`, `is_struct`, `field_count`, `field_names`, `field_types`, `field_index[name]`, `field_type[name]`, `field_offset[name=]`/`[index=]`, and `field_ref[idx](s)`. The `field_type[name]()` method returns a `Reflected[FieldT]`, so reflection is fully composable.

  The legacy free functions (`struct_field_count`, `struct_field_names`, `struct_field_types`, `struct_field_index_by_name`, `struct_field_type_by_name`, `struct_field_ref`, `is_struct_type`, `offset_of`, `get_type_name`, `get_base_type_name`) and the `ReflectedType[T]` wrapper are now `@deprecated` and delegate to the new API. They will be removed in a future release.

- Added `struct_field_ref[idx, T](ref s)` to `std.reflection` for accessing struct fields by index without copying. The function returns a reference with the same mutability as `s` and works with both concrete and generic struct types, including parametric indices in `comptime for` loops. The default implementations of `Hashable`, `Equatable`, and `Writable` now use this library function instead of the `__struct_field_ref` magic.

- The `Boolable`, `Defaultable`, and `Writable` traits no longer inherit from `ImplicitlyDestructible`. Generic code that relied on receiving the destructor bound transitively through these traits must now spell it out explicitly, for example `T: Writable & ImplicitlyDestructible`.

- The `Variadic` suite of low-level operations has been refactored and migrated to being members of the `TypeList` and `ParameterList` types, making them more ergonomic to work with and more accessible.

- Atomic operations have moved to a dedicated `std.atomic` module. The `Consistency` type has been renamed to `Ordering`, and its `MONOTONIC` member has been renamed to `RELAXED` to align with conventions used by other languages. Update existing code as follows:

  ```mojo
  # Before
  from std.os import Atomic
  from std.os.atomic import Atomic, Consistency, fence
  _ = atom.load[ordering=Consistency.MONOTONIC]()

  # After
  from std.atomic import Atomic, Ordering, fence
  _ = atom.load[ordering=Ordering.RELAXED]()
  ```

- Added `Optional.destroy_with(destroy_func)`, which destroys an `Optional[T]` in place using a caller-provided destructor. This enables `Optional` to hold element types that are not `ImplicitlyDestructible` (for example, types marked `@explicit_destroy`), mirroring `Variant.destroy_with`. Both `Optional.destroy_with` and `Variant.destroy_with` now accept closures that capture local state in addition to plain function references. Note that `Variant.destroy_with` callers must now pass the destroyed type explicitly (for example, `v^.destroy_with[Int](destroy_func)`) since `T` can no longer be inferred from the closure type.

- `assert_raises` now catches custom `Writable` error types, not just `Error`.

- Added UAX #29 grapheme cluster segmentation to `String` and `StringSlice`. New APIs: `graphemes()` returns a `GraphemeSliceIter` that yields each user-perceived "character" as a `StringSlice`, and `count_graphemes()` returns the grapheme cluster count. This correctly handles combining marks, emoji ZWJ sequences, flag emoji, Hangul syllables, and other multi-codepoint clusters.

- `StringSlice` now supports slicing by grapheme cluster via the `grapheme=` keyword argument, mirroring the existing `byte=` indexer. For example, `s[grapheme=0:3]` returns a `StringSlice` covering the first three grapheme clusters, and `s[grapheme=i:i+1]` extracts the i-th grapheme. Out-of-range ends are clamped to the end of the string; negative indices are not supported. Because grapheme boundaries are discovered by a forward scan, this operation is O(n) in the byte length; prefer `byte=` slicing when you already have byte offsets.

- `GraphemeSliceIter` exposes a new `remaining_byte_length()` accessor that reports the byte length of the iterator's remaining range in O(1). This lets callers compute how many bytes the iterator has produced so far without summing per-grapheme byte lengths, and is now used by `StringSlice[grapheme=...]` for an ~1.4x speedup on ASCII-heavy inputs.

- `GraphemeSliceIter` now supports reverse iteration. `next_back()` and `peek_back()` return the last grapheme cluster in the remaining range, and `StringSlice.graphemes_reversed()`/`String.graphemes_reversed()` return a `GraphemeSliceIter` whose `for`-loop iteration walks clusters from end to start. `next()` and `next_back()` can be interleaved on the same iterator. Reverse iteration costs more per cluster than forward iteration because the UAX #29 state machine is inherently forward-scanning: `next_back()` backs up to a guaranteed grapheme boundary (the start of the string or a Control/CR/LF codepoint) and rescans forward. The safe boundary is cached across reverse calls (a forward `next()` invalidates it), so per-call cost is dominated by forward-scan length: small in text containing line breaks or whitespace, growing with the distance back to such a codepoint in long runs without them.

- Added grapheme-aware algorithms on `String` and `StringSlice`:
  - `grapheme_indices()` returns a `GraphemeIndicesIter` that yields `(byte_offset, grapheme)` pairs, mirroring Rust's `str::grapheme_indices`. Useful for text editors or UIs that need to map cursor byte positions back to grapheme boundaries.
  - `nth_grapheme(n)` returns the n-th grapheme cluster as an `Optional[StringSlice]`, or `None` when `n` is out of range.
  - `split_at_grapheme(n)` returns `Tuple[StringSlice, StringSlice]` holding the prefix `[0, n)` and suffix `[n, count)` of grapheme clusters in a single pass, clamping `n` to the total count.
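  A short sketch tying the grapheme APIs above together (illustrative; the printed counts follow the UAX #29 behavior described in these entries, and the `\u` escape relies on the new Unicode escape support):

  ```mojo
  def main():
      # "a" + U+0301 (combining acute) renders as one user-perceived "á":
      # four codepoints, but only three grapheme clusters.
      var s = String("a\u0301bc")
      print(s.count_codepoints())  # 4
      print(s.count_graphemes())   # 3
      for g in s.graphemes():      # yields "á", "b", "c" as StringSlices
          print(g)
      print(s.nth_grapheme(0).value())  # the full "á" cluster
  ```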
- `count_graphemes()` now takes a fast path over runs of printable ASCII (U+0020..U+007E). Each such byte has GBP `Other`, and two consecutive safe-ASCII bytes always have a grapheme-cluster break between them (GB999), so safe-ASCII runs can be counted at one grapheme per byte without entering the UAX #29 state machine. On pure-ASCII text this is roughly 10x faster (~0.38 ms vs. ~3.85 ms for 1 MB of English), and ~5-6x faster on ASCII-dominant mixed text (Spanish UN charter). Pure non-ASCII text (Arabic, Russian, Chinese) is unchanged.

- Variadics of types have been moved to the `TypeList` struct. One can write operations such as:

  ```mojo
  comptime assert TypeList[Trait=AnyType, Int, String]().contains[Bool]
  ```

- `abort(message)` now includes the call site location in its output. The location is automatically captured and printed alongside the message. You can also pass an explicit `SourceLocation` to override it:

  ```mojo
  abort("something went wrong")
  # prints: ABORT: path/to/file.mojo:42:5: something went wrong

  var loc = current_location()
  abort("something went wrong", location=loc)
  ```

- `abort(message)` now prints its message on NVIDIA and AMD GPUs, including block and thread IDs. Previously, the message was silently suppressed on these GPUs. On Apple GPUs, the message is still silently suppressed for now.

- `SourceLocation` fields (`line`, `col`, `file_name`) are now private. Use the new accessor methods `line()`, `column()`, and `file_name()` instead.

- Fixed default alignment in `TileTensor.load()` and `TileTensor.store()` to use the caller-specified `width` parameter instead of `Self.element_size`.

- Added uninitialized memory read detection for float loads. When compiled with `-D MOJO_STDLIB_SIMD_UNINIT_CHECK=true`, every float load is checked against the debug allocator's poison pattern (the largest finite value of the float type, e.g. `FLT_MAX` for `Float32`). A match triggers `abort()` with a descriptive message. The non-NaN poison pattern lets `nan-check` and `uninit-read-check` coexist (a NaN poison would be flagged by `nan-check` as a legitimate NaN error in kernels that intentionally write only active positions). When disabled (the default), there is zero runtime overhead. For MAX pipelines, set `MODULAR_MAX_DEBUG_UNINITIALIZED_READ_CHECK=true` to enable both the debug allocator and the load-time checks automatically.

- Expanded Apple Silicon GPU support. Apple Metal GPU is now a more capable Mojo target.
  - `print()` and `_printf()` now work on Apple Metal GPUs. Output is chunked through the Metal `os_log` path, with a Float32-only formatter that matches Metal's hardware constraints. `_printf()` currently emits the format string only (not interpolated arguments); `|x| < 1e-7` is truncated to `0.0`, and there is no scientific notation.
  - `external_memory[]()` (dynamic threadgroup memory) is now supported on Apple Silicon. The compiler and runtime bridge CUDA-style extern shared symbols to Metal's `setThreadgroupMemoryLength:atIndex:` model, so existing GPU kernels using `external_memory[]()` work unchanged.
  - Apple M5 MMA intrinsics (`apple_mma_load`, `apple_mma_store`, `_mma_apple`) in `std.gpu.compute.arch.mma_apple` enable hardware matrix multiply-accumulate on Apple GPUs.
  - Added `CompilationTarget.is_apple_m5()` to `std.sys` for detecting Apple M5 targets at compile time; `is_apple_silicon()` now includes M5 in its check.
  - Apple GPU targets now prefer `metal4` features by default when the toolchain supports them, automatically appending `-metal4` to the arch instead of requiring explicit `m5-metal4` selection.
  - Atomic ordering: `release` ordering is not supported on Metal. Apple GPU targets now use `monotonic` (relaxed) atomic ordering by default.
  - Floating-point widths: the compiler now rejects floating-point types wider than 32 bits (`Float64`/`Float80`/`Float128`) for Apple GPU targets, since Metal supports only `Float16` and `Float32`.
- Standard library types now use conditional conformances, replacing previous `_constrained_conforms_to` checks:
  - `Span`: `Writable`, `Hashable`
  - `Tuple`, `Optional`, `Variant`, and `UnsafeMaybeUninit`: `RegisterPassable`
  - `Variant`: `Copyable`, `ImplicitlyCopyable`
  - `Optional`: `DevicePassable` (conditional on element type)
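  As an illustration of what a conditional conformance enables (a hedged sketch; the generic function and the call are illustrative, not taken from the release notes):

  ```mojo
  # Span[T] is Hashable only when T is Hashable, so a generic function
  # bounded on Hashable can now accept spans of hashable elements directly.
  def hash_it[T: Hashable](value: T) -> UInt:
      return hash(value)

  def main():
      var xs = [1, 2, 3]
      # OK: Span[Int] conforms to Hashable because Int does.
      print(hash_it(Span(xs)))
  ```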
- `Tuple` now conditionally conforms to `Defaultable`, so generic `T: Defaultable` code can default-construct tuples when all element types are `Defaultable`.

- `OwnedDLHandle.get_symbol()` now returns `Optional[UnsafePointer[...]]` instead of aborting when a symbol is not found. This allows callers to handle missing symbols gracefully.

- `UnsafePointer` is now non-null by design. See the non-null pointer proposal for the full design and migration timeline.

  The default null constructor `__init__(out self)` and the `__bool__(self)` method are now deprecated, and `UnsafePointer` no longer conforms to `Defaultable` or `Boolable`.

  To migrate, express nullability explicitly with `Optional[UnsafePointer[...]]`, which has the same layout as `UnsafePointer` (the null address is the `None` niche), so nullable pointers remain zero-overhead and can be used across C FFIs.

  ```mojo
  # Before: null default construction
  var ptr = UnsafePointer[Int, origin]()
  # After: express absence with Optional
  var ptr: Optional[UnsafePointer[Int, origin]] = None

  # Before: Bool-based null check
  if ptr:
      use(ptr[])
  # After: check the Optional, then unwrap
  if ptr:
      use(ptr.value()[])
  ```

  If you specifically need a non-null placeholder for a field that will be populated later (for example, a buffer that is allocated on demand), use `UnsafePointer.unsafe_dangling()`, which returns a well-aligned but dangling pointer. Note that `unsafe_dangling()` is not a null sentinel: types that lazily allocate must track initialization separately.

- GPU primitive id accessors (`thread_idx`, `block_idx`, `block_dim`, `grid_dim`, `global_idx`, `lane_id`, `warp_id`, `cluster_dim`, `cluster_idx`, and `block_id_in_cluster`) have migrated from `UInt` to `Int`.

  This is part of a broader migration to standardize on the `Int` type for all sizes and offsets in Mojo. As a related step in the same migration, `TensorCore.load_a()` and `TensorCore.load_b()` now also take `Int` arguments instead of `UInt`.

  To provide a gradual migration path, explicitly typed `*_uint` aliases of the seven non-cluster accessors are available temporarily:

  | Accessor | Legacy `UInt` alias |
  | --- | --- |
  | `thread_idx` | `thread_idx_uint` |
  | `block_idx` | `block_idx_uint` |
  | `block_dim` | `block_dim_uint` |
  | `grid_dim` | `grid_dim_uint` |
  | `global_idx` | `global_idx_uint` |
  | `lane_id` | `lane_id_uint` |
  | `warp_id` | `warp_id_uint` |

  The three cluster accessors (`cluster_dim`, `cluster_idx`, `block_id_in_cluster`) migrated directly without `*_uint` aliases, since their usage was limited.

  Code can preserve its prior `UInt` behavior by using a renaming import of the `*_uint` alias:

  ```diff
  - from std.gpu import thread_idx
  + from std.gpu import thread_idx_uint as thread_idx
  ```

  The temporary `*_int` accessors that briefly existed during the phased migration as a forward-compatibility aid have been removed; use the unprefixed accessors (which now return `Int` by default). The `*_uint` aliases will eventually be deprecated and removed as well.

- Added an `IterableOwned` trait to the iteration module. Types conforming to `IterableOwned` implement `__iter__(var self)`, which consumes the collection and returns an iterator that owns the underlying elements.
  - `List` now conforms to `IterableOwned`.
  - `Optional` now conforms to `IterableOwned`.
  - `Deque` now conforms to `IterableOwned`.
  - `LinkedList` now conforms to `IterableOwned`.
  - `Dict` now conforms to `IterableOwned`.
  - `Set` now conforms to `IterableOwned`.
  - `Counter` now conforms to `IterableOwned`.
  - `InlineArray` now conforms to `IterableOwned`.
  - `Span` now conforms to `IterableOwned` (conditional on `T: Copyable`). The owned iterator yields copies of elements by value.
  - Iterator adaptors (`enumerate`, `zip`, `map`, `peekable`, `take_while`, `drop_while`, `product`, `cycle`, `count`, `repeat`) now conform to `IterableOwned`.
  - Added owned overloads of `enumerate()`, `zip()`, `map()`, `peekable()`, `take_while()`, `drop_while()`, `product()`, and `cycle()` that consume the input iterable.
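  A hedged sketch of consuming iteration via `IterableOwned` (the `__iter__(var self)` signature is from the entry above; the explicit transfer with `^` and the loop spelling are assumptions about how the owned overload is invoked):

  ```mojo
  def main():
      var names = [String("ada"), String("grace")]
      # Transferring `names` into iteration consumes the list via
      # __iter__(var self); the loop then owns each String element,
      # so no copies of the strings are made.
      for name in names^:
          print(name)
      # `names` is no longer usable here: it was consumed above.
  ```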
- `CStringSlice` can no longer represent a null pointer. To represent nullability, use `Optional[CStringSlice]`, which is guaranteed to have the same size and layout as `const char*`, where `NULL` is the empty `Optional`.

- `external_call`'s `return_type` requirement has been relaxed from `TrivialRegisterPassable` to `RegisterPassable`.

- Negative indexing on all stdlib collections has been removed to enable cheap CPU bounds checks by default:
  - `List`
  - `Span`
  - `InlineArray`
  - `String`
  - `StringSlice`
  - `LinkedList`
  - `Deque`
  - `IntTuple`

  Using a negative `IntLiteral` for indexing will now trigger a compile-time error, for example:

  ```text
  /tmp/main.mojo:3:12: note: call expansion failed with parameter value(s): (..., ...)
      print(x[-1])
            ^
  constraint failed: negative indexing is not supported, use e.g. `x[len(x) - 1]` instead
  ```

  Update any `x[-1]` to `x[len(x) - 1]`, following the compiler errors to your call sites as above.

  This does not affect any MAX ops that support negative indexing.
- Bounds checking is now on by default for all collections on CPU, and will show you the call site in your code where you triggered the out-of-bounds access:

  ```mojo
  def main():
      var x = [1, 2, 3]
      print(x[3])
  ```

  ```text
  At: /tmp/main.mojo:3:12: Assert Error: index 3 is out of bounds, valid range is 0 to 2
  ```

  Bounds checking is still off by default on GPU to avoid performance penalties. To enable it for tests:

  ```shell
  mojo build -D ASSERT=all main.mojo
  ```

  To turn off all asserts, including CPU bounds checking:

  ```shell
  mojo build -D ASSERT=none main.mojo
  ```

- `alloc[T](count, alignment)` will now `abort` if the underlying allocation failed.

- Added a `Variadic.contains_value` comptime alias to check whether a variadic sequence contains a specific value at compile time.

- `ArcPointer` now conditionally conforms to `Hashable` and `Equatable` when its inner type `T` does. Both `__eq__` and `__hash__` delegate to the managed value, matching C++ `shared_ptr` and Rust `Arc` semantics. This makes `ArcPointer` usable as a `Dict` key or `Set` element with value-based equality. Pointer identity is still available via the `is` operator.

- `Path` now conforms to `Comparable`, enabling lexicographic ordering and use with `sort()`.

- `range()` overloads that took differently-typed arguments, or arguments that were `Intable`/`IntableRaising` but not `Indexer`, have been removed. Callers should ensure they're passing consistent integral argument types when calling `range()`.

- `Consistency` now has a default constructor that selects `RELEASE` ordering on Apple GPUs and `SEQUENTIAL` on all other targets. All `Atomic` methods and `fence` use this platform-aware default instead of hard-coding `SEQUENTIAL`.

- `NDBuffer` has been fully removed. Please migrate to `TileTensor`.

- Added a generic `__contains__` method to `Span` for any element type conforming to `Equatable`, not just `Scalar` types.

- Fixed `blocked_product` in `tile_layout` to zip block and tiler dimensions per mode, matching the legacy `blocked_product` behavior.

- Added `Span`-based overloads for `enqueue_copy`, `enqueue_copy_from`, and `enqueue_copy_to` on `DeviceContext`, `DeviceBuffer`, and `HostBuffer`, providing a safer alternative to raw `UnsafePointer` for host-device memory transfers.

- `String.__len__()` has been deprecated. Prefer `String.byte_length()` or `String.count_codepoints()`.

- Added `map()` and `and_then()` methods to `Optional`. `map()` transforms the contained value by applying a function, returning `Optional[To]`. `and_then()` chains operations that themselves return an `Optional`, enabling flat-mapping over fallible computations.

  ```mojo
  var o = Optional[Int](42)

  def closure(n: Int) {} -> String:
      return String(n + 1)

  var mapped: Optional[String] = o.map[To=String](closure)
  print(mapped)  # Optional("43")
  ```

- Added `std.memory.forget_deinit()` to enable low-level code to skip the usual requirement to run a destructor for a value. This function should be used rarely, when building low-level abstractions.

- CPU `DeviceContext` expansion. `DeviceContext(api="cpu")` is now usable as a stream-ordered execution context for CPU work, paving the way for NUMA-aware CPU dispatch.
  - Added `DeviceContext.enqueue_cpu_function()` and `DeviceContext.enqueue_cpu_range()` for stream-ordered execution of host functions on CPU `DeviceContext` instances. `enqueue_cpu_function` enqueues a single host function; `enqueue_cpu_range` enqueues a parallel range whose tasks run concurrently but are stream-ordered relative to surrounding work. Argument passing is not yet supported.
  - `parallelize`, `parallelize_over_rows` (in `std.algorithm.backend.cpu.parallelize`), and the `elementwise` overloads in `std.algorithm.functional` now accept an optional trailing `ctx: Optional[DeviceContext] = None`. When supplied, the provided CPU `DeviceContext` is forwarded to `sync_parallelize`; when omitted, the previous behavior is preserved.
  - Added a `parallelism_level()` overload that takes a CPU `DeviceContext` and returns the thread-pool size for that specific context, enabling NUMA-specific introspection.
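  A minimal sketch of the CPU stream-ordered path (hypothetical usage: the import path, the closure spelling, and the `synchronize()` call are assumptions not spelled out in this entry):

  ```mojo
  def main() raises:
      # A DeviceContext backed by the CPU; work enqueued on it is
      # stream-ordered, like GPU work on a CUDA stream.
      var ctx = DeviceContext(api="cpu")

      # Argument passing is not yet supported, so the enqueued function
      # takes no arguments; state must come from captures.
      def work() {}:
          print("stream-ordered CPU task")

      ctx.enqueue_cpu_function(work)
      ctx.synchronize()  # wait for all enqueued work to finish
  ```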
- Readable GPU kernel names in profilers. GPU kernels in the standard library and across MAX kernels (elementwise, GEMV, multistage matmul, attention, convolution, MoE, normalization, quantization, BMM, grouped matmul, SM100 matmul, AMD matmul, communication, and sampling) now expose human-readable names in profiler traces such as Nsight Systems, replacing previously mangled KGEN symbols.
- `tile_io` module for `TileTensor` data movement. Added a `tile_io` module providing `TileTensor` copier traits and copy utilities for moving data between memory hierarchies (DRAM/SRAM). The module includes:
  - `GenericToSharedAsyncTileCopier`, which moves a `TileTensor` from generic memory into shared memory via NVIDIA's `cp.async`. On AMD and Apple GPUs the underlying `async_copy` falls back to synchronous loads/stores.
  - An optional `swizzle: Swizzle` parameter on `GenericToSharedAsyncTileCopier`, mirroring the swizzled write path in `LocalToSharedTileCopier`.
  - A `masked: Bool = False` parameter on `GenericToSharedAsyncTileCopier`. When enabled, out-of-bounds vectors receive a zero-byte copy with zero-fill, matching `LayoutTensor.copy_from_async[is_masked=True, fill=Fill.ZERO]`.
  - An `AsyncTileCopier` trait abstracting copier conformance.
- TMA `gather4` for sparse 2D tensor loads. Added a TMA `gather4` operation on SM100 (Blackwell) for loading 4 non-contiguous rows from a 2D tensor in a single TMA instruction, surfaced as the `cp_async_bulk_tensor_2d_gather4` intrinsic in `std.gpu.memory` and integrated with `TMATensorTile`. The API supports:
  - Full 2D tile sparse loads with arbitrary `tile_height` (multiple of 4) and `tile_width`, replacing the prior 4-row-per-call limit.
  - Arbitrary `row_width`, previously restricted to the swizzle box width. The API automatically computes the box width from the swizzle constraint and supports non-divisible widths via TMA hardware zero-fill on the last column group, so kernels no longer need to hand-code column-group loops.
- 1D TMA instructions for SM90+ NVIDIA GPUs. Added 1D TMA (Tensor Memory Accelerator) instruction support in `std.gpu.memory`. 1D TMA copies do not require a pre-allocated tensormap object on the host, providing greater flexibility than the existing 2D–5D TMA path. New functions: `cp_async_bulk_shared_cluster_global`, `cp_async_bulk_global_shared_cta`, `cp_async_bulk_prefetch`, and `cp_async_bulk_reduce_global_shared_cta` (the 1D counterpart to `cp_async_bulk_tensor_reduce_global_shared_cta`, which reduces floating-point values from shared memory into global memory; ADD only).

- `TileTensor` API extensions.
  - Added `TileTensor.bitcast[target_dtype]()`, which returns a new `TileTensor` viewing the same storage and layout under a different element dtype, replacing the `TileTensor(x.ptr.bitcast[Scalar[T]](), x.layout)` idiom.
  - Added `TileTensor.flat_load` and `TileTensor.flat_store` as raw-flat accessors that read and write the underlying storage at a linear offset, bypassing the tensor's layout.
  - Added a `TileTensor.tile()` overload that takes the tile shape as a runtime/parameter argument, complementing the existing tile APIs.
  - GPU `TileTensor.load()` and `load_linear()` now default `invariant=True` for immutable tensors, enabling the compiler to use `ldg` for read-only memory accesses.
  - Added compile-time bounds checks to `TileTensor`, `ManagedTensorSlice`, and `crd2idx` to catch out-of-range coordinate accesses at compile time.
- Layout library extensions.
  - Added a compile-time `coalesce` function for `TensorLayout`, mirroring the legacy `Layout.coalesce` algorithm (skip shape-1 dims and merge contiguous dims).
  - Added `write_repr_to` to `Layout` for writing a debug representation to a `Writer`.
  - `vectorize` and `distribute` now accept layouts with runtime dimensions.
  - `row_major` now accepts coord-like arguments directly, no longer requiring them to be wrapped in tuples.
  - Introduced weakly compatible layouts, enabling structural compatibility comparisons between layouts and coordinate indices (up to depth 4). Structural equality is now checked via a `comptime assert` rather than a `where` clause.
  - Changed `CoordLike.value()` to return `Scalar[Self.DTYPE]` instead of `Int`, providing a more expressive return type for layout coordinate values.
  - `Coord`, `RowMajorLayout`, and `ColMajorLayout` once again take their parameters as variadic arguments, improving ergonomics when specifying individual coords. Use `*splat` to pass an existing list.
- Several standard library APIs that previously took legacy closures now have unified-closure overloads: `parallelize` and `parallelize_over_rows` (in `std.algorithm.backend.cpu.parallelize`), `bench.bencher`, `DeviceContext.execution_time`, and `DeviceContext.enqueue_function` (the GPU enqueue path, renamed from the previous `enqueue_closure`).

- GPU device APIs.
  - Added support for NVIDIA B300 (`sm_103a`) accelerators. New helpers in `std.sys.info` and `std.gpu.host.info` recognize B300 targets so kernels can dispatch correctly on the Blackwell B300 architecture.
  - Added `DeviceStream.enqueue_host_func(func, user_data)`, exposing the `cuLaunchHostFunc` primitive for Mojo kernels and custom ops. Takes a `thin def(OpaquePointer[MutAnyOrigin]) -> None` callback and an opaque `user_data` pointer. CUDA-only today; non-CUDA backends raise.
  - `DeviceContext` initialization now runs an automatic GPU health check that detects hardware throttling, uncorrectable ECC errors, and zombie VRAM, and fails device creation with an actionable error message on unhealthy GPUs. Added `DeviceContext.run_healthcheck()` to re-invoke the check explicitly. Set `MODULAR_DEVICE_CONTEXT_DISABLE_HEALTHCHECK=true` to disable.
  - Optimized GPU `elementwise` index computation and dispatch with a `use_32bit` fast path, 4× unrolled grid-stride processing, warp-aligned block sizes, and SM100+ single-tile routing.
- AMD GPU intrinsics.
  - Added the `ds_read_tr8_b64` AMD GPU intrinsic in `std.gpu.intrinsics`, performing a 64-bit LDS transpose load of 8-bit elements via `llvm.amdgcn.ds.read.tr8.b64`. Supported on AMD CDNA4+ GPUs.
  - Added a `Scalar[dtype]` overload of `readfirstlane` so callers no longer need bitcast workarounds to broadcast non-`Int32` scalar values across an AMD GPU wavefront.
  - `AMDBufferResource.load_to_lds` in `std.gpu.intrinsics` now lowers to the `.ptr.` form of the AMDGPU buffer-load-to-LDS intrinsic, fixing a strided-layout regression on MLA layouts where `cache_depth != depth` and `head_dim_offset != 0`. A new `async_copies: Bool = False` parameter opts in to attaching the `amdgpu.AsyncCopies` alias scope on the load, enabling LLVM `vmcnt` relaxation.
  - Added a `broadcast=True` parameter to GPU `warp_id()` (and related id accessors) so callers can avoid manual `warp.broadcast(warp_id())` patterns.
- Math, debug, and stdlib APIs.
  - `align_down` and `align_up` now accept generic `SIMD[dtype, width]` integer values, replacing the previous `UInt`-only overloads.
  - Extended `FastDiv` and `mulhi` to support 64-bit integer types, with NVIDIA-specific `llvm.nvvm.mulhi.ull/ll` intrinsics and 128-bit arithmetic on other targets.
  - Added `check_bounds` for collections that asserts on out-of-range indices and reports the user's call site instead of stdlib source.
  - `debug_assert` now accepts a `call_location` parameter, allowing callers to override the reported `SourceLocation` so assertion errors can point to user code rather than stdlib internals.
  - Swapped the ordering arguments of `Atomic.compare_exchange` so `success_ordering` is listed before `failure_ordering`, matching the convention used by C++, Rust, and other languages.
  - `InlineArray`'s storage constructor now uses `debug_assert[assert_mode="safe"]` for the element-count check, so size mismatches are caught by default instead of only with `-D ASSERT=all`.
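The arithmetic behind two of the entries above is language-agnostic. Here is a Python model (the function names mirror the stdlib ones for readability, but these are sketches of the math, not the Mojo signatures):

```python
def align_down(x: int, alignment: int) -> int:
    # Round x down to a multiple of a power-of-two alignment.
    return x & ~(alignment - 1)

def align_up(x: int, alignment: int) -> int:
    # Round x up to a multiple of a power-of-two alignment.
    return (x + alignment - 1) & ~(alignment - 1)

def mulhi_u64(a: int, b: int) -> int:
    # High 64 bits of the full 128-bit product of two unsigned 64-bit
    # integers -- the quantity the llvm.nvvm.mulhi.ull intrinsic computes.
    return ((a & 0xFFFFFFFFFFFFFFFF) * (b & 0xFFFFFFFFFFFFFFFF)) >> 64

print(align_down(70, 64), align_up(70, 64))  # 64 128
print(mulhi_u64(2**63, 4))                   # 2
```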
Tooling changes
- The Mojo debugger now shows a `Variant` variable's active type name and value in LLDB — e.g. `Int(42)` or `String("hello")` — instead of exposing raw `_DefaultVariantStorage` internals.
- The Mojo debugger now displays scalar types (e.g. `UInt8`, `Float32`) as plain values instead of `([0] = value)`, and elides internal `_mlir_value` wrapper fields from struct display.
- `mojo format` no longer supports the deprecated `fn` keyword, nor the removed `owned` argument convention.
- Comptime function calls now print more nicely in error messages and generated documentation, not including `VariadicList`/`VariadicPack` and including keyword argument labels when required.
- LSP and REPL responsiveness.
  - Code completion and signature help in REPL/notebook contexts are now amortized O(1) per request by caching parsed prior cells across requests, eliminating quadratic O(N²) slowdown in long sessions.
  - LSP parse time for files with docstring code blocks (e.g. `dict.mojo`) is roughly 2× faster, using signature-only resolution for transitive dependencies inside docstring code blocks.
  - LSP parse time is further reduced by deferring body resolution of imported bytecode declarations and resolving named imports lazily, avoiding eager pulls of large transitive dependencies.
- `mojo` CLI and toolchain.
  - `mojo --version` now prints a semantic Mojo version (for example, `1.0.0...`) instead of an internal build identifier, and the same version is used wherever the compiler performs version checks.
  - `mojo build --print-supported-targets` now lists registered targets sorted alphabetically, with a graceful empty-list message.
  - The compiler now selects the target's baseline CPU when cross-compiling with `--target-triple` without `--target-cpu` and the host and target architectures differ.
  - ASAN-instrumented Mojo binaries on macOS now use `llvm-symbolizer` instead of `atos`, so stack traces report the full inlined call chain through user functions.
- `mojo doc` and docstring validation.
  - `mojo doc` now preserves parameterized type names (for example `List[K]`, `Optional[V]`, `UnsafePointer[Scalar[dtype]]`) in the API doc JSON `"type"` fields, instead of emitting only the bare base name.
  - `mojo doc` now emits a diagnostic when a public Mojo module has no module-level docstring and `-mojo-diagnose-missing-doc-strings` is active. Private modules and modules nested inside private packages are exempt.
  - Docstring validation no longer requires inferred parameters (those before `//` in a parameter list) to be documented; documenting them remains valid.
  - Docstring validation now accepts `!` and `?` as valid sentence-ending punctuation for summaries, section bodies, and argument descriptions.
  - `def ... raises` functions now require a `Raises:` docstring section like any other raising function, and the `isDef` field has been removed from `mojo doc` JSON output.
- Debugger UX.
  - The Mojo debugger now displays `Optional[T]` variables as `None` or `Some(value)` in LLDB instead of exposing raw `_DefaultVariantStorage` internals.
  - The Mojo debugger now correctly displays `UnsafePointer[T]` values in LLDB for all pointed-to types, including signed integers (no longer rendered as huge unsigned values), `Bool` (`True`/`False`), and floats.
  - The Mojo debugger now displays `StringSlice`, `StaticString`, and their underlying `Span[Byte]` values as quoted strings in LLDB.
  - At `-O0`, trivially destructible types (`Int`, `Float`, `Bool`, `SIMD`, etc.) now remain visible in the debugger through the end of their lexical scope instead of disappearing at the ASAP destruction point.
- `mojo format` (mblack) now correctly parses the new unified-closure syntax including `raises {captures}` effect ordering, and no longer inserts a spurious space between `^` and the operand in `var^` captures.
- Mojo package files (`.mojopkg`) now use format version 2 with zstd-compressed MLIR bytecode, significantly reducing package, wheel, and Docker image sizes.
- Added a `--mojo-version` flag to `mojo-lsp-server` for verifying the Mojo version that the LSP is using.
- Removed the legacy `MOJO_ENABLE_STACK_TRACE_ON_ERROR` and `MOJO_ENABLE_STACK_TRACE_ON_CRASH` environment variables. Instead, set the `MODULAR_DEBUG` environment variable to `stack_trace_on_error` to enable generation of stack traces when a Mojo program raises an error.
GPU programming
- Added support for AMD MI250X accelerators.
❌ Removed
- The `escaping` function effect is no longer supported. Migrate `def(...) escaping -> T` closures to use an explicit capture list `{...}` (see the closure refactor entry under Language enhancements).
- The deprecated `@doc_private` decorator has been removed. Use `@doc_hidden` instead.
- Removed the `store_release`, `store_relaxed`, `load_acquire`, and `load_relaxed` helpers from `std.gpu.intrinsics`. Use `Atomic[dtype, scope=...].store` and `Atomic[dtype, scope=...].load` with the desired `Ordering` instead:

  ```mojo
  # Before
  from std.gpu.intrinsics import store_release, load_acquire

  store_release[scope=Scope.GPU](ptr, value)
  var v = load_acquire[scope=Scope.GPU](ptr)

  # After
  from std.atomic import Atomic, Ordering

  Atomic[dtype, scope="device"].store[ordering=Ordering.RELEASE](ptr, value)
  var v = Atomic[dtype, scope="device"].load[ordering=Ordering.ACQUIRE](ptr)
  ```
- Several constructs deprecated in 26.2 are no longer accepted:
  - The `@register_passable` and `@register_passable("trivial")` decorators are no longer supported. Conform to the `RegisterPassable` and `TrivialRegisterPassable` traits instead. Use of either decorator now produces a hard error pointing to the trait equivalent.
  - The legacy `__moveinit__` and `__copyinit__` method names are no longer auto-rewritten to the unified `__init__` form. Rename these methods to `__init__` with keyword-only `take: Self` and `copy: Self` arguments, respectively, as introduced by init unification in 26.2. Existing legacy spellings now fail to compile with errors such as `no matching function in initialization` rather than being silently rewritten.
- API removals beyond the deprecation removals already noted:
  - Removed the `param_env.mojo` module. Use `defines.mojo` instead.
  - Removed `LinkedList.__getitem__`. Indexing a `LinkedList` is O(n), and exposing `__getitem__` encouraged accidentally quadratic code; iterate the list instead.
  - Removed the unused `UIntSized` trait and its prelude re-export.
  - Removed the `pdl_level` parameter from `elementwise`, `reduction`, and `reducescatter` kernel APIs. PDL usage is now an internal compile-time default.
🛠️ Fixed
- Fixed `math.sqrt` on `Float64` on NVIDIA GPU producing a cryptic `could not find LLVM intrinsic: "llvm.nvvm.sqrt.approx.d"` failure at LLVM IR translation time. `math.sqrt` now rejects `Float64` on NVIDIA GPU at compile time with the message `DType.float64 isn't supported for approx sqrt on NVIDIA GPU`. The existing `math.sin` and `math.cos` constraint messages were also sharpened to name the op (`DType.float64 isn't supported for sin/cos on NVIDIA GPU`). (Issue #6434)
- Fixed pack inference failing with `could not infer type of parameter pack ... given value with unresolved type` when passing list, dict, set, or slice literals to a `*Ts`-bound variadic pack parameter (e.g. `def foo[*Ts: Iterable](*args: *Ts)`). Pack inference now applies the same default-type fallback that single-argument trait-bound parameters already use, so `foo([1, 2, 3], [4, 5, 6])` resolves each literal to its default type (e.g. `List[Int]`) before binding the pack.
- Fixed `mojo` aborting at startup with `std::filesystem::filesystem_error` when `$HOME` is not traversable by the running UID (common in containerized CI where the image's build-time UID differs from the runtime UID). The config search now treats permission errors as "not found" and falls through to the next candidate. (Issue #6412)
- `mojo run` and `mojo debug` now honor `-Xlinker` flags by loading the referenced shared libraries into the in-process JIT. Previously the flags were dropped (with a `-Xlinker argument unused` warning), leaving programs that called into external shared libraries via `external_call` unable to resolve those symbols at runtime (so `mojo build` worked but `mojo run` did not). The supported forms mirror what the system linker accepts: `-Xlinker -L<dir>`, `-Xlinker -l<name>`, `-Xlinker -rpath <dir>`, and `-Xlinker <absolute-path-to-shared-library>`. Flags that have no meaning under JIT are reported as a warning and ignored. (Issue #6155)
- Fixed `libpython` auto-discovery failing for Python 3.14 free-threaded builds. The discovery script constructed the library filename without the ABI flags suffix (e.g. looked for `libpython3.14.dylib` instead of `libpython3.14t.dylib`). (Issue #6366)
- Fixed `RTLD.LOCAL` having the wrong value on Linux. It was set to `4` (`RTLD_NOLOAD`) instead of `0`, causing `dlopen` with `RTLD.NOW | RTLD.LOCAL` to fail. (Issue #6410)
- Fixed `mojo format` crashing after upgrading Mojo versions due to a stale grammar cache. (Issue #6144)
- Fixed `atof` producing incorrect results for floats near the normal/subnormal boundary (e.g., `Float64("4.4501363245856945e-308")` returned half the correct value). (Issue #6196)
- Fixed a compiler crash (`'get_type_name' requires a concrete type`) when using default `Writable`, `Equatable`, or `Hashable` implementations on structs with MLIR-type fields (e.g. `__mlir_type.index`). The compiler now correctly reports that the field does not implement the required trait. (Issue #5872)
- Fixed `Atomic.store` silently dropping the requested `scope`. The previous implementation lowered to `atomicrmw xchg` without forwarding `syncscope`, so `Atomic[..., scope="device"].store(...)` was emitting a system-scope store on NVPTX (extra L2/NVLink fences) and an over-synchronized store on AMDGPU. `Atomic.store` now lowers via `pop.store atomic syncscope(...)`, emitting `st.release.<scope>` on NVPTX and a properly-scoped LLVM atomic store on AMDGPU. The Mojo API surface is unchanged.
- Fixed `Process.run()` not inheriting the parent's environment variables. Child processes spawned via `Process.run()` now correctly receive the parent's environment.
- Fixed `\xhh` and `\ooo` escape sequences in string literals being interpreted as raw bytes instead of Unicode code points, which produced malformed UTF-8 for values `>= 0x80`. The escapes now match Python `str` semantics (and the existing `\u`/`\U` handling): `"\x85"` encodes U+0085 (NEL) as two UTF-8 bytes and `ord("\x85")` returns `133` instead of `5`. Code that relied on `\xhh` to emit a single raw byte for non-ASCII values must construct the bytes explicitly (for example via a `List[Byte]` literal). (Issue #2842)
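Since the entry above says the new behavior matches Python `str` semantics exactly, the expected results can be checked against Python itself:

```python
# \x85 denotes the code point U+0085 (NEL), not a raw 0x85 byte.
s = "\x85"
print(ord(s))                   # 133
print(s.encode("utf-8"))        # b'\xc2\x85' -- two bytes, well-formed UTF-8
print(len(s.encode("utf-8")))   # 2
```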
- Fixed incorrect data layout for `MI250X` AMDGPU architectures. (Issue #6451)
- Fixed Apple Silicon target detection on macOS 26 producing unrecognized arch strings like `metal:2-metal4` when the installed Xcode could not compile Metal 4.0; the `-metal4` suffix is now applied only when the toolchain supports it.
- Fixed `UnsafePointer.gather`, `UnsafePointer.scatter`, and `strided_load` silently reading zero on Apple GPU. The per-lane fallback reconstructed pointers via `unsafe_from_address=Int(addr)`, yielding a generic-address-space pointer the Apple AIR backend could not resolve. The fallback now uses typed pointer arithmetic on Apple GPU; NVIDIA, AMD, and CPU paths are unchanged.
- Fixed `rotate_left` and `rotate_right` intrinsics failing to lower on Apple GPU. Both now lower correctly to the Apple AIR backend.
- Fixed `TileTensor.write_to()` only handling 2D static-shape tensors; 1D, 3D+, nested-layout, and dynamic-shape tensors now print correctly via a generic elementwise fallback, and all ranks use a bracket-delimited, comma-separated format.
- Fixed incorrect alignment in `TileTensor.__getitem__`.
- Fixed `TileTensor` SIMD loads/stores on CPU to use `alignment=1`, preventing segfaults when underlying data is not naturally aligned. GPU still uses aligned access where the layout guarantees alignment.
- Fixed `complement()` in `tile_layout` returning a static shape of `0` when given `UNKNOWN_VALUE` as the size; it now propagates `UNKNOWN_VALUE` so downstream layout algebra falls back to runtime dimensions, restoring correct bounds checks for `LayoutTensor.flatten().vectorize[N]()`.
- Fixed `idx2crd` returning incorrect coordinates for nested layouts.
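For readers unfamiliar with `idx2crd`, the flat (non-nested) case maps a linear index to per-dimension coordinates. A Python sketch of that mapping, assuming the innermost-dim-first (colexicographic) convention common in layout algebra — the convention itself is an assumption here, not taken from the entry above:

```python
def idx2crd(idx: int, shape: tuple) -> tuple:
    # Peel off one dimension at a time, innermost (fastest-varying) first.
    crd = []
    for s in shape:
        crd.append(idx % s)
        idx //= s
    return tuple(crd)

print(idx2crd(5, (2, 3)))  # (1, 2)
```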
- Fixed `mojo --version` printing the MAX version instead of the Mojo compiler version.
- Fixed `comptime` `and`/`or` expressions to accept any `Boolable` operands, matching runtime behavior. This also enables mixed-type expressions like `comptime if some_Bool and some_Optional`.
- Fixed several codegen correctness issues affecting valid Mojo programs: an SRoA miscompile that incorrectly promoted arrays accessed via dynamic offsets through a constant GEP; a use-after-free where destructors of live owned values were inserted before, rather than after, a `lit.ref.store` into a ref with `#lit.any.origin`; silent memory corruption when calling `abi("C")` functions that returned structs via `sret`; and bogus `existing function with conflicting attributes` errors when calling the same external function more than once with an `sret`/`byval` ABI.
- Fixed several `mojo-lsp-server` crashes affecting REPL/notebook contexts, parameter-pack-related diagnostics, files importing from `.mojopkg`, and files using stateless closures. The LSP also no longer mistakes REPL buffer identifiers (which contain a `.mojo` extension) for relative module imports.
- Fixed several debugger display issues: variables after their ASAP destruction point at `-O0` now correctly show "not available" instead of stale values; unsigned integers (`UInt`, `UInt8`, etc.) display with correct unsigned semantics; `ref` loop variables show `index` instead of `pointer<index>`; `String` fields typed as `Scalar[T]` and `Tuple` values display correctly.
- Fixed two `mojo format` (mblack) issues: it no longer loses the `t` prefix when splitting long t-string literals across lines, and no longer inserts a stray space between `*` and a complex operand in variadic pack unpacking annotations.
- Fixed `BitSet.set_all` and `BitSet.toggle_all` writing `~0` to every underlying 64-bit word, including bits beyond the logical `size` when `size` was not a multiple of 64. Those stray high bits were counted by `__len__`, producing incorrect population counts; the methods now mask off the unused high bits.
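The masking described above is the standard final-partial-word trick. A small Python model of the corrected bookkeeping (hypothetical helpers for illustration, not the Mojo implementation):

```python
WORD = 64
FULL = (1 << WORD) - 1  # a word with all 64 bits set (~0 as unsigned)

def set_all_words(size: int) -> list:
    # Fill every backing word, then mask off bits past `size` in the last
    # word so a popcount-based __len__ sees exactly `size` set bits.
    words = [FULL] * ((size + WORD - 1) // WORD)
    if size % WORD:
        words[-1] &= (1 << (size % WORD)) - 1
    return words

def popcount(words: list) -> int:
    return sum(bin(w).count("1") for w in words)

print(popcount(set_all_words(70)))  # 70 -- not 128, as before the fix
```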
- Fixed `syncwarp` on AMD GPUs, which was previously implemented as a no-op. It now lowers to `llvm.amdgcn.wave.barrier`, providing the control-flow synchronization required to correctly sequence shared-memory writes followed by reads across lanes.
- Fixed `isnan`, `isinf`, and `isfinite` failing during LLVM lowering for `float8_e3m4` and `float4_e2m1fn`. `float4_e2m1fn` (no NaN/Inf encodings) folds to constant branches; `float8_e3m4` casts through `bfloat16` to reuse the existing `llvm.is.fpclass` path.