
Mojo v1.0.0b2

Highlights

  • Collections no longer require Copyable elements. The core collection types — List, Deque, LinkedList, InlineArray, Dict, and Set — now accept move-only elements, with Movable & ImplicitlyDestructible as the new minimum bound instead of Copyable. This removes a longstanding source of friction where storing a value in a collection forced it, and its contents, to become copyable. Copy-requiring methods stay gated on Copyable. See Collections and iterators.

  • Trailing where clauses in more places. Trailing where clauses are now supported on struct declarations, on comptime alias declarations, and to discharge constraints from constrained types used anywhere in a signature. A single trailing where can simultaneously constrain a declaration and satisfy the requirements of the types it uses, and the compiler suggests the missing clause when one is needed. See Language enhancements.

  • enqueue_function() and compile_function() take a single kernel argument. DeviceContext.enqueue_function[func]() and compile_function[func]() now accept the kernel parameter once instead of requiring it twice, cleaning up every GPU function callsite. The old two-argument forms and the transitional *_experimental aliases are deprecated. See Device context and execution.

  • Unicode-aware string subscripting. String and StringSlice now support keyword subscripts that index or slice by Unicode codepoint ([codepoint=...]) or by grapheme cluster ([grapheme=...]), so String("🔄🔥🔄")[codepoint=1:2] returns "🔥". This makes correct, encoding-aware text indexing straightforward without manual byte arithmetic. See String and text.

  • Faster Python → Mojo interop. Calls into Mojo from Python now carry significantly less per-call overhead: non-kwargs callables use CPython's METH_FASTCALL convention, PythonObject.__del__() skips the GIL round-trip when the calling thread already holds the GIL, and integer conversions fast-path exact Python int values. No code changes are required to benefit. See Python interoperability.

  • New and expanded documentation. This release adds a Closures page, new TileTensor guides, expanded coverage of partially bound and unbound types and rebind(), and several new reference pages — built-in types, function overloads, closure declarations, CLI feature toggles, and docstrings — plus a downloadable Mojo basics cheat sheet. See Documentation.

  • mojo package is now mojo precompile. The packaging command has been renamed and the .mojopkg extension deprecated in favor of .mojoc — affecting everyone who precompiles Mojo packages, notably custom-op authors. The new .mojoc packages are also significantly smaller, with faster compile and load times. See Tooling changes.

  • Inspect and clear the Mojo compile cache. New mojo --print-cache-location and mojo --clear-cache flags report and purge the on-disk compile cache (.mojo_cache), honoring the standard cache-path precedence. --clear-cache prompts by default; pass -f to skip it for scripting. See Tooling changes.

  • fn is now an error. Uses of the legacy fn keyword now produce a compilation error rather than a warning, completing the def/fn unification: def is Mojo's single function-declaration keyword. Move any remaining fn declarations to def. See Removed.

  • Implicit std imports are now an error. Standard library imports must be fully qualified; the compiler no longer implicitly resolves bare std module names. Besides making imports explicit, this stops the compiler from squatting on names like algorithm and memory, freeing them for user modules. See Language changes.

  • Reflection API restructuring. reflect[T] is now a comptime alias for the Reflected[T] handle type rather than a function, so call sites drop their parentheses (reflect[T].name()), and the deprecated free-function reflection API (get_type_name(), the struct_field_* family, and ReflectedType[T]) has been removed in favor of methods on reflect[T]. A new reflect_fn[func] alias adds parallel function-side reflection. See Reflection.

Documentation

Language enhancements

  • Trailing where clauses are now supported in more declaration contexts:

    • On struct declarations. Constraints are part of the type and checked at every binding site:

      struct SIMD[dtype: DType, size: Int]
      where dtype != DType.invalid
      where size.is_power_of_two():
          ...
    • On comptime alias declarations:

      comptime PositiveOnly[N: Int]: AnyType where N > 0 = ...
    • To discharge constraints from constrained types appearing anywhere in a signature. A single trailing where will simultaneously constrain the declaration and satisfy the requirements of types used within the same signature:

      struct Matrix[m: Int, n: Int] where m > 0 where n > 0: ...

      def solve_linear_system[n: Int, a: Matrix[n, n], b: Vector[n]]() -> Vector[n]
      where n > 0:
          ...

      If no trailing where discharges a constraint, the compiler reports an error and suggests the missing clause.

  • Added an @unavailable decorator that marks a function or method as intentionally unavailable. Unlike @deprecated (which emits a warning), referencing an @unavailable declaration is an error. Like @deprecated, it accepts either a reason message (positional or as reason=) or a use=symbol replacement. When use=symbol is given, the error includes a fix-it that renames the call to symbol.

    struct Foo:
        @unavailable("message here...")
        def foo(self) -> Int:
            ...

    @unavailable(use=new_api)
    def old_api():
        ...

    def new_api():
        pass
  • Types may now be conditionally "ImplicitlyDestructible" with a where clause:

    @explicit_destroy("Message when implicitly destroyed")
    struct ConditionallyLinearType[T: AnyType](
        ImplicitlyDestructible where conforms_to(T, ImplicitlyDestructible)
    ):
        var data: Self.T
  • Mojo now supports building types that support implicit conversions for widening origins, allowing code like this to "just work" without rebind:

    def origin_superset_conversion(
        a: String, b: String, c: Bool
    ) -> Pointer[String, origin_of(a, b)]:
        if c:  # These pointers implicitly convert.
            return Pointer(to=a)
        else:
            return Pointer(to=b)
  • Types can parameterize the out argument modifier when they want to be bindable to alternate address spaces, for example:

    struct MemType(Movable):
        # Can be constructed into any address space.
        def __init__[addr_space: AddressSpace](out[addr_space] self):
            ...

        # Only constructable into GLOBAL address space.
        def __init__(arg: Int, out[AddressSpace.GLOBAL] self):
            ...
  • ref parameters can now use generic address spaces, for example ref[origin, _]. The generic address space is auto-parameterized onto the function signature.

Language changes

  • Support for "set-only" accessors has been removed. You need to define a __getitem__ or __getattr__ to use a type that defines the corresponding setter. This eliminates a class of bugs in determining the effective element type.

  • The register_passable effect keyword has been removed. Register passability is now computed implicitly from a type's contents, so the explicit keyword is no longer needed and is no longer accepted.

  • Implicit std imports are now an error, following a period of deprecation. Imports from the standard library must now be fully qualified. The compiler thus no longer squats on these module names, paving the way for user modules named algorithm, memory, etc.

  • The handling of the abi effect on @export functions has tightened:

    • Specifying ABI="C" in an @export decorator is now deprecated; abi("C") should be used instead.

      @export("old", ABI="C")
      def old(): pass

      @export("new")
      def new() abi("C"): pass
    • Functions marked @export must now be given an explicit abi effect, rather than relying implicitly on the default (equivalent to abi("Mojo")). The compiler will produce a warning on missing abi effects, which will become an error in a future release.

      Note that the main function is exempt from this requirement. It is always implicitly @exported, and when main is explicitly @exported it is implicitly given the correct ABI. However, if a user explicitly @exports main and also provides an incompatible ABI (for example, raises and abi("C")), an error is still emitted.

    • Functions marked as raises may no longer be given the abi("C") effect or be @exported as such using the deprecated ABI="C" option.

  • where clauses in parameter lists (param-where) are now deprecated. Move where clauses from the parameter list to a trailing where on the declaration. Note that struct and comptime declarations temporarily lose where support until declaration-level where lands.
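
As a sketch of the fully qualified import requirement described in this section, the snippet below contrasts a bare standard library import with its std-qualified form. It reuses the take() and drop() adapters from std.itertools that appear elsewhere in these notes. The commented-out bare import is only an assumed illustration of the rejected spelling, not a quoted compiler test case.

```mojo
# Before (assumed example of the rejected form): bare standard library module name.
# from itertools import take, drop

# After: the import is qualified through the `std` package.
from std.itertools import take, drop


def main():
    var nums = [1, 2, 3, 4, 5]
    # Skip the first element, then keep the next three.
    for x in take(drop(nums, 1), 3):
        print(x)  # 2, 3, 4
```

Because the compiler no longer claims bare names such as algorithm or memory, a user module with one of those names should no longer collide with the standard library.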

Library changes

Type system and traits

  • The ImplicitlyCopyable, Intable, Equatable, Indexer, and Writer traits no longer inherit from ImplicitlyDestructible. Generic code that relied on receiving the destructor bound transitively through these traits (or through Comparable, which inherits from Equatable) must now spell it out explicitly, for example T: ImplicitlyCopyable & ImplicitlyDestructible or T: Indexer & ImplicitlyDestructible. In practice, most generic code should prefer T: Copyable instead, per the guidance in ImplicitlyCopyable's docstring.

  • The __init__ method required by the Movable trait has had its named argument changed from take to move. Explicitly calling a move initializer is now SomeObject(move=) instead of SomeObject(take=).

  • Added is_trivially_movable(), is_trivially_copyable(), and is_trivially_destructible() to std.memory. These helper functions return whether a type's move constructor, copy constructor, or destructor is trivial (that is, a bit-copy or a no-op).
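
The Movable initializer rename above can be sketched as follows. Token is a hypothetical type introduced only for illustration, and the sketch assumes that conforming to Movable provides the move initializer without writing it by hand; if your type defines its own move initializer, the argument name in that definition changes from take to move as well.

```mojo
# `Token` is a hypothetical move-only type used only for this illustration.
struct Token(Movable):
    var id: Int

    def __init__(out self, id: Int):
        self.id = id


def main():
    var a = Token(7)
    # Before: var b = Token(take=a^)
    # After: the keyword argument is now named `move`.
    var b = Token(move=a^)
    print(b.id)
```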

Reflection

  • reflect[T] is now a comptime alias for the Reflected[T] handle type rather than a function returning a zero-sized handle instance. All methods on Reflected[T] are @staticmethods, and the type is no longer constructible. Drop the parens at call sites:

    # Before
    comptime r = reflect[Point]()
    print(r.field_count())
    print(reflect[Point]().name())
    comptime y_handle = reflect[Point]().field_type["y"]()
    var v: y_handle.T = 3.14

    # After
    comptime r = reflect[Point]
    print(r.field_count())
    print(reflect[Point].name())
    comptime y_handle = reflect[Point].field_type["y"]
    var v: y_handle.T = 3.14

    field_type[name] is now a parametric comptime member alias that yields Reflected[FieldT] directly — no trailing (), and the result is fully composable (for example reflect[T].field_type["x"].name()). The previously deprecated free functions get_type_name(), get_base_type_name(), and the struct_field_* family (along with the ReflectedType[T] wrapper) have been removed; use the corresponding methods on reflect[T]:

    Removed                                Replacement
    get_type_name[T]()                     reflect[T].name()
    get_base_type_name[T]()                reflect[T].base_name()
    is_struct_type[T]()                    reflect[T].is_struct()
    struct_field_count[T]()                reflect[T].field_count()
    struct_field_names[T]()                reflect[T].field_names()
    struct_field_types[T]()                reflect[T].field_types()
    struct_field_index_by_name[T, name]()  reflect[T].field_index[name]()
    struct_field_type_by_name[T, name]()   reflect[T].field_type[name]
    struct_field_ref[idx, T](s)            reflect[T].field_ref[idx](s)
    offset_of[T, name=name]()              reflect[T].field_offset[name=name]()
    offset_of[T, index=index]()            reflect[T].field_offset[index=index]()
    ReflectedType[T]                       Reflected[T]
  • Added ReflectedFn[func], a function-side reflection handle accessed via the reflect_fn[func] comptime alias. Exposes function introspection through static methods, paralleling the type-side Reflected[T] API:

    from std.reflection import reflect_fn

    def my_func(x: Int) -> Int:
        return x + 1

    def main():
        print(reflect_fn[my_func].display_name())  # "my_func"
        print(reflect_fn[my_func].linkage_name())  # mangled symbol name

Pointer and memory

  • Added the UntrackedOrigin and UnsafeAnyOrigin origin aliases (and their Mut/Immut variants) as the new names for ExternalOrigin and AnyOrigin, respectively. UntrackedOrigin is the empty origin: it aliases nothing, so the lifetime checker has nothing to track, and it remains a supported tool for interfacing with memory from outside the Mojo program. UnsafeAnyOrigin is the universal origin: it might alias anything, defeating lifetime extension and exclusivity checking, so its Unsafe prefix marks it as an escape hatch slated for deprecation and removal.

    The origin-discarding cast methods on UnsafePointer, TileTensor, and LayoutTensor have correspondingly been renamed from as_any_origin() to as_unsafe_any_origin().

    The old ExternalOrigin, ImmutExternalOrigin, and MutExternalOrigin aliases are now deprecated and emit a deprecation warning when referenced; use UntrackedOrigin, ImmutUntrackedOrigin, and MutUntrackedOrigin respectively instead. The deprecated aliases still forward to the new names, so existing code keeps compiling until they are removed in a future release.

  • Added the layout-aware alloc()/dealloc() allocation API in memory.alloc. alloc() returns an Allocation[T], an owning handle that bundles the allocated pointer with the Layout it was allocated with, and dealloc() consumes that handle to release the storage. A Layout[T] bundles an element count and alignment into a single value, keeping size and alignment requirements explicit and co-located at every call site.

    Allocation, and its bare layout-less counterpart ThinAllocation, are @explicit_destroy types: the compiler forces every allocation to be released on all paths — by passing it to dealloc(), or by taking ownership of the raw pointer with unsafe_leak() — guarding against silent leaks, double-frees, and use-after-free. These APIs are intended to eventually replace the raw-pointer allocation APIs to promote memory safety.

    from std.memory import alloc, dealloc, Layout

    var layout = Layout[Int32](count=4)
    var allocation = alloc(layout)
    var ptr = allocation.unsafe_ptr()
    for i in range(layout.count()):
        (ptr + i).init_pointee_move(i)
    dealloc(allocation^)
  • Added a WeakPointer[T] type to std.memory.arc_pointer, providing weak references to an ArcPointer[T] for building self-referential and cyclically referential data structures that can still be destroyed.

  • UnsafePointer.unsafe_from_address() now has an overloaded constructor that takes an IntLiteral and emits a compile-time assertion if the address is invalid (0 or negative).

  • UnsafeUnion now propagates the address space of its origin instead of defaulting to the GENERIC address space, allowing it to be used with address-space-specific memory such as GPU shared memory.

Collections and iterators

  • The core collection types no longer require their element type to be Copyable; Movable & ImplicitlyDestructible is now the minimum bound. This applies to List[T], Deque[T], LinkedList[T], InlineArray[T, size], both type parameters of Dict[K, V, H] (along with SwissTable/SwissTableEntry/OwnedKwargsDict and the loosened KeyElement trait), and Set[T]. Copy-requiring methods stay gated on Copyable. Counter[V] is unchanged. Dict.setdefault() and Set.add() now take their argument by var T; for move-only types call them as d.setdefault(key^, default) or set.add(value^).

  • List[T] now conditionally conforms to ImplicitlyDestructible: a List is implicitly destructible only when its element type T is. This lets a List hold elements that must be explicitly destroyed (dropping such a List would otherwise leak them), at the cost of a stricter check in generic code.

    Generic code that takes a List by value with only a Movable element bound now fails to compile for every T. Previously the error was deferred and only fired when T was instantiated with a non-ImplicitlyDestructible type. Add & ImplicitlyDestructible to the element bound:

    # Now errors for every `T` (previously only when `T` lacked a destructor):
    def foo[T: Movable, //](var list: List[T]):
        pass

    # Fix: require the destructor bound explicitly.
    def foo[T: Movable & ImplicitlyDestructible, //](var list: List[T]):
        pass

    Structs that store a List are affected the same way. Either constrain the element type, or better yet, propagate the conditional conformance so your type supports explicitly-destroyed elements too, forwarding cleanup through destroy_with():

    # Option 1: require the element type to be implicitly destructible.
    struct Foo[T: Movable & ImplicitlyDestructible]:
        var list: List[Self.T]

    # Option 2: conditionally conform, and forward explicit destruction.
    @explicit_destroy("...")
    struct Foo[T: Movable](
        ImplicitlyDestructible where conforms_to(T, ImplicitlyDestructible),
    ):
        var list: List[Self.T]

        def destroy_with(deinit self, f: Some[def(var Self.T)]):
            self.list^.destroy_with(f)
  • A new BinaryHeap collection has been added to the std.collections module. This is a list-backed binary max-heap.

  • Added nth() as a default method on the Iterator trait. It advances the iterator by n elements (destroying them) and returns the next element, or None if the iterator runs out before reaching index n.

    var l = [10, 20, 30, 40]
    print(iter(l).nth(0).value()) # 10
    print(iter(l).nth(3).value()) # 40
    var missing = iter(l).nth(10) # None (Optional)
  • Added take() and drop() iterator adapters to std.itertools. take(iter, n) yields the first n elements, and drop(iter, n) drops the first n elements. They compose naturally to select sub-ranges of any iterable:

    from std.itertools import take, drop

    var nums = [1, 2, 3, 4, 5]
    for x in take(drop(nums, 1), 3):
        print(x) # 2, 3, 4
  • Added an index() method to LinkedList for finding the first occurrence of a value. Unlike Python's list.index(), it omits the start/stop parameters.

  • Dict now defers its backing-buffer allocation until the first insertion. Default-constructed and capacity=0 dictionaries no longer perform any heap allocations.
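
The relaxed element bound on the core collections described above can be sketched as follows. Handle is a hypothetical move-only type introduced only for this illustration, and the sketch assumes that an ordinary struct (one not marked @explicit_destroy) satisfies ImplicitlyDestructible by default.

```mojo
# `Handle` is a hypothetical move-only type: it is Movable but not Copyable.
struct Handle(Movable):
    var fd: Int

    def __init__(out self, fd: Int):
        self.fd = fd


def main():
    # A List of move-only elements; no Copyable conformance is needed.
    var handles = List[Handle]()
    handles.append(Handle(3))

    # An existing value is transferred into the list with `^`.
    var h = Handle(4)
    handles.append(h^)

    print(len(handles))
    print(handles[0].fd)
```

Methods that need to copy elements remain gated on Copyable, so they are not expected to be callable on a list like this one.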

String and text

  • String.as_bytes_mut() has been renamed to String.unsafe_as_bytes_mut(), to reflect that writing invalid UTF-8 to the resulting Span[Byte] can lead to later issues such as out-of-bounds access.

  • Several StringSlice constructors are now deprecated.

    • StringSlice(ptr=..., length=...) is deprecated; use StringSlice(unsafe_from_utf8=Span(...)) instead.
    • StringSlice(unsafe_from_utf8_ptr=...) (taking a raw nul-terminated UnsafePointer[Byte] or UnsafePointer[c_char]) is deprecated; construct a CStringSlice from the pointer and use the new StringSlice(unsafe_from_utf8=CStringSlice(...)) constructor instead.
  • String and StringSlice now expose a bytes() method that returns a new BytesIter, an iterator over the raw UTF-8 bytes of the string. This complements the existing codepoints() and graphemes() iterators by operating at the byte level without interpreting multi-byte UTF-8 sequences.

    var s = StringSlice("é") # Encoded in UTF-8 as 0xC3 0xA9.
    for b in s.bytes():
        print(b) # 195, 169
  • String and StringSlice now support Unicode-aware subscripting:

    • A keyword-only [codepoint=...] subscript indexes or slices by Unicode codepoint offset, for example String("🔄🔥🔄")[codepoint=1:2] returns "🔥".
    • A [grapheme=...] subscript indexes by grapheme, for example String("👨‍🚀🧑‍🌾क्षि")[grapheme=1] returns "🧑‍🌾".
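
To show why codepoint subscripting differs from byte offsets, here is a minimal sketch that combines the bytes() iterator and the [codepoint=...] subscript described in this section. Whether the subscript can raise is an assumption left open here, so main is marked raises as a precaution.

```mojo
def main() raises:
    var s = String("🔄🔥🔄")

    # Count the raw UTF-8 bytes: each of these three emoji encodes to 4 bytes.
    var byte_count = 0
    for _ in s.bytes():
        byte_count += 1
    print(byte_count)  # 12

    # Slice by Unicode codepoint offset instead of by byte offset.
    print(s[codepoint=1:2])  # "🔥", as in the example above
```

The second codepoint starts at byte offset 4, so the keyword subscript avoids computing that offset by hand.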

Python interoperability

  • The CPython FFI bindings now carry the abi("C") effect. User-written Python extension callbacks passed to def_py_c_function(), def_py_c_method(), or PyCapsule_New() must add abi("C") to their signatures, for example def my_func(self: PyObjectPtr, args: PyObjectPtr) abi("C") -> PyObjectPtr:. Functions registered through the higher-level def_function(), def_method(), and def_staticmethod() paths are unaffected.

  • PythonObject convertibility got simplified and cleaned up. When working with types that required custom conversions to PythonObject, we used to write code like this:

    struct MyCustomType(ConvertibleToPython, ImplicitlyCopyable):
        def to_python_object(var self) raises -> PythonObject:
            return PythonObject( ... custom logic ...)

    def hi_python(a: Some[ImplicitlyCopyable & ConvertibleToPython]) raises:
        print(t"Hi, {a.to_python_object()}!")

    def example():
        hi_python(MyCustomType())

    This approach allows custom types to implement ConvertibleToPython to get a domain specific encoding as a Python object. Mojo has simplified this by making all ConvertibleToPython types implicitly convert to PythonObject, so this can/should be simplified to:

    def hi_python(a: PythonObject) raises:
        print(t"Hi, {a}!")
  • Python -> Mojo FFI calls registered through PythonModuleBuilder and PythonTypeBuilder have significantly reduced per-call overhead:

    • Non-kwargs callables registered with def_function() / def_method() / def_staticmethod() now use CPython's METH_FASTCALL calling convention rather than METH_VARARGS. Kwargs-accepting functions still use METH_VARARGS | METH_KEYWORDS.

    • PythonObject.__del__() skips the PyGILState_Ensure / PyGILState_Release round-trip when the current thread already holds the GIL (checked via PyGILState_Check). On the common Python -> Mojo FFI path (where CPython hands the callee an already-held GIL) the destructor pays just the check and a direct Py_DecRef. The public contract is unchanged—dropping a PythonObject from a thread that doesn't hold the GIL remains safe.

    • Int(py=obj) and Scalar[IntDType](py=obj) fast-path exact Python int via PyLong_AsSsize_t.

  • Extended PyObjectFunction to support 7- and 8-argument signatures, adding the corresponding type aliases and @implicit constructor overloads.

Other library changes

  • The reduction axis of the std.algorithm reductions (sum, product, mean, max, min, and the underlying _reduce_generator plus the CPU and GPU backends) is now a keyword-only compile-time parameter named reduce_dim instead of a runtime argument. Pass it in the parameter list, for example sum[..., reduce_dim=axis](shape, ctx).

  • Added dual_elementwise() to std.algorithm.functional, which executes two elementwise functions over their respective shapes in a single GPU kernel launch, fusing two independent elementwise passes into one.

  • The default seed for random.Random, random.NormalRandom, and the internal _PhiloxWrapper has changed from 0 to 0x3D30F19CD101 (67280421310721) to match PyTorch's at::Philox4_32_10 default. Calls that omitted the seed argument will now produce a different output stream; pass seed=0 explicitly to keep the previous behavior.

  • Added Random.step_uniform_unbiased() and NormalRandom.step_normal_4() primitives to the Philox RNG. step_uniform_unbiased() returns four Float32 values in (0, 1) using all 32 raw bits; step_normal_4() returns four normals from a single Philox step via same-step Box-Muller pairing.

GPU programming

Device context and execution

  • DeviceContext.enqueue_function[func]() and DeviceContext.compile_function[func]() now accept a single kernel argument instead of requiring it to be passed twice. The previous two-argument forms enqueue_function[func, func]() and compile_function[func, func]() are deprecated. The transitional enqueue_function_experimental() and compile_function_experimental() aliases are also deprecated; switch to enqueue_function() / compile_function().

    # Before
    ctx.enqueue_function[my_kernel, my_kernel](grid_dim=1, block_dim=1)
    ctx.enqueue_function_experimental[my_kernel](grid_dim=1, block_dim=1)

    # After
    ctx.enqueue_function[my_kernel](grid_dim=1, block_dim=1)
  • Added DeviceContextList[size] in std.gpu.host: a fixed-size, Copyable/ImplicitlyCopyable/Sized collection of DeviceContext values. Multi-device custom-op execute methods now receive a DeviceContextList[N] — the graph compiler synthesizes one from the per-device contexts attached to the op via a variadic constructor. Kernels can index into it with dev_ctxs[i] (runtime) or dev_ctxs.__getitem_param__[i]() (comptime), and iterate with len(). This replaces the previous DeviceContextPtrList pattern.

    from std.gpu.host import DeviceContext, DeviceContextList

    @compiler.register("mo.distributed.allreduce.sum")
    struct DistributedAllReduceSum:
        @staticmethod
        def execute[
            dtype: DType, rank: Int, target: StaticString, _trace_name: StaticString,
        ](
            outputs: FusedOutputVariadicTensors[dtype=dtype, rank=rank, ...],
            inputs: InputVariadicTensors[dtype=dtype, rank=rank, ...],
            signal_buffers: MutableInputVariadicTensors[dtype=DType.uint8, rank=1, ...],
            dev_ctxs: DeviceContextList,
        ) capturing raises:
            comptime num_devices = inputs.size
            # ... use dev_ctxs[i] per device ...
  • Added std.gpu.host.CompletionFlag, a non-owning handle to an MLRT M::Driver::CompletionFlag (an 8-byte slot in pinned host memory mapped into a device's address space). Pairs with the new DeviceStream.wait_for_host_value(flag, value) method, which stalls the stream until the flag's 64-bit slot equals the given value. Corresponds to CUDA's cuStreamWaitValue64 and captures cleanly into a CUDA graph as a wait-value node, letting a CPU thread (or an AsyncRT worker dispatched by enqueue_host_func()) gate a GPU stream on host-produced data without a second stream or a blocking host-function callback. Currently CUDA-only; other backends raise.

  • Added a DevicePointer struct, a host-side representation of a pointer to device memory that holds a reference to the owning DeviceBuffer and performs bounds checking.

  • Added a max_single_allocation_size query to DeviceContext that reports the largest single allocation the driver will currently service; on Metal it reflects the live Metal framework limit, while CUDA/HIP report available memory.

  • Added a PDLLevel.ON named constant as an alias for PDLLevel(1), for use in place of the numeric PDLLevel(0)/PDLLevel(1) forms.

Layout and coordinates

  • Changed Idx to a comptime alias for ComptimeInt. Use Idx[value] instead of Idx[value]() for compile-time coordinates.

  • Coord, coord(), Idx, ComptimeInt, and related coordinate helpers now live in the standard library module std.utils.coord. The layout.coord module re-exports the same symbols for layout and kernel code; layout also hoists the common names at package scope for convenience.

  • Kernel coordinate APIs now use Coord to preserve compile-time static shape information:

    • elementwise() now passes a Coord to its callback instead of an IndexList[rank], and accepts a Coord shape argument, letting you pass static dimensions. Rewrite callbacks from def func[width, rank, align](idx: IndexList[rank]) to def func[width, align](coord: Coord), and calls from elementwise(func, IndexList[2](...), ctx) to elementwise(func, Coord(...), ctx).

    • Int now conforms to CoordLike, so Int values can be passed directly to Coord constructors without wrapping them in Idx(...).

  • Added nested-layout support (CuTe layout algebra) to Layout and TileTensor. A single .tile[] API now handles both flat and nested parent layouts, and new row_major_nested()/col_major_nested() constructors (plus RowMajorNestedLayout/ColMajorNestedLayout aliases) build re-nested layouts for MFMA register tiles and blocked_product outputs.

  • Added TileTensor.copy_from() and TileTensor.split() for copying between compatible tile views and splitting tiles into static or runtime-sized partitions.

Device targeting and hardware support

  • Added has_nvidia_gpu_accelerator[subarch] and has_nvidia_gpu_accelerator[subarchs] overloads in std.sys.info that combine compile-time and runtime checks for whether the host has an NVIDIA GPU of a given subarchitecture or newer.

  • Added a new std.sys.machine module providing MachineDefinition, along with expanded DeviceSpec and DeviceRef types, for aggregating accelerators and supplying richer static device information during compilation.

  • Added support for the fp8e4m3, fp8e4m3fn, and fp8e5m2 floating-point types on the Metal (Apple GPU) backend, and enabled native Int <-> bfloat16 conversion on Apple M2 (Apple8) Metal GPUs.

  • math.log() for Float32 on NVIDIA GPUs now uses the lg2.approx.ftz.f32 PTX intrinsic, which flushes subnormal inputs and outputs to zero (matching CUDA's __logf) and avoids the slower denormal-handling path.

Tooling changes

  • The mojo package command has been renamed to mojo precompile. Similarly, the .mojopkg file extension has been deprecated; favor the .mojoc file extension instead.

    # Before
    mojo package my_package -o my_package.mojopkg

    # After
    mojo precompile my_package -o my_package.mojoc
  • mojo precompile now produces significantly smaller .mojoc packages by dropping a redundant serialized copy of each module's parser output, which also reduces package compile and load time.

  • Added mojo --print-cache-location and mojo --clear-cache for inspecting and clearing the on-disk Mojo compile cache (.mojo_cache). The resolved path honors the existing precedence (MODULAR_CACHE_DIR, MODULAR_HOME, MODULAR_DERIVED_PATH, XDG_CACHE_HOME, etc.). --clear-cache prompts for confirmation by default; pass -f (or --force) to skip the prompt for scripting use.

    $ mojo --print-cache-location
    /home/you/.cache/modular/.mojo_cache

    $ mojo --clear-cache
    This will remove the Mojo compile cache at:
    /home/you/.cache/modular/.mojo_cache
    Proceed? [y/N] y
    Removed /home/you/.cache/modular/.mojo_cache

    $ mojo --clear-cache -f # no prompt
  • Importing a Mojo module from Python no longer fails when the module lives in a read-only directory (for example, a Mojo extension installed into a read-only site-packages). Previously the importer always tried to write its compiled artifacts to a __mojocache__ directory next to the source, which raised an OSError on a read-only file system. The importer now keeps that in-tree behavior when the source directory is writable, and otherwise redirects the cache to the Modular cache folder. That location honors the standard Modular configuration: the cache_dir key in modular.cfg, the MODULAR_CACHE_DIR and MODULAR_HOME environment variables, and the XDG base directory specification.

  • The mojo compiler will now print the filename and line number in diagnostics that point to inaccessible source locations (for example, from precompiled libraries) instead of a location at the top of the main file:

    # Before
    $> mojo example.mojo
    /path/to/example.mojo:33:16: error: invalid call to '__setitem__': violated constraint
    vec[base + i] = values[i].cast[dtype]()
    ~~~^~~~~~~~~~

    /path/to/example.mojo:1:1: note: constraint declared here evaluated to False, expected 'mut'
    from std.algorithm.functional import elementwise
    ^
    /path/to/example.mojo:1:1: note: function declared here
    from std.algorithm.functional import elementwise
    ^

    # After
    $> mojo example.mojo
    /path/to/example.mojo:33:16: error: invalid call to '__setitem__': violated constraint
    vec[base + i] = values[i].cast[dtype]()
    ~~~^~~~~~~~~~

    max/kernels/src/layout/layout_tensor.mojo:2092: note: constraint declared here evaluated to False, expected 'mut'
    max/kernels/src/layout/layout_tensor.mojo:2090: note: function declared here
  • The mojo compiler now provides more useful diagnostics in the case that source information is unavailable by synthesizing a declaration and pretty-printing it.

    For example, instead of the following, with no contextual information after the 'here':

    /path/to/file.mojo:2092: note: function declared here:

    The user will now see:

    /path/to/file.mojo:2092: note: function declared here:
    def __setitem__[*Tys: Indexer](self, *args: *Tys.values, *, val: SIMD[dtype, Self.element_size]) where mut

    The coverage and quality of diagnostics in such cases will continue to improve in subsequent releases.

  • The Mojo compiler now reports call-related errors on the operand value that causes the failure, instead of on the call overall. This makes it easier to understand failures in calls with many arguments spread over multiple lines.

  • Improved the clarity and actionability of a wide range of compiler diagnostics—declaration resolution, main(), parser, lexer, signature, and call-emission errors—explaining what is wrong and how to fix it.

  • Improved diagnostics for splatting a VariadicPack into a fixed-arity callee: the compiler now attaches a hint pointing to the supported dispatcher pattern, and a callee(*pack) pack-unpack mismatch now reports both the actual and expected element types instead of only the pack trait.

  • MODULAR_DEBUG=uninitialized-read-check failures now print the kernel source location, dtype, lane index, observed bit pattern, and block/thread indices of each trapping lane, instead of being silenced by the thread-(0,0,0) print gate.

  • mojo format now accepts the bare move-capture form {name^} in closure capture lists. Previously only the equivalent {var name^} form round-tripped through the formatter.

  • The Mojo language server now returns ContentModified instead of InvalidRequest for completion requests that arrive during a reparse, fixing missing completions in clients such as Neovim's built-in LSP client.
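
For reference, the two error codes named in the entry above are defined by the JSON-RPC and Language Server Protocol specifications. The Python sketch below only shows the shape of such an error response. The `error_response` helper and the message text are hypothetical and are not taken from the Mojo language server.

```python
import json

# Error codes as defined by JSON-RPC 2.0 and the LSP specification.
INVALID_REQUEST = -32600
CONTENT_MODIFIED = -32801


def error_response(request_id: int, code: int, message: str) -> str:
    """Hypothetical helper: serialize a JSON-RPC error response."""
    return json.dumps(
        {
            "jsonrpc": "2.0",
            "id": request_id,
            "error": {"code": code, "message": message},
        }
    )


# A completion request that arrives during a reparse would now be
# answered with ContentModified rather than InvalidRequest.
reply = error_response(7, CONTENT_MODIFIED, "document changed during reparse")
```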

  • The LLDB debugger now provides type summaries for Mojo's PythonObject (showing the underlying Python type name and decoding common built-ins such as None, bool, int, float, and item counts for list/tuple/dict) and for Dict[K, V] (showing (size N) and exposing live entries in insertion order with their keys and values).

Removed

  • The legacy fn keyword now produces an error instead of a warning. Please move to def.

  • The DeviceContextPtr and DeviceContextPtrList types have been removed from std.runtime.asyncrt. Custom-op execute methods now take DeviceContext directly (or Optional[DeviceContext] where the context is genuinely optional), and multi-device ops take DeviceContextList[N] (see the new entry under Library changes). The helpers get_device_context() and get_optional_device_context() are no longer needed — pass the DeviceContext through directly. The CpuDeviceContext runtime always supplies a real context for the CPU path, so the nullable wrapper is no longer required.

    # Before
    from runtime.asyncrt import DeviceContextPtr, DeviceContextPtrList

    @compiler.register("my_op")
    struct MyOp:
        @staticmethod
        def execute[target: StaticString](
            output: OutputTensor,
            input: InputTensor,
            ctx: DeviceContextPtr,
        ) raises:
            var gpu_ctx = ctx.get_device_context()
            ...

    # After
    from gpu.host import DeviceContext

    @compiler.register("my_op")
    struct MyOp:
        @staticmethod
        def execute[target: StaticString](
            output: OutputTensor,
            input: InputTensor,
            ctx: DeviceContext,
        ) raises:
            ...

  • Removed DeviceContext.compile_function_unchecked() and DeviceContext.enqueue_function_unchecked(). Use the checked compile_function() and enqueue_function() instead.

  • Several parameters and overloads were removed from elementwise and the reduction APIs:

    • The use_blocking_impl parameter has been removed from elementwise (in std.algorithm.functional), and the analogous single_thread_blocking_override parameter from the reduction APIs (reduce, max, min, sum, product, mean in std.algorithm.reduction). These operations now always dispatch work the same way, with a single worker used automatically when the problem size is small, so the blocking variants are no longer needed.

    • The pdl_level parameter has been removed from elementwise, dual_elementwise, and the GPU elementwise implementations. PDL level 1 is now always used.

    • The two Optional[DeviceContext] overloads of elementwise (in std.algorithm.functional) have been removed; callers now thread a non-optional DeviceContext through directly. The CpuDeviceContext runtime always supplies a real context for the CPU path, so the nullable wrapper is no longer needed.

  • The deprecated free-function reflection API in std.reflection has been removed. Use the unified reflect[T] API instead; see Reflection under Library changes for the full migration table.

  • Several previously-deprecated APIs have been removed:

    • The constrained[cond, msg]() function. Use comptime assert cond, msg instead.

    • The Int-returning overload of normalize_index(). Use the UInt-returning overload (or write the index arithmetic inline, for example x[len(x) - 1]).

    • The default UnsafePointer() null constructor. To model a nullable pointer use Optional[UnsafePointer[...]]. For a non-null placeholder for delayed initialization, use UnsafePointer.unsafe_dangling().

  • The -kgenModule flag has been removed from mojo precompile. It emitted a serialized KGEN module (.mlirbc) instead of a .mojoc package and was only used internally.

Fixed

  • Fixed a GPU reduction correctness bug that produced wrong results for a contiguous last-axis reduction (for example mean over the last axis) once the number of rows reached 256 * sm_count (37888 rows on a 148-SM GPU). An N-D reduction is normalized to a rank-3 (outer, reduce, inner) shape, so a last-axis reduction has a trailing inner == 1 dimension; the kernel launcher treated that as a non-contiguous reduction and, once the device was thread-saturated, dispatched a kernel whose cross-row SIMD packing is only valid when a real inner dimension supplies the adjacent rows. Contiguity is now derived from the layout (the reduce dimension is innermost whenever every dimension after it is unit-sized).
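
The shape reasoning in the entry above can be sketched in Python. This is an illustration of the rule as stated, not the kernel launcher's code. `normalize_to_rank3` and `reduce_dim_is_innermost` are hypothetical names, and collapsing the leading and trailing dimensions by their products is an assumption about how the rank-3 form is built.

```python
import math


def normalize_to_rank3(shape, axis):
    """Collapse an N-D shape into (outer, reduce, inner) around `axis`."""
    outer = math.prod(shape[:axis])       # product of dims before the axis
    inner = math.prod(shape[axis + 1:])   # product of dims after the axis
    return outer, shape[axis], inner


def reduce_dim_is_innermost(shape, axis):
    """True when every dimension after `axis` is unit-sized."""
    return all(d == 1 for d in shape[axis + 1:])


# Row count at which the bug appeared on a 148-SM GPU: 256 * sm_count.
SATURATION_ROWS = 256 * 148

# A last-axis reduction over (rows, cols) has a trailing inner == 1.
rank3 = normalize_to_rank3((SATURATION_ROWS, 4096), axis=1)
```

Here `rank3` has `inner == 1`, and `reduce_dim_is_innermost` reports the reduction as contiguous, while a middle-axis reduction such as `(8, 16, 32)` with `axis=1` is reported as non-contiguous.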

  • Reduced the virtual address space reserved by every mojo invocation by ~1 GiB. The JIT memory mapper's reservation granularity was 1 GiB, so each fresh reservation was rounded up to that size and mmapped PROT_READ|PROT_WRITE, inflating VmPeak and counting against Linux RLIMIT_AS. This caused non-deterministic OOM crashes in libKGENCompilerRTShared.so when two mojo processes ran concurrently on memory-constrained CI runners (for example GitHub Actions free-tier, 7 GiB). The granularity is now 64 MiB; large compiles still work because the mapper reserves additional slabs on demand. (Issue #6433)
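
The effect of the granularity change in the entry above is plain round-up arithmetic, sketched here in Python. The 10 MiB request size is a made-up example, and `reserved_bytes` is a hypothetical helper rather than part of the JIT memory mapper.

```python
GIB = 1 << 30
MIB = 1 << 20


def reserved_bytes(request: int, granularity: int) -> int:
    """Round a reservation request up to a multiple of the granularity."""
    return -(-request // granularity) * granularity


request = 10 * MIB                             # hypothetical small reservation
old = reserved_bytes(request, 1 * GIB)         # previous 1 GiB granularity
new = reserved_bytes(request, 64 * MIB)        # current 64 MiB granularity
saved_mib = (old - new) // MIB
```

For this example request, the old granularity reserves a full 1 GiB and the new one reserves 64 MiB, a difference of 960 MiB.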

  • Attempting to import a source Mojo package from a broken symlink no longer crashes the compiler. (Issue #6424)

  • Fixed a bug where from . import module was rejected with a spurious recursive-reference error.

  • MODULAR_NVPTX_COMPILER_PATH is now part of the Mojo cache location, so that switching to a different ptxas no longer reuses CUBIN cache entries that were generated before the switch. (Issue #6549)

  • Fixed a lifetime-checker bug where destroying a type that captures origins (its destructor can access those origins) failed to extend the referenced value's lifetime beyond the __del__ call.

  • Fixed linear/no-return function interaction so that read arguments and other values are no longer required to be live after a no-return call (for example abort()), reducing code size and eliminating spurious linear-type errors.

  • Fixed several compiler crashes and miscompiles in parameter inference and trait casting, including type-value convertibility incorrectly rejected across an upcast and a Downcast between traits dropping the original type's trait conformance.

  • Fixed several trait-inheritance and conformance bugs: refining traits that inherited a defaulted associated alias (whose default referenced Self, or that was declared abstract by the refined trait) were rejected or crashed; a where conforms_to(T, Trait) clause did not propagate to later parameter matching; loading trait functions from bytecode could crash; a precompiled-package stub closure trait cached before its full definition produced a "closure trait missing call" crash; and passing a struct with a compatible __call__ method to a function-trait parameter crashed instead of auto-conforming or producing a proper error. (Issue #6354)

  • Fixed a bug where passing a function literal to a parameter typed as a sugared closure trait (for example comptime CallbackType = def(Int) -> Int) failed to inflate the literal to the requisite trait conformer.

  • Fixed a bug where passing a struct larger than 16 bytes to a Mojo callback decorated with abi("C") failed on targets like x86-64 because the callback was missing the required byval flags. (Issue #6511)

  • Fixed a code-generation bug where an indirect tail call to an abi("C") function returning a struct via sret could read uninitialized memory; the tail attribute is now correctly removed for such indirect calls.

  • The compiler now prefers a .mojoc module over a stale .mojopkg of the same name when both live side by side in a directory, avoiding errors from picking up an older build.

  • Fixed misdiagnosis when the compiler failed to synthesize an implicit copy constructor because a field is not ImplicitlyCopyable, and corrected the conditional-conformance backup path to check ImplicitlyCopyable rather than Copyable.

  • Re-enabled error reporting in the elaborator that had previously been disabled.

  • Splatting a non-VariadicPack value (for example print(*l) where l is a List[Int]) now emits a clear error instead of crashing the parser. (Issue #6350)

  • Control-flow statements (if, for, while, try, with, and their comptime forms) placed directly in a struct, trait, or extension body, or at module scope, now emit a parse error instead of crashing the compiler.

  • Fixed a crash in mojo doc when emitting diagnostics for declarations without valid source locations (for example, from bytecode packages).

  • Fixed TileTensor.raw_store() not forwarding its width parameter to the underlying UnsafePointer.store() call.

  • Fixed a POP-to-LLVM lowering failure ("existing function with conflicting signature") that occurred when a graph composed both an external cubin/PTX launch via enqueue_function() and a matmul; the launch path now casts grid/block dimensions, shared-memory size, and attribute count to UInt32 to match the C ABI.

  • MODULAR_DEBUG=uninitialized-read-check no longer produces false positives for fp8 dtypes, whose legitimate saturate-to-max values were bit-identical to the poison sentinel; the fill and load-site check are now skipped for all fp8 formats (fp16, bf16, fp32, and fp64 are unaffected).

  • Fixed isinf() for finite-only float8 dtypes (float8_e4m3fn, float8_e8m0fnu), which previously fell through to an llvm.is.fpclass intrinsic with no i8 overload and failed during LLVM lowering; isinf() now correctly returns False for these formats at compile time.
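
To see why isinf() can be answered at compile time for a finite-only format, the Python sketch below decodes every float8_e4m3fn bit pattern following the published FP8 E4M3 layout (1 sign bit, 4 exponent bits with bias 7, 3 mantissa bits, and a single NaN encoding per sign). It is an independent illustration, not Mojo's implementation, and `decode_e4m3fn` is a hypothetical helper.

```python
import math


def decode_e4m3fn(bits: int) -> float:
    """Decode one 8-bit float8_e4m3fn pattern to a Python float."""
    sign = -1.0 if bits & 0x80 else 1.0
    exp = (bits >> 3) & 0xF
    man = bits & 0x7
    if exp == 0xF and man == 0x7:
        return math.nan                              # the only NaN encoding
    if exp == 0:
        return sign * (man / 8.0) * 2.0 ** (1 - 7)   # zero and subnormals
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 7)


values = [decode_e4m3fn(b) for b in range(256)]
has_inf = any(math.isinf(v) for v in values)
max_finite = max(v for v in values if not math.isnan(v))
```

No bit pattern decodes to infinity, and the largest finite value is 448, so an isinf() check on this format has only one possible answer.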

  • Fixed several Apple GPU (Metal) backend code-generation bugs: illegal generic-address-space accesses when unpacking an OptionalReg[T] containing pointer fields, bfloat16 arithmetic on M1 (Apple7) and M2 (Apple8) GPUs that lack native support, and an unlowerable max() intrinsic on SIMD float vectors.

  • Fixed AMD RDNA GPU architecture detection at compile time by removing the amdgpu: prefix from the AMD RDNA GPU architecture patterns, which had caused compilation to fail on AMD RDNA GPUs.

  • Fixed the UnsafePointer implicit constructor. When a function taking an UnsafePointer[mut=False, ...] was passed a mutable pointer, overload resolution chose the wrong constructor, so the new origin became ImmutableAnyOrigin. This occasionally hid mutability aliasing between pointers and masked some unused variables. The constructor now correctly casts to ImmutOrigin(Self.origin).

Special thanks

Special thanks to our community contributors:

Adam Kruger (@lightofbaldr), Deftera (@Deftera186), Dylan Stark (@dylan-stark), Gabriel de Marmiesse (@gabrieldemarmiesse), Giorgos Smyridis (@gsmyridis), Jongmin Park (@GzuPark), Keven Villeneuve (@kevenv), Mahendra Singh Rathore (@mahendrarathore1742), Manuel Saelices (@msaelices), martinvuyk (@martinvuyk), Mose Schmiedel (@moseschmiedel), Olcmyk (@Olcmyk), Piper (@piperchester)