# Mojo v1.0.0b1

## ✨ Highlights

### Documentation
- The compilation targets doc explains how to inspect your current platform, select a target configuration, and generate code for that target. Use it to build for your own system or to target other CPUs, operating systems, and accelerators.
- The Mojo language reference covers lexical elements, expressions, statements, numeric types, struct declarations, and trait declarations.
- The functions reference page improves discoverability of new function features.
- Split the operators manual into separate pages; refreshed its coverage and added a tutorial, operator tests, and a new reference page.
- Added negative examples and error messages to reference pages to highlight sharp edges of the language.
- The MLIR reference page introduces developers to writing inline MLIR in Mojo code.
- Added docs for non-nullable pointers, with sample code showing how to use `Optional` with `UnsafePointer`.
- Separated the Mojo layout library docs from the MAX kernels library docs, reflecting that the layout library ships with the `mojo` package while the rest of the kernels library ships with the `max` package.
- Added a new guide for building Mojo packages, currently covering the `rattler-build` workflow.
- Restructured the Mojo and MAX system requirements and GPU compatibility docs. Replaced the three-tier GPU support model with a two-level "Continuously tested" / "Known compatible" taxonomy, added a dedicated Mojo GPU compatibility page with per-vendor hardware tables and driver requirements, and simplified the main requirements pages.
- Added a Mojo manual reference page for the `@doc_hidden` decorator.
## Language enhancements
- Added type refinement based on compile-time assumptions, enabling Mojo to narrow types from `where` clauses, `comptime if` statements, and `comptime assert` statements. Refinements in a scope are driven by `conforms_to()` expressions.

  Before:

  ```mojo
  def __contains__(self, value: Self.T) -> Bool where conforms_to(Self.T, Equatable):
      for item in self:
          if trait_downcast[Equatable](item) == trait_downcast[Equatable](value):
              return True
      return False
  ```

  After:

  ```mojo
  def __contains__(self, value: Self.T) -> Bool where conforms_to(Self.T, Equatable):
      for item in self:
          if item == value:
              return True
      return False
  ```

- Improved the clarity and UX of diagnostics for the parser errors that newcomers most often hit.
- Updated signature error diagnostics and added related tests.
- Mojo now uses `NoneType` instead of an empty tuple to mark constructors using literals.

- The ternary `if`/`else` expression now coerces each element to its contextual type when it is obvious. For example, this works instead of producing an error about incompatible metatypes:

  ```mojo
  comptime some_type: Movable = Int if cond else String
  ```

- Unified closure improvements. This release continues the closure unification work begun in earlier releases: stateless closures auto-lift, the `ref` capture convention is supported, default capture conventions can be combined with explicit capture lists, and a new `thin` function effect distinguishes function pointer types from closure traits.

  ```mojo
  def main() raises:
      var a, b, c, d = 1, 2, 3, 4
      var x = "hello"

      # Legacy closure: no capture list. Cannot capture variables.
      def hello():
          print("hi")

      # Unified closure with no captures (stateless). Stateless closures
      # lift to top-level functions and can be passed as FFI callbacks.
      def add_one(n: Int) {} -> Int:
          return n + 1

      # Unified closure with explicit captures and a default capturing
      # convention:
      def my_fn() {mut a, b, c^, read}:
          # capture:
          # `a` by mut reference
          # `b` by immut reference
          # `c` by moving
          # `d` by immut reference (the default `read` convention)
          use(a, b, c, d)

      # Unified closure that captures `x` by ref (carries an
      # origin-mutability parameter):
      def show_x() {ref x}:
          print(x)

      # Function effects come before the capture list. The calling context
      # must handle errors raised from a `raises` closure.
      def fallible() raises {}:
          raise Error("nope")

      # Closures are invoked like ordinary functions:
      hello()
      print(add_one(41))
      my_fn()
      show_x()
      try:
          fallible()
      except e:
          print(e)

      # The `thin` function effect declares a function pointer type
      # (distinct from a closure trait). Stateless closures and top-level
      # functions satisfy `thin`:
      var fn_ptr: def(Int) thin -> Int = add_one
      print(fn_ptr(99))
  ```

- Added `abi("C")` as a function effect for declaring the C calling convention on function definitions and function pointer types. Functions marked with `abi("C")` use the platform C ABI (System V x86-64 / ARM64 AAPCS) for struct arguments and return values, enabling safe interop with C libraries:

  ```mojo
  # C-ABI function definition (safe as a callback into C code)
  def add(a: Int32, b: Int32) abi("C") -> Int32:
      return a + b

  # C-ABI function pointer type (safe for use with DLHandle.get_function)
  var f = handle.get_function[def(Float64) abi("C") -> Float64]("sqrt")
  ```

  `DLHandle.get_function[]` now enforces that the type parameter carries `abi("C")`, preventing silent ABI mismatches when loading C symbols.

- String literals now support `\uXXXX` and `\UXXXXXXXX` Unicode escape sequences, matching Python. The resulting code point is stored as UTF-8. Invalid code points and surrogates are rejected at parse time.

- Added support for conditional `RegisterPassable` conformance.

- Variadic lists and packs can be forwarded through runtime calls with `*pack` when the callee takes a compatible variadic list/pack.

  ```mojo
  def callee[*Ts: Writable](*args: *Ts):
      comptime for i in range(args.__len__()):
          print(args[i])

  def forwarder[*Ts: Writable](*args: *Ts):
      callee(*args)

  forwarder(1, "hello", 3.14)  # prints each value on a separate line
  ```

- Heterogeneous variadic packs can now be specified with a `SomeType` helper function. These two are equivalent:

  ```mojo
  def foo[*arg_types: Copyable](*args: *arg_types) -> Int: ...
  def foo(*args: *SomeTypeList[Copyable]) -> Int: ...
  ```

- T-strings can now be used in `comptime assert` messages:

  ```mojo
  def foo[i: Int]():
      comptime assert i > 5, t"expected i > 5, got {i}"
  ```

- Added `__mlir_deferred_type[...]` for declaring parameter-dependent return types and operation result types in inline MLIR. The parser combines the bracketed pieces into a type string, and the elaborator builds the concrete MLIR type once parameters are substituted.
## Language changes
- Variadic parameter lists are now passed as `ParameterList`, and type lists as `TypeList`, instead of `!kgen.param_list`. This makes it much more ergonomic to work with these types, e.g. simple logic just works:

  ```mojo
  def callee[*values: Int]():
      var v = 0
      for i in range(len(values)):
          v += values[i]
      for elt in values:
          v += elt
  ```

  Similarly, the `ParameterList`/`TypeList` structs have other methods for transforming the value list. As such, a variety of values from the `Variadic` struct have started moving over to being members of these types.

- All Mojo functions now have a unique "function literal type". In practice, this means that:

  ```mojo
  # type_of(foo) != type_of(bar)
  def foo(): pass
  def bar(): pass
  ```

- The `fn` keyword for function declarations is deprecated. Mojo now emits a compiler warning on uses of `fn`; this will become a compilation error in the next release. Use `def` instead.

- The `unified` keyword has been removed; specify unified-closure semantics with an explicit capture list `{...}` after the function signature. An empty capture list `{}` denotes a unified closure with no captures. Closures without any capture list are legacy closures. Mojo also now warns when a function pointer type omits the `thin` effect; specify `thin` explicitly to silence the warning.

- Removed support for comparing tuples of differing lengths or types. Such comparisons (for example `(1, 2) != (4, 5, 6)`) are now rejected statically by the type system instead of silently returning not-equal.

- `A if comptime(C) else B` now skips elaboration of the dead branch, treating the ternary expression as a compile-time evaluation contract analogous to `comptime if C: A else: B`.

- `@explicit_destroy` is now rejected at parse time when paired with an unconditional `ImplicitlyDestructible` conformance; it remains valid only on conditional (where-clause-constrained) conformances.

- Import statements of the form `from pkg import ...` no longer make `pkg` available to the module.
## Library changes
- Removed explicit `trait_downcast`/`trait_downcast_var` across the standard library sources, now that Mojo applies type refinement from comptime assumptions. Public APIs are unchanged. Updated files:
  - `stdlib/std/builtin/`: `_stubs.mojo`, `bool.mojo`
  - `stdlib/std/collections/`: `deque.mojo`, `dict.mojo`, `inline_array.mojo`, `linked_list.mojo`, `list.mojo`, `optional.mojo`, `set.mojo`
  - `stdlib/std/iter/__init__.mojo`, `stdlib/std/itertools/itertools.mojo`
  - `stdlib/std/memory/`: `arc_pointer.mojo`, `owned_pointer.mojo`, `span.mojo`
- Consolidated the reflection APIs in `std.reflection` behind a unified entry point `reflect[T]()` returning a `Reflected[T]` handle. `reflect` is auto-imported via the prelude, so it is available without an explicit import. Methods on the handle replace the family of `struct_field_*` free functions (dropping the `struct_` prefix, since only structs have fields) and the `get_type_name`/`get_base_type_name` free functions:

  ```mojo
  struct Point:
      var x: Int
      var y: Float64

  def main():
      comptime r = reflect[Point]()
      print(r.name())            # "Point"
      print(r.field_count())     # 2
      print(r.field_names()[0])  # x
      comptime y_type = r.field_type["y"]()  # Reflected[Float64]
      print(y_type.name())       # "SIMD[DType.float64, 1]"
      print(reflect[List[Int]]().base_name())  # "List"
      var v: y_type.T = 3.14
  ```

  Methods on `Reflected[T]`: `name[qualified_builtins=]`, `base_name`, `is_struct`, `field_count`, `field_names`, `field_types`, `field_index[name]`, `field_type[name]`, `field_offset[name=]`/`[index=]`, and `field_ref[idx](s)`. The `field_type[name]()` method returns a `Reflected[FieldT]`, so reflection is fully composable.

  The legacy free functions (`struct_field_count`, `struct_field_names`, `struct_field_types`, `struct_field_index_by_name`, `struct_field_type_by_name`, `struct_field_ref`, `is_struct_type`, `offset_of`, `get_type_name`, `get_base_type_name`) and the `ReflectedType[T]` wrapper are now `@deprecated` and delegate to the new API. They will be removed in a future release.

- Added `struct_field_ref[idx, T](ref s)` to `std.reflection` for accessing struct fields by index without copying. The function returns a reference with the same mutability as `s` and works with both concrete and generic struct types, including parametric indices in `comptime for` loops. The default implementations of `Hashable`, `Equatable`, and `Writable` now use this library function instead of the `__struct_field_ref` magic.

- The `Boolable`, `Defaultable`, and `Writable` traits no longer inherit from `ImplicitlyDestructible`. Generic code that relied on receiving the destructor bound transitively through these traits must now spell it out explicitly, for example `T: Writable & ImplicitlyDestructible`.

- The `Variadic` suite of low-level operations has been refactored and migrated to being members of the `TypeList` and `ParameterList` types, making them more ergonomic to work with and more accessible.

- Atomic operations have moved to a dedicated `std.atomic` module. The `Consistency` type has been renamed to `Ordering`, and its `MONOTONIC` member has been renamed to `RELAXED` to align with conventions used by other languages. Update existing code as follows:

  ```mojo
  # Before
  from std.os import Atomic
  from std.os.atomic import Atomic, Consistency, fence
  _ = atom.load[ordering=Consistency.MONOTONIC]()

  # After
  from std.atomic import Atomic, Ordering, fence
  _ = atom.load[ordering=Ordering.RELAXED]()
  ```

- Added `Optional.destroy_with(destroy_func)`, which destroys an `Optional[T]` in place using a caller-provided destructor. This enables `Optional` to hold element types that are not `ImplicitlyDestructible` (for example, types marked `@explicit_destroy`), mirroring `Variant.destroy_with`. Both `Optional.destroy_with` and `Variant.destroy_with` now accept closures that capture local state in addition to plain function references. Note that `Variant.destroy_with` callers must now pass the destroyed type explicitly (for example, `v^.destroy_with[Int](destroy_func)`) since `T` can no longer be inferred from the closure type.

- `assert_raises` now catches custom `Writable` error types, not just `Error`.

- Added UAX #29 grapheme cluster segmentation to `String` and `StringSlice`. New APIs: `graphemes()` returns a `GraphemeSliceIter` that yields each user-perceived "character" as a `StringSlice`, and `count_graphemes()` returns the grapheme cluster count. This correctly handles combining marks, emoji ZWJ sequences, flag emoji, Hangul syllables, and other multi-codepoint clusters.

- `StringSlice` now supports slicing by grapheme cluster via the `grapheme=` keyword argument, mirroring the existing `byte=` indexer. For example, `s[grapheme=0:3]` returns a `StringSlice` covering the first three grapheme clusters, and `s[grapheme=i:i+1]` extracts the i-th grapheme. Out-of-range ends are clamped to the end of the string; negative indices are not supported. Because grapheme boundaries are discovered by a forward scan, this operation is O(n) in the byte length; prefer `byte=` slicing when you already have byte offsets.

- `GraphemeSliceIter` exposes a new `remaining_byte_length()` accessor that reports the byte length of the iterator's remaining range in O(1). This lets callers compute how many bytes the iterator has produced so far without summing per-grapheme byte lengths, and is now used by `StringSlice[grapheme=...]` for an ~1.4x speedup on ASCII-heavy inputs.

- `GraphemeSliceIter` now supports reverse iteration. `next_back()` and `peek_back()` return the last grapheme cluster in the remaining range, and `StringSlice.graphemes_reversed()`/`String.graphemes_reversed()` return a `GraphemeSliceIter` whose `for`-loop iteration walks clusters from end to start. `next()` and `next_back()` can be interleaved on the same iterator. Reverse iteration costs more per cluster than forward iteration because the UAX #29 state machine is inherently forward-scanning: `next_back()` backs up to a guaranteed grapheme boundary (the start of the string or a Control/CR/LF codepoint) and rescans forward. The safe boundary is cached across reverse calls (a forward `next()` invalidates it), so per-call cost is dominated by forward-scan length: small in text containing line breaks or whitespace, growing with the distance back to such a codepoint in long runs without them.

- Added grapheme-aware algorithms on `String` and `StringSlice`:
  - `grapheme_indices()` returns a `GraphemeIndicesIter` that yields `(byte_offset, grapheme)` pairs, mirroring Rust's `str::grapheme_indices`. Useful for text editors or UIs that need to map cursor byte positions back to grapheme boundaries.
  - `nth_grapheme(n)` returns the n-th grapheme cluster as an `Optional[StringSlice]`, or `None` when `n` is out of range.
  - `split_at_grapheme(n)` returns `Tuple[StringSlice, StringSlice]` holding the prefix `[0, n)` and suffix `[n, count)` of grapheme clusters in a single pass, clamping `n` to the total count.
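  A short sketch tying the grapheme APIs above together (illustrative; the printed counts follow the UAX #29 behavior described in these entries, and the `\u` escape relies on the new Unicode escape support):

  ```mojo
  def main():
      # "a" + U+0301 (combining acute) renders as one user-perceived "á":
      # four codepoints, but only three grapheme clusters.
      var s = String("a\u0301bc")
      print(s.count_codepoints())  # 4
      print(s.count_graphemes())   # 3
      for g in s.graphemes():      # yields "á", "b", "c" as StringSlices
          print(g)
      print(s.nth_grapheme(0).value())  # the full "á" cluster
  ```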
- `count_graphemes()` now takes a fast path over runs of printable ASCII (U+0020..U+007E). Each such byte has GBP `Other`, and two consecutive safe-ASCII bytes always have a grapheme-cluster break between them (GB999), so safe-ASCII runs can be counted at one grapheme per byte without entering the UAX #29 state machine. On pure-ASCII text this is roughly 10x faster (~0.38 ms vs. ~3.85 ms for 1 MB of English), and ~5-6x faster on ASCII-dominant mixed text (Spanish UN charter). Pure non-ASCII text (Arabic, Russian, Chinese) is unchanged.

- Variadics of types have been moved to the `TypeList` struct. One can write operations such as:

  ```mojo
  comptime assert TypeList[Trait=AnyType, Int, String]().contains[Bool]
  ```

- `abort(message)` now includes the call site location in its output. The location is automatically captured and printed alongside the message. You can also pass an explicit `SourceLocation` to override it:

  ```mojo
  abort("something went wrong")
  # prints: ABORT: path/to/file.mojo:42:5: something went wrong

  var loc = current_location()
  abort("something went wrong", location=loc)
  ```

- `abort(message)` now prints its message on NVIDIA and AMD GPUs, including block and thread IDs. Previously, the message was silently suppressed on these GPUs. On Apple GPUs, the message is still silently suppressed for now.

- `SourceLocation` fields (`line`, `col`, `file_name`) are now private. Use the new accessor methods `line()`, `column()`, and `file_name()` instead.

- Fixed default alignment in `TileTensor.load()` and `TileTensor.store()` to use the caller-specified `width` parameter instead of `Self.element_size`.

- Added uninitialized memory read detection for float loads. When compiled with `-D MOJO_STDLIB_SIMD_UNINIT_CHECK=true`, every float load is checked against the debug allocator's poison pattern (the largest finite value of the float type, e.g. `FLT_MAX` for `Float32`). A match triggers `abort()` with a descriptive message. The non-NaN poison pattern lets `nan-check` and `uninit-read-check` coexist (a NaN poison would be flagged by `nan-check` as a legitimate NaN error in kernels that intentionally write only active positions). When disabled (the default), there is zero runtime overhead. For MAX pipelines, set `MODULAR_MAX_DEBUG_UNINITIALIZED_READ_CHECK=true` to enable both the debug allocator and the load-time checks automatically.

- Expanded Apple Silicon GPU support. Apple Metal GPU is now a more capable Mojo target.
  - `print()` and `_printf()` now work on Apple Metal GPUs. Output is chunked through the Metal `os_log` path, with a Float32-only formatter that matches Metal's hardware constraints. `_printf()` currently emits the format string only (not interpolated arguments); `|x| < 1e-7` is truncated to `0.0`, and there is no scientific notation.
  - `external_memory[]()` (dynamic threadgroup memory) is now supported on Apple Silicon. The compiler and runtime bridge CUDA-style extern shared symbols to Metal's `setThreadgroupMemoryLength:atIndex:` model, so existing GPU kernels using `external_memory[]()` work unchanged.
  - Apple M5 MMA intrinsics (`apple_mma_load`, `apple_mma_store`, `_mma_apple`) in `std.gpu.compute.arch.mma_apple` enable hardware matrix multiply-accumulate on Apple GPUs.
  - Added `CompilationTarget.is_apple_m5()` to `std.sys` for detecting Apple M5 targets at compile time; `is_apple_silicon()` now includes M5 in its check.
  - Apple GPU targets now prefer `metal4` features by default when the toolchain supports them, automatically appending `-metal4` to the arch instead of requiring explicit `m5-metal4` selection.
  - Atomic ordering: `release` ordering is not supported on Metal. Apple GPU targets now use `monotonic` (relaxed) atomic ordering by default.
  - Floating-point widths: the compiler now rejects floating-point types wider than 32 bits (`Float64`/`Float80`/`Float128`) for Apple GPU targets, since Metal supports only `Float16` and `Float32`.
- Standard library types now use conditional conformances, replacing previous `_constrained_conforms_to` checks:
  - `Span`: `Writable`, `Hashable`
  - `Tuple`, `Optional`, `Variant`, and `UnsafeMaybeUninit`: `RegisterPassable`
  - `Variant`: `Copyable`, `ImplicitlyCopyable`
  - `Optional`: `DevicePassable` (conditional on element type)
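  As an illustration of what a conditional conformance enables (a hedged sketch; the generic function and the call are illustrative, not taken from the release notes):

  ```mojo
  # Span[T] is Hashable only when T is Hashable, so a generic function
  # bounded on Hashable can now accept spans of hashable elements directly.
  def hash_it[T: Hashable](value: T) -> UInt:
      return hash(value)

  def main():
      var xs = [1, 2, 3]
      # OK: Span[Int] conforms to Hashable because Int does.
      print(hash_it(Span(xs)))
  ```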
- `Tuple` now conditionally conforms to `Defaultable`, so generic `T: Defaultable` code can default-construct tuples when all element types are `Defaultable`.

- `OwnedDLHandle.get_symbol()` now returns `Optional[UnsafePointer[...]]` instead of aborting when a symbol is not found. This allows callers to handle missing symbols gracefully.

- `UnsafePointer` is now non-null by design. See the non-null pointer proposal for the full design and migration timeline.

  The default null constructor `__init__(out self)` and the `__bool__(self)` method are now deprecated, and `UnsafePointer` no longer conforms to `Defaultable` or `Boolable`.

  To migrate, express nullability explicitly with `Optional[UnsafePointer[...]]`, which has the same layout as `UnsafePointer` (the null address is the `None` niche), so nullable pointers remain zero-overhead and can be used across C FFIs.

  ```mojo
  # Before: null default construction
  var ptr = UnsafePointer[Int, origin]()
  # After: express absence with Optional
  var ptr: Optional[UnsafePointer[Int, origin]] = None

  # Before: Bool-based null check
  if ptr:
      use(ptr[])
  # After: check the Optional, then unwrap
  if ptr:
      use(ptr.value()[])
  ```

  If you specifically need a non-null placeholder for a field that will be populated later (for example, a buffer that is allocated on demand), use `UnsafePointer.unsafe_dangling()`, which returns a well-aligned but dangling pointer. Note that `unsafe_dangling()` is not a null sentinel: types that lazily allocate must track initialization separately.

- GPU primitive id accessors (`thread_idx`, `block_idx`, `block_dim`, `grid_dim`, `global_idx`, `lane_id`, `warp_id`, `cluster_dim`, `cluster_idx`, and `block_id_in_cluster`) have migrated from `UInt` to `Int`.

  This is part of a broader migration to standardize on the `Int` type for all sizes and offsets in Mojo. As a related step in the same migration, `TensorCore.load_a()` and `TensorCore.load_b()` now also take `Int` arguments instead of `UInt`.

  To provide a gradual migration path, explicitly typed `*_uint` aliases of the seven non-cluster accessors are available temporarily:

  | Accessor | Legacy `UInt` alias |
  | --- | --- |
  | `thread_idx` | `thread_idx_uint` |
  | `block_idx` | `block_idx_uint` |
  | `block_dim` | `block_dim_uint` |
  | `grid_dim` | `grid_dim_uint` |
  | `global_idx` | `global_idx_uint` |
  | `lane_id` | `lane_id_uint` |
  | `warp_id` | `warp_id_uint` |

  The three cluster accessors (`cluster_dim`, `cluster_idx`, `block_id_in_cluster`) migrated directly without `*_uint` aliases, since their usage was limited.

  Code can preserve its prior `UInt` behavior by using a renaming import of the `*_uint` alias:

  ```diff
  - from std.gpu import thread_idx
  + from std.gpu import thread_idx_uint as thread_idx
  ```

  The temporary `*_int` accessors that briefly existed during the phased migration as a forward-compatibility aid have been removed; use the unprefixed accessors (which now return `Int` by default). The `*_uint` aliases will eventually be deprecated and removed as well.

- Added an `IterableOwned` trait to the iteration module. Types conforming to `IterableOwned` implement `__iter__(var self)`, which consumes the collection and returns an iterator that owns the underlying elements.
  - `List` now conforms to `IterableOwned`.
  - `Optional` now conforms to `IterableOwned`.
  - `Deque` now conforms to `IterableOwned`.
  - `LinkedList` now conforms to `IterableOwned`.
  - `Dict` now conforms to `IterableOwned`.
  - `Set` now conforms to `IterableOwned`.
  - `Counter` now conforms to `IterableOwned`.
  - `InlineArray` now conforms to `IterableOwned`.
  - `Span` now conforms to `IterableOwned` (conditional on `T: Copyable`). The owned iterator yields copies of elements by value.
  - Iterator adaptors (`enumerate`, `zip`, `map`, `peekable`, `take_while`, `drop_while`, `product`, `cycle`, `count`, `repeat`) now conform to `IterableOwned`.
  - Added owned overloads of `enumerate()`, `zip()`, `map()`, `peekable()`, `take_while()`, `drop_while()`, `product()`, and `cycle()` that consume the input iterable.
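  A hedged sketch of consuming iteration via `IterableOwned` (the `__iter__(var self)` signature is from the entry above; the explicit transfer with `^` and the loop spelling are assumptions about how the owned overload is invoked):

  ```mojo
  def main():
      var names = [String("ada"), String("grace")]
      # Transferring `names` into iteration consumes the list via
      # __iter__(var self); the loop then owns each String element,
      # so no copies of the strings are made.
      for name in names^:
          print(name)
      # `names` is no longer usable here: it was consumed above.
  ```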
- `CStringSlice` can no longer represent a null pointer. To represent nullability, use `Optional[CStringSlice]`, which is guaranteed to have the same size and layout as `const char*`, where `NULL` is the empty `Optional`.

- `external_call`'s `return_type` requirement has been relaxed from `TrivialRegisterPassable` to `RegisterPassable`.

- Negative indexing on all stdlib collections has been removed to enable cheap CPU bounds checks by default:
  - `List`
  - `Span`
  - `InlineArray`
  - `String`
  - `StringSlice`
  - `LinkedList`
  - `Deque`
  - `IntTuple`

  Using a negative `IntLiteral` for indexing will now trigger a compile-time error, for example:

  ```text
  /tmp/main.mojo:3:12: note: call expansion failed with parameter value(s): (..., ...)
      print(x[-1])
            ^
  constraint failed: negative indexing is not supported, use e.g. `x[len(x) - 1]` instead
  ```

  Update any `x[-1]` to `x[len(x) - 1]`, following the compiler errors to your call sites as above.

  This does not affect any MAX ops that support negative indexing.
- Bounds checking is now on by default for all collections on CPU, and will show you the call site in your code where you triggered the out-of-bounds access:

  ```mojo
  def main():
      var x = [1, 2, 3]
      print(x[3])
  ```

  ```text
  At: /tmp/main.mojo:3:12: Assert Error: index 3 is out of bounds, valid range is 0 to 2
  ```

  Bounds checking is still off by default on GPU to avoid performance penalties. To enable it for tests:

  ```shell
  mojo build -D ASSERT=all main.mojo
  ```

  To turn off all asserts, including CPU bounds checking:

  ```shell
  mojo build -D ASSERT=none main.mojo
  ```

- `alloc[T](count, alignment)` will now `abort` if the underlying allocation failed.

- Added a `Variadic.contains_value` comptime alias to check whether a variadic sequence contains a specific value at compile time.

- `ArcPointer` now conditionally conforms to `Hashable` and `Equatable` when its inner type `T` does. Both `__eq__` and `__hash__` delegate to the managed value, matching C++ `shared_ptr` and Rust `Arc` semantics. This makes `ArcPointer` usable as a `Dict` key or `Set` element with value-based equality. Pointer identity is still available via the `is` operator.

- `Path` now conforms to `Comparable`, enabling lexicographic ordering and use with `sort()`.

- `range()` overloads that took differently-typed arguments, or arguments that were `Intable`/`IntableRaising` but not `Indexer`, have been removed. Callers should ensure they're passing consistent integral argument types when calling `range()`.

- `Consistency` now has a default constructor that selects `RELEASE` ordering on Apple GPUs and `SEQUENTIAL` on all other targets. All `Atomic` methods and `fence` use this platform-aware default instead of hard-coding `SEQUENTIAL`.

- `NDBuffer` has been fully removed. Please migrate to `TileTensor`.

- Added a generic `__contains__` method to `Span` for any element type conforming to `Equatable`, not just `Scalar` types.

- Fixed `blocked_product` in `tile_layout` to zip block and tiler dimensions per mode, matching the legacy `blocked_product` behavior.

- Added `Span`-based overloads for `enqueue_copy`, `enqueue_copy_from`, and `enqueue_copy_to` on `DeviceContext`, `DeviceBuffer`, and `HostBuffer`, providing a safer alternative to raw `UnsafePointer` for host-device memory transfers.

- `String.__len__()` has been deprecated. Prefer `String.byte_length()` or `String.count_codepoints()`.

- Added `map()` and `and_then()` methods to `Optional`. `map()` transforms the contained value by applying a function, returning `Optional[To]`. `and_then()` chains operations that themselves return an `Optional`, enabling flat-mapping over fallible computations.

  ```mojo
  var o = Optional[Int](42)

  def closure(n: Int) {} -> String:
      return String(n + 1)

  var mapped: Optional[String] = o.map[To=String](closure)
  print(mapped)  # Optional("43")
  ```

- Added `std.memory.forget_deinit()` to enable low-level code to skip the usual requirement to run a destructor for a value. This function should be used rarely, when building low-level abstractions.

- CPU `DeviceContext` expansion. `DeviceContext(api="cpu")` is now usable as a stream-ordered execution context for CPU work, paving the way for NUMA-aware CPU dispatch.
  - Added `DeviceContext.enqueue_cpu_function()` and `DeviceContext.enqueue_cpu_range()` for stream-ordered execution of host functions on CPU `DeviceContext` instances. `enqueue_cpu_function` enqueues a single host function; `enqueue_cpu_range` enqueues a parallel range whose tasks run concurrently but are stream-ordered relative to surrounding work. Argument passing is not yet supported.
  - `parallelize`, `parallelize_over_rows` (in `std.algorithm.backend.cpu.parallelize`), and the `elementwise` overloads in `std.algorithm.functional` now accept an optional trailing `ctx: Optional[DeviceContext] = None`. When supplied, the provided CPU `DeviceContext` is forwarded to `sync_parallelize`; when omitted, the previous behavior is preserved.
  - Added a `parallelism_level()` overload that takes a CPU `DeviceContext` and returns the thread-pool size for that specific context, enabling NUMA-specific introspection.
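  A minimal sketch of the CPU stream-ordered path (hypothetical usage: the import path, the closure spelling, and the `synchronize()` call are assumptions not spelled out in this entry):

  ```mojo
  def main() raises:
      # A DeviceContext backed by the CPU; work enqueued on it is
      # stream-ordered, like GPU work on a CUDA stream.
      var ctx = DeviceContext(api="cpu")

      # Argument passing is not yet supported, so the enqueued function
      # takes no arguments; state must come from captures.
      def work() {}:
          print("stream-ordered CPU task")

      ctx.enqueue_cpu_function(work)
      ctx.synchronize()  # wait for all enqueued work to finish
  ```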
- Readable GPU kernel names in profilers. GPU kernels in the standard library and across MAX kernels (elementwise, GEMV, multistage matmul, attention, convolution, MoE, normalization, quantization, BMM, grouped matmul, SM100 matmul, AMD matmul, communication, and sampling) now expose human-readable names in profiler traces such as Nsight Systems, replacing previously mangled KGEN symbols.
- `tile_io` module for `TileTensor` data movement. Added a `tile_io` module providing `TileTensor` copier traits and copy utilities for moving data between memory hierarchies (DRAM/SRAM). The module includes:
  - `GenericToSharedAsyncTileCopier`, which moves a `TileTensor` from generic memory into shared memory via NVIDIA's `cp.async`. On AMD and Apple GPUs the underlying `async_copy` falls back to synchronous loads/stores.
  - An optional `swizzle: Swizzle` parameter on `GenericToSharedAsyncTileCopier`, mirroring the swizzled write path in `LocalToSharedTileCopier`.
  - A `masked: Bool = False` parameter on `GenericToSharedAsyncTileCopier`. When enabled, out-of-bounds vectors receive a zero-byte copy with zero-fill, matching `LayoutTensor.copy_from_async[is_masked=True, fill=Fill.ZERO]`.
  - An `AsyncTileCopier` trait abstracting copier conformance.
- TMA `gather4` for sparse 2D tensor loads. Added a TMA `gather4` operation on SM100 (Blackwell) for loading 4 non-contiguous rows from a 2D tensor in a single TMA instruction, surfaced as the `cp_async_bulk_tensor_2d_gather4` intrinsic in `std.gpu.memory` and integrated with `TMATensorTile`. The API supports:
  - Full 2D tile sparse loads with arbitrary `tile_height` (multiple of 4) and `tile_width`, replacing the prior 4-row-per-call limit.
  - Arbitrary `row_width`, previously restricted to the swizzle box width. The API automatically computes the box width from the swizzle constraint and supports non-divisible widths via TMA hardware zero-fill on the last column group, so kernels no longer need to hand-code column-group loops.
- 1D TMA instructions for SM90+ NVIDIA GPUs. Added 1D TMA (Tensor Memory Accelerator) instruction support in `std.gpu.memory`. 1D TMA copies do not require a pre-allocated tensormap object on the host, providing greater flexibility than the existing 2D–5D TMA path. New functions: `cp_async_bulk_shared_cluster_global`, `cp_async_bulk_global_shared_cta`, `cp_async_bulk_prefetch`, and `cp_async_bulk_reduce_global_shared_cta` (the 1D counterpart to `cp_async_bulk_tensor_reduce_global_shared_cta`, which reduces floating-point values from shared memory into global memory; ADD only).

- `TileTensor` API extensions.
  - Added `TileTensor.bitcast[target_dtype]()`, which returns a new `TileTensor` viewing the same storage and layout under a different element dtype, replacing the `TileTensor(x.ptr.bitcast[Scalar[T]](), x.layout)` idiom.
  - Added `TileTensor.flat_load` and `TileTensor.flat_store` as raw-flat accessors that read and write the underlying storage at a linear offset, bypassing the tensor's layout.
  - Added a `TileTensor.tile()` overload that takes the tile shape as a runtime/parameter argument, complementing the existing tile APIs.
  - GPU `TileTensor.load()` and `load_linear()` now default `invariant=True` for immutable tensors, enabling the compiler to use `ldg` for read-only memory accesses.
  - Added compile-time bounds checks to `TileTensor`, `ManagedTensorSlice`, and `crd2idx` to catch out-of-range coordinate accesses at compile time.
- Layout library extensions.
  - Added a compile-time `coalesce` function for `TensorLayout`, mirroring the legacy `Layout.coalesce` algorithm (skip shape-1 dims and merge contiguous dims).
  - Added `write_repr_to` to `Layout` for writing a debug representation to a `Writer`.
  - `vectorize` and `distribute` now accept layouts with runtime dimensions.
  - `row_major` now accepts coord-like arguments directly, no longer requiring them to be wrapped in tuples.
  - Introduced weakly compatible layouts, enabling structural compatibility comparisons between layouts and coordinate indices (up to depth 4). Structural equality is now checked via a `comptime assert` rather than a `where` clause.
  - Changed `CoordLike.value()` to return `Scalar[Self.DTYPE]` instead of `Int`, providing a more expressive return type for layout coordinate values.
  - `Coord`, `RowMajorLayout`, and `ColMajorLayout` once again take their parameters as variadic arguments, improving ergonomics when specifying individual coords. Use `*splat` to pass an existing list.
- Several standard library APIs that previously took legacy closures now have unified-closure overloads: `parallelize` and `parallelize_over_rows` (in `std.algorithm.backend.cpu.parallelize`), `bench.bencher`, `DeviceContext.execution_time`, and `DeviceContext.enqueue_function` (the GPU enqueue path, renamed from the previous `enqueue_closure`).

- GPU device APIs.
  - Added support for NVIDIA B300 (`sm_103a`) accelerators. New helpers in `std.sys.info` and `std.gpu.host.info` recognize B300 targets so kernels can dispatch correctly on the Blackwell B300 architecture.
  - Added `DeviceStream.enqueue_host_func(func, user_data)`, exposing the `cuLaunchHostFunc` primitive for Mojo kernels and custom ops. Takes a `thin def(OpaquePointer[MutAnyOrigin]) -> None` callback and an opaque `user_data` pointer. CUDA-only today; non-CUDA backends raise.
  - `DeviceContext` initialization now runs an automatic GPU health check that detects hardware throttling, uncorrectable ECC errors, and zombie VRAM, and fails device creation with an actionable error message on unhealthy GPUs. Added `DeviceContext.run_healthcheck()` to re-invoke the check explicitly. Set `MODULAR_DEVICE_CONTEXT_DISABLE_HEALTHCHECK=true` to disable.
  - Optimized GPU `elementwise` index computation and dispatch with a `use_32bit` fast path, 4× unrolled grid-stride processing, warp-aligned block sizes, and SM100+ single-tile routing.
- AMD GPU intrinsics.
  - Added the `ds_read_tr8_b64` AMD GPU intrinsic in `std.gpu.intrinsics`, performing a 64-bit LDS transpose load of 8-bit elements via `llvm.amdgcn.ds.read.tr8.b64`. Supported on AMD CDNA4+ GPUs.
  - Added a `Scalar[dtype]` overload of `readfirstlane` so callers no longer need bitcast workarounds to broadcast non-`Int32` scalar values across an AMD GPU wavefront.
  - `AMDBufferResource.load_to_lds` in `std.gpu.intrinsics` now lowers to the `.ptr.` form of the AMDGPU buffer-load-to-LDS intrinsic, fixing a strided-layout regression on MLA layouts where `cache_depth != depth` and `head_dim_offset != 0`. A new `async_copies: Bool = False` parameter opts in to attaching the `amdgpu.AsyncCopies` alias scope on the load, enabling LLVM `vmcnt` relaxation.
  - Added a `broadcast=True` parameter to GPU `warp_id()` (and related id accessors) so callers can avoid manual `warp.broadcast(warp_id())` patterns.
- Math, debug, and stdlib APIs.
  - `align_down` and `align_up` now accept generic `SIMD[dtype, width]` integer values, replacing the previous `UInt`-only overloads.
  - Extended `FastDiv` and `mulhi` to support 64-bit integer types, with NVIDIA-specific `llvm.nvvm.mulhi.ull/ll` intrinsics and 128-bit arithmetic on other targets.
  - Added `check_bounds` for collections that asserts on out-of-range indices and reports the user's call site instead of stdlib source.
  - `debug_assert` now accepts a `call_location` parameter, allowing callers to override the reported `SourceLocation` so assertion errors can point to user code rather than stdlib internals.
  - Swapped the ordering arguments of `Atomic.compare_exchange` so `success_ordering` is listed before `failure_ordering`, matching the convention used by C++, Rust, and other languages.
  - `InlineArray`'s storage constructor now uses `debug_assert[assert_mode="safe"]` for the element-count check, so size mismatches are caught by default instead of only with `-D ASSERT=all`.
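The arithmetic behind two of the entries above is language-agnostic. Here is a Python model (the function names mirror the stdlib ones for readability, but these are sketches of the math, not the Mojo signatures):

```python
def align_down(x: int, alignment: int) -> int:
    # Round x down to a multiple of a power-of-two alignment.
    return x & ~(alignment - 1)

def align_up(x: int, alignment: int) -> int:
    # Round x up to a multiple of a power-of-two alignment.
    return (x + alignment - 1) & ~(alignment - 1)

def mulhi_u64(a: int, b: int) -> int:
    # High 64 bits of the full 128-bit product of two unsigned 64-bit
    # integers -- the quantity the llvm.nvvm.mulhi.ull intrinsic computes.
    return ((a & 0xFFFFFFFFFFFFFFFF) * (b & 0xFFFFFFFFFFFFFFFF)) >> 64

print(align_down(70, 64), align_up(70, 64))  # 64 128
print(mulhi_u64(2**63, 4))                   # 2
```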
Tooling changes
- The Mojo debugger now shows a `Variant` variable's active type name and value in LLDB — e.g. `Int(42)` or `String("hello")` — instead of exposing raw `_DefaultVariantStorage` internals.
- The Mojo debugger now displays scalar types (e.g. `UInt8`, `Float32`) as plain values instead of `([0] = value)`, and elides internal `_mlir_value` wrapper fields from struct display.
- `mojo format` no longer supports the deprecated `fn` keyword, nor the removed `owned` argument convention.
- Comptime function calls now print more nicely in error messages and generated documentation, not including `VariadicList`/`VariadicPack` and including keyword argument labels when required.
- LSP and REPL responsiveness.
  - Code completion and signature help in REPL/notebook contexts are now amortized O(1) per request by caching parsed prior cells across requests, eliminating quadratic O(N²) slowdown in long sessions.
  - LSP parse time for files with docstring code blocks (e.g. `dict.mojo`) is roughly 2× faster, using signature-only resolution for transitive dependencies inside docstring code blocks.
  - LSP parse time is further reduced by deferring body resolution of imported bytecode declarations and resolving named imports lazily, avoiding eager pulls of large transitive dependencies.
- `mojo` CLI and toolchain.
  - `mojo --version` now prints a semantic Mojo version (for example, `1.0.0...`) instead of an internal build identifier, and the same version is used wherever the compiler performs version checks.
  - `mojo build --print-supported-targets` now lists registered targets sorted alphabetically, with a graceful empty-list message.
  - The compiler now selects the target's baseline CPU when cross-compiling with `--target-triple` without `--target-cpu` and the host and target architectures differ.
  - ASAN-instrumented Mojo binaries on macOS now use `llvm-symbolizer` instead of `atos`, so stack traces report the full inlined call chain through user functions.
- `mojo doc` and docstring validation.
  - `mojo doc` now preserves parameterized type names (for example `List[K]`, `Optional[V]`, `UnsafePointer[Scalar[dtype]]`) in the API doc JSON `"type"` fields, instead of emitting only the bare base name.
  - `mojo doc` now emits a diagnostic when a public Mojo module has no module-level docstring and `-mojo-diagnose-missing-doc-strings` is active. Private modules and modules nested inside private packages are exempt.
  - Docstring validation no longer requires inferred parameters (those before `//` in a parameter list) to be documented; documenting them remains valid.
  - Docstring validation now accepts `!` and `?` as valid sentence-ending punctuation for summaries, section bodies, and argument descriptions.
  - `def ... raises` functions now require a `Raises:` docstring section like any other raising function, and the `isDef` field has been removed from `mojo doc` JSON output.
- Debugger UX.
  - The Mojo debugger now displays `Optional[T]` variables as `None` or `Some(value)` in LLDB instead of exposing raw `_DefaultVariantStorage` internals.
  - The Mojo debugger now correctly displays `UnsafePointer[T]` values in LLDB for all pointed-to types, including signed integers (no longer rendered as huge unsigned values), `Bool` (`True`/`False`), and floats.
  - The Mojo debugger now displays `StringSlice`, `StaticString`, and their underlying `Span[Byte]` values as quoted strings in LLDB.
  - At `-O0`, trivially destructible types (`Int`, `Float`, `Bool`, `SIMD`, etc.) now remain visible in the debugger through the end of their lexical scope instead of disappearing at the ASAP destruction point.
- `mojo format` (mblack) now correctly parses the new unified-closure syntax including `raises {captures}` effect ordering, and no longer inserts a spurious space between `^` and the operand in `var^` captures.
- Mojo package files (`.mojopkg`) now use format version 2 with zstd-compressed MLIR bytecode, significantly reducing package, wheel, and Docker image sizes.
- Added a `--mojo-version` flag to `mojo-lsp-server` for verifying the Mojo version that the LSP is using.
- Removed the legacy `MOJO_ENABLE_STACK_TRACE_ON_ERROR` and `MOJO_ENABLE_STACK_TRACE_ON_CRASH` environment variables. Instead, set the `MODULAR_DEBUG` environment variable to `stack_trace_on_error` to enable generation of stack traces when a Mojo program raises an error.
GPU programming
- Added support for AMD MI250X accelerators.
❌ Removed
- The `escaping` function effect is no longer supported. Migrate `def(...) escaping -> T` closures to use an explicit capture list `{...}` (see the closure refactor entry under Language enhancements).
- The deprecated `@doc_private` decorator has been removed. Use `@doc_hidden` instead.
- Removed the `store_release`, `store_relaxed`, `load_acquire`, and `load_relaxed` helpers from `std.gpu.intrinsics`. Use `Atomic[dtype, scope=...].store` and `Atomic[dtype, scope=...].load` with the desired `Ordering` instead:

  ```mojo
  # Before
  from std.gpu.intrinsics import store_release, load_acquire

  store_release[scope=Scope.GPU](ptr, value)
  var v = load_acquire[scope=Scope.GPU](ptr)

  # After
  from std.atomic import Atomic, Ordering

  Atomic[dtype, scope="device"].store[ordering=Ordering.RELEASE](ptr, value)
  var v = Atomic[dtype, scope="device"].load[ordering=Ordering.ACQUIRE](ptr)
  ```
- Several constructs deprecated in 26.2 are no longer accepted:
  - The `@register_passable` and `@register_passable("trivial")` decorators are no longer supported. Conform to the `RegisterPassable` and `TrivialRegisterPassable` traits instead. Use of either decorator now produces a hard error pointing to the trait equivalent.
  - The legacy `__moveinit__` and `__copyinit__` method names are no longer auto-rewritten to the unified `__init__` form. Rename these methods to `__init__` with keyword-only `take: Self` and `copy: Self` arguments, respectively, as introduced by init unification in 26.2. Existing legacy spellings now fail to compile with errors such as `no matching function in initialization` rather than being silently rewritten.
- API removals beyond the deprecation removals already noted:
  - Removed the `param_env.mojo` module. Use `defines.mojo` instead.
  - Removed `LinkedList.__getitem__`. Indexing a `LinkedList` is O(n), and exposing `__getitem__` encouraged accidentally quadratic code; iterate the list instead.
  - Removed the unused `UIntSized` trait and its prelude re-export.
  - Removed the `pdl_level` parameter from `elementwise`, `reduction`, and `reducescatter` kernel APIs. PDL usage is now an internal compile-time default.
🛠️ Fixed
- Fixed `math.sqrt` on `Float64` on NVIDIA GPU producing a cryptic `could not find LLVM intrinsic: "llvm.nvvm.sqrt.approx.d"` failure at LLVM IR translation time. `math.sqrt` now rejects `Float64` on NVIDIA GPU at compile time with the message `DType.float64 isn't supported for approx sqrt on NVIDIA GPU`. The existing `math.sin` and `math.cos` constraint messages were also sharpened to name the op (`DType.float64 isn't supported for sin/cos on NVIDIA GPU`). (Issue #6434)
- Fixed pack inference failing with `could not infer type of parameter pack ... given value with unresolved type` when passing list, dict, set, or slice literals to a `*Ts`-bound variadic pack parameter (e.g. `def foo[*Ts: Iterable](*args: *Ts)`). Pack inference now applies the same default-type fallback that single-argument trait-bound parameters already use, so `foo([1, 2, 3], [4, 5, 6])` resolves each literal to its default type (e.g. `List[Int]`) before binding the pack.
- Fixed `mojo` aborting at startup with `std::filesystem::filesystem_error` when `$HOME` is not traversable by the running UID (common in containerized CI where the image's build-time UID differs from the runtime UID). The config search now treats permission errors as "not found" and falls through to the next candidate. (Issue #6412)
- `mojo run` and `mojo debug` now honor `-Xlinker` flags by loading the referenced shared libraries into the in-process JIT. Previously the flags were dropped (with a `-Xlinker argument unused` warning), leaving programs that called into external shared libraries via `external_call` unable to resolve those symbols at runtime (so `mojo build` worked but `mojo run` did not). The supported forms mirror what the system linker accepts: `-Xlinker -L<dir>`, `-Xlinker -l<name>`, `-Xlinker -rpath <dir>`, and `-Xlinker <absolute-path-to-shared-library>`. Flags that have no meaning under JIT are reported as a warning and ignored. (Issue #6155)
- Fixed `libpython` auto-discovery failing for Python 3.14 free-threaded builds. The discovery script constructed the library filename without the ABI flags suffix (e.g. looked for `libpython3.14.dylib` instead of `libpython3.14t.dylib`). (Issue #6366)
- Fixed `RTLD.LOCAL` having the wrong value on Linux. It was set to `4` (`RTLD_NOLOAD`) instead of `0`, causing `dlopen` with `RTLD.NOW | RTLD.LOCAL` to fail. (Issue #6410)
- Fixed `mojo format` crashing after upgrading Mojo versions due to a stale grammar cache. (Issue #6144)
- Fixed `atof` producing incorrect results for floats near the normal/subnormal boundary (e.g., `Float64("4.4501363245856945e-308")` returned half the correct value). (Issue #6196)
- Fixed a compiler crash (`'get_type_name' requires a concrete type`) when using default `Writable`, `Equatable`, or `Hashable` implementations on structs with MLIR-type fields (e.g. `__mlir_type.index`). The compiler now correctly reports that the field does not implement the required trait. (Issue #5872)
- Fixed `Atomic.store` silently dropping the requested `scope`. The previous implementation lowered to `atomicrmw xchg` without forwarding `syncscope`, so `Atomic[..., scope="device"].store(...)` was emitting a system-scope store on NVPTX (extra L2/NVLink fences) and an over-synchronized store on AMDGPU. `Atomic.store` now lowers via `pop.store atomic syncscope(...)`, emitting `st.release.<scope>` on NVPTX and a properly-scoped LLVM atomic store on AMDGPU. The Mojo API surface is unchanged.
- Fixed `Process.run()` not inheriting the parent's environment variables. Child processes spawned via `Process.run()` now correctly receive the parent's environment.
- Fixed `\xhh` and `\ooo` escape sequences in string literals being interpreted as raw bytes instead of Unicode code points, which produced malformed UTF-8 for values `>= 0x80`. The escapes now match Python `str` semantics (and the existing `\u`/`\U` handling): `"\x85"` encodes U+0085 (NEL) as two UTF-8 bytes and `ord("\x85")` returns `133` instead of `5`. Code that relied on `\xhh` to emit a single raw byte for non-ASCII values must construct the bytes explicitly (for example via a `List[Byte]` literal). (Issue #2842)
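Since the entry above says the new behavior matches Python `str` semantics exactly, the expected results can be checked against Python itself:

```python
# \x85 denotes the code point U+0085 (NEL), not a raw 0x85 byte.
s = "\x85"
print(ord(s))                   # 133
print(s.encode("utf-8"))        # b'\xc2\x85' -- two bytes, well-formed UTF-8
print(len(s.encode("utf-8")))   # 2
```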
- Fixed incorrect data layout for `MI250X` AMDGPU architectures. (Issue #6451)
- Fixed Apple Silicon target detection on macOS 26 producing unrecognized arch strings like `metal:2-metal4` when the installed Xcode could not compile Metal 4.0; the `-metal4` suffix is now applied only when the toolchain supports it.
- Fixed `UnsafePointer.gather`, `UnsafePointer.scatter`, and `strided_load` silently reading zero on Apple GPU. The per-lane fallback reconstructed pointers via `unsafe_from_address=Int(addr)`, yielding a generic-address-space pointer the Apple AIR backend could not resolve. The fallback now uses typed pointer arithmetic on Apple GPU; NVIDIA, AMD, and CPU paths are unchanged.
- Fixed `rotate_left` and `rotate_right` intrinsics failing to lower on Apple GPU. Both now lower correctly to the Apple AIR backend.
- Fixed `TileTensor.write_to()` only handling 2D static-shape tensors; 1D, 3D+, nested-layout, and dynamic-shape tensors now print correctly via a generic elementwise fallback, and all ranks use a bracket-delimited, comma-separated format.
- Fixed incorrect alignment in `TileTensor.__getitem__`.
- Fixed `TileTensor` SIMD loads/stores on CPU to use `alignment=1`, preventing segfaults when underlying data is not naturally aligned. GPU still uses aligned access where the layout guarantees alignment.
- Fixed `complement()` in `tile_layout` returning a static shape of `0` when given `UNKNOWN_VALUE` as the size; it now propagates `UNKNOWN_VALUE` so downstream layout algebra falls back to runtime dimensions, restoring correct bounds checks for `LayoutTensor.flatten().vectorize[N]()`.
- Fixed `idx2crd` returning incorrect coordinates for nested layouts.
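For readers unfamiliar with `idx2crd`, the flat (non-nested) case maps a linear index to per-dimension coordinates. A Python sketch of that mapping, assuming the innermost-dim-first (colexicographic) convention common in layout algebra — the convention itself is an assumption here, not taken from the entry above:

```python
def idx2crd(idx: int, shape: tuple) -> tuple:
    # Peel off one dimension at a time, innermost (fastest-varying) first.
    crd = []
    for s in shape:
        crd.append(idx % s)
        idx //= s
    return tuple(crd)

print(idx2crd(5, (2, 3)))  # (1, 2)
```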
- Fixed `mojo --version` printing the MAX version instead of the Mojo compiler version.
- Fixed `comptime` `and`/`or` expressions to accept any `Boolable` operands, matching runtime behavior. This also enables mixed-type expressions like `comptime if some_Bool and some_Optional`.
- Fixed several codegen correctness issues affecting valid Mojo programs: an SRoA miscompile that incorrectly promoted arrays accessed via dynamic offsets through a constant GEP; a use-after-free where destructors of live owned values were inserted before, rather than after, a `lit.ref.store` into a ref with `#lit.any.origin`; silent memory corruption when calling `abi("C")` functions that returned structs via `sret`; and bogus `existing function with conflicting attributes` errors when calling the same external function more than once with an `sret`/`byval` ABI.
- Fixed several `mojo-lsp-server` crashes affecting REPL/notebook contexts, parameter-pack-related diagnostics, files importing from `.mojopkg`, and files using stateless closures. The LSP also no longer mistakes REPL buffer identifiers (which contain a `.mojo` extension) for relative module imports.
- Fixed several debugger display issues: variables after their ASAP destruction point at `-O0` now correctly show "not available" instead of stale values; unsigned integers (`UInt`, `UInt8`, etc.) display with correct unsigned semantics; `ref` loop variables show `index` instead of `pointer<index>`; `String` fields typed as `Scalar[T]` and `Tuple` values display correctly.
- Fixed two `mojo format` (mblack) issues: it no longer loses the `t` prefix when splitting long t-string literals across lines, and no longer inserts a stray space between `*` and a complex operand in variadic pack unpacking annotations.
- Fixed `BitSet.set_all` and `BitSet.toggle_all` writing `~0` to every underlying 64-bit word, including bits beyond the logical `size` when `size` was not a multiple of 64. Those stray high bits were counted by `__len__`, producing incorrect population counts; the methods now mask off the unused high bits.
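The masking described above is the standard final-partial-word trick. A small Python model of the corrected bookkeeping (hypothetical helpers for illustration, not the Mojo implementation):

```python
WORD = 64
FULL = (1 << WORD) - 1  # a word with all 64 bits set (~0 as unsigned)

def set_all_words(size: int) -> list:
    # Fill every backing word, then mask off bits past `size` in the last
    # word so a popcount-based __len__ sees exactly `size` set bits.
    words = [FULL] * ((size + WORD - 1) // WORD)
    if size % WORD:
        words[-1] &= (1 << (size % WORD)) - 1
    return words

def popcount(words: list) -> int:
    return sum(bin(w).count("1") for w in words)

print(popcount(set_all_words(70)))  # 70 -- not 128, as before the fix
```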
- Fixed `syncwarp` on AMD GPUs, which was previously implemented as a no-op. It now lowers to `llvm.amdgcn.wave.barrier`, providing the control-flow synchronization required to correctly sequence shared-memory writes followed by reads across lanes.
- Fixed `isnan`, `isinf`, and `isfinite` failing during LLVM lowering for `float8_e3m4` and `float4_e2m1fn`. `float4_e2m1fn` (no NaN/Inf encodings) folds to constant branches; `float8_e3m4` casts through `bfloat16` to reuse the existing `llvm.is.fpclass` path.