
v25.2 (2025-03-25)

✨ Highlights

  • Check out the new GPU basics section of the Mojo Manual and the Get started with GPU programming with Mojo and the MAX Driver tutorial for a guide to getting started with GPU programming in Mojo!

  • Some APIs in the gpu package were enhanced to simplify working with GPUs.

    • If you're executing a GPU kernel only once, you can now skip compiling it first before enqueueing it, and pass it directly to DeviceContext.enqueue_function().

    • The three separate methods on DeviceContext for asynchronously copying buffers between host and GPU memory have been combined into a single overloaded enqueue_copy() method, and the three separate methods for synchronous copies have been combined into an overloaded copy_sync() method.

    • The gpu.shuffle module has been renamed to gpu.warp to better reflect its purpose.

    • The gpu package API documentation has been expanded, and API documentation for the layout package is underway, beginning with core types, functions, and traits.

    See the Standard library changes section of the changelog for more information.

  • The legacy borrowed/inout keywords and -> T as foo syntax are no longer supported and now generate a compiler error. Please move to read/mut/out argument syntax instead. See Argument conventions in the Mojo Manual for more information.

  • The standard library has many changes related to strings. Notably, the Char type has been renamed to Codepoint, to better capture its intended purpose of storing a single Unicode codepoint. Additionally, related method and type names have been updated as well. See Standard library changes for more details.

  • Support has been added for 128- and 256-bit signed and unsigned integers. This includes the DType aliases DType.int128, DType.uint128, DType.int256, and DType.uint256, as well as SIMD support for 128- and 256-bit signed and unsigned element types. Note that this exposes capabilities (and limitations) of LLVM, which may not always provide high performance for these types and may have missing operations like divide, remainder, etc. See Standard library changes for more details.

Language changes

  • References to aliases in struct types with unbound (or partially bound) parameter sets are now allowed as long as the referenced alias doesn't depend on any unbound parameters:

    struct StructWithParams[a: Int, b: Int]:
        alias a1 = 42
        alias a2 = a+1

    fn test():
        _ = StructWithParams.a1 # ok
        _ = StructWithParams[1].a2 # ok
        _ = StructWithParams.a2 # error, 'a' is unbound.
  • The Mojo compiler now warns about @parameter for loops with a large unrolling factor (>1024 by default), which can lead to long compilation times and large generated code size. Set --loop-unrolling-warn-threshold to change the threshold, or set it to 0 to disable the warning.

  • The Mojo compile-time interpreter can now handle many more LLVM intrinsics, including ones that return floating point values. This allows functions like round() to be constant folded when used in a compile-time context.

  • The Mojo compiler now has only one compile-time interpreter. It had two previously: one to handle a few cases that were important for dependent types in the parser (but which also had many limitations), and the primary one that ran at "instantiation" time which is fully general. This was confusing and caused a wide range of bugs. We've now removed the special case parse-time interpreter, replacing it with a more general solution for dependent types. This change should be invisible to most users, but should resolve a number of long-standing bugs and significantly simplifies the compiler implementation, allowing us to move faster.

Standard library changes

  • Optional, Span, and InlineArray have been added to the prelude. You no longer need to explicitly import these types to use them in your program.

  • GPU programming changes:

    • You can now skip compiling a GPU kernel first before enqueueing it, and pass it directly to DeviceContext.enqueue_function():

      from gpu.host import DeviceContext

      fn func():
          print("Hello from GPU")

      with DeviceContext() as ctx:
          ctx.enqueue_function[func](grid_dim=1, block_dim=1)

      However, if you're reusing the same function and parameters multiple times, this incurs some overhead of around 50-500 nanoseconds per enqueue. So you can still compile the function first with DeviceContext.compile_function() and pass it to DeviceContext.enqueue_function() like this:

      with DeviceContext() as ctx:
          var compiled_func = ctx.compile_function[func]()
          # Multiple kernel launches with the same function/parameters
          ctx.enqueue_function(compiled_func, grid_dim=1, block_dim=1)
          ctx.enqueue_function(compiled_func, grid_dim=1, block_dim=1)
    • The following methods on DeviceContext:

      • enqueue_copy_to_device()
      • enqueue_copy_from_device()
      • enqueue_copy_device_to_device()

      have been combined into a single overloaded enqueue_copy() method. Additionally, the methods:

      • copy_to_device_sync()
      • copy_from_device_sync()
      • copy_device_to_device_sync()

      have been combined into an overloaded copy_sync() method.

    • The gpu.shuffle module has been renamed to gpu.warp to better reflect its purpose. For example:

      import gpu.warp as warp

      var val0 = warp.shuffle_down(x, offset)
      var val1 = warp.broadcast(x)
  • Support has been added for 128- and 256-bit signed and unsigned integers.

    • The following aliases have been added to the DType struct: DType.int128, DType.uint128, DType.int256, and DType.uint256.

    • The SIMD type now supports 128- and 256-bit signed and unsigned element types. Note that this exposes capabilities (and limitations) of LLVM, which may not always provide high performance for these types and may have missing operations like divide, remainder, etc.

    • The following Scalar aliases for 1-element SIMD values have been added: Int128, UInt128, Int256, and UInt256.

  • String and friends:

    • The Char type has been renamed to Codepoint, to better capture its intended purpose of storing a single Unicode codepoint. Additionally, related method and type names have been updated as well, including:

    • StringSlice now supports several additional methods moved from String. The existing String methods have been updated to instead call the corresponding new StringSlice methods:

    • Added a StringSlice.is_codepoint_boundary() method for querying if a given byte index is a boundary between encoded UTF-8 codepoints.

    • StringSlice.__getitem__(Slice) now raises an error if the provided slice start and end positions do not fall on a valid codepoint boundary. This prevents construction of malformed StringSlice values, which could lead to memory unsafety or undefined behavior. For example, given a string containing multi-byte encoded data, like:

      str_slice = "Hi👋!"

      and whose in-memory and decoded data looks like:

      String                Hi👋!
      Codepoint Characters  H    i    👋                  !
      Codepoints            72   105  128075              33
      Bytes                 72   105  240  159  145  139  33
      Index                 0    1    2    3    4    5    6

      attempting to slice bytes [3, 5) with str_slice[3:5] would previously produce a malformed StringSlice that did not decode to anything:

      String                invalid
      Codepoint Characters  invalid
      Codepoints            invalid
      Bytes                 159  145
      Index                 0    1

      The same statement will now raise an error informing the user that their indices are invalid.

    • The StringLiteral.get[value]() method, which converts a compile-time value of Stringable type to a StringLiteral, has been changed to a function named get_string_literal[value]().
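The codepoint-boundary rule behind StringSlice.is_codepoint_boundary() and the stricter slicing check described above can be illustrated outside Mojo. The following is a minimal Python sketch, not the Mojo implementation: the function names is_codepoint_boundary and checked_slice are hypothetical, and the sketch assumes only the standard UTF-8 rule that continuation bytes have the bit pattern 10xxxxxx.

```python
def is_codepoint_boundary(data: bytes, index: int) -> bool:
    """Return True if `index` falls between two encoded UTF-8 codepoints."""
    if index < 0 or index > len(data):
        return False
    if index == 0 or index == len(data):
        return True
    # Continuation bytes look like 0b10xxxxxx; any other byte starts a codepoint.
    return (data[index] & 0b1100_0000) != 0b1000_0000


def checked_slice(data: bytes, start: int, end: int) -> bytes:
    """Slice `data`, raising if either end splits a codepoint (hypothetical helper)."""
    if not (is_codepoint_boundary(data, start) and is_codepoint_boundary(data, end)):
        raise ValueError("slice indices do not fall on codepoint boundaries")
    return data[start:end]


# Same example string as in the table above.
data = "Hi👋!".encode("utf-8")
boundaries = [i for i in range(len(data) + 1) if is_codepoint_boundary(data, i)]
```

With this string, byte indices 3, 4, and 5 are continuation bytes of the emoji, so checked_slice(data, 3, 5) raises, while checked_slice(data, 2, 6) returns the four bytes that encode 👋.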

  • Collections:

  • The design of the IntLiteral and FloatLiteral types has been changed to maintain their compile-time-only value as a parameter instead of a stored field. This correctly models that infinite precision literals are not representable at runtime, and eliminates a number of bugs hit in corner cases. This is made possible by enhanced dependent type support in the compiler.

  • The Buffer struct has been removed in favor of Span and NDBuffer.

  • The round() function is now fixed to perform "round half to even" (also known as "bankers' rounding") instead of "round half away from zero".

  • The UnsafePointer.alloc() method has changed to produce pointers with an empty Origin parameter, instead of with MutableAnyOrigin. This mitigates an issue with the any origin parameter extending the lifetime of unrelated local variables for this common method.

  • Several more packages are now documented:

    • compile package
    • gpu package
    • layout package is underway, beginning with core types, functions, and traits
  • Added a new sys.is_compile_time() function. This enables you to query whether code is being executed at compile time or not. For example:

    from sys import is_compile_time

    fn check_compile_time() -> String:
        if is_compile_time():
            return "compile time"
        else:
            return "runtime"

    def main():
        alias var0 = check_compile_time()
        var var1 = check_compile_time()
        print("var0 is evaluated at ", var0, " , while var1 is evaluated at ", var1)

    will print var0 is evaluated at compile time, while var1 is evaluated at runtime.
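The difference between the two tie-breaking rules mentioned in the round() entry above can be seen by comparing them side by side. The sketch below uses Python's decimal module purely to illustrate the rounding modes; it says nothing about how Mojo implements round(), and the helper names round_half_even and round_half_away are hypothetical.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP


def round_half_even(x: str) -> int:
    """Round to the nearest integer, sending ties to the even neighbor."""
    return int(Decimal(x).quantize(Decimal("1"), rounding=ROUND_HALF_EVEN))


def round_half_away(x: str) -> int:
    """Round to the nearest integer, sending ties away from zero."""
    # decimal's ROUND_HALF_UP breaks ties away from zero.
    return int(Decimal(x).quantize(Decimal("1"), rounding=ROUND_HALF_UP))


ties = ["0.5", "1.5", "2.5", "-2.5"]
half_even = [round_half_even(t) for t in ties]  # new behavior described above
half_away = [round_half_away(t) for t in ties]  # previous behavior
```

The two rules differ only on exact ties: 0.5, 2.5, and -2.5 go to 0, 2, and -2 under round half to even, but to 1, 3, and -3 under round half away from zero. Code that relied on the old tie behavior may need adjusting.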

Tooling changes

  • Mojo API documentation generation is now able to display function and struct parameter references inside nested parametric types using names instead of indices. For example, instead of

    sort[type: CollectionElement, //, cmp_fn: fn($1|0, $1|0) capturing -> Bool](span: Span[type, origin])

    it now displays

    sort[type: CollectionElement, //, cmp_fn: fn(type, type) capturing -> Bool](span: Span[type, origin])

❌ Removed

  • Use of legacy argument conventions like inout and the use of as in named results now produces an error message instead of a warning.

  • Direct access to List.size has been removed. Use the public API instead.

    Examples:

    Extending a List:

    base_data = List[Byte](1, 2, 3)

    data_list = List[Byte](4, 5, 6)
    ext_data_list = base_data.copy()
    ext_data_list.extend(data_list) # [1, 2, 3, 4, 5, 6]

    data_span = Span(List[Byte](4, 5, 6))
    ext_data_span = base_data.copy()
    ext_data_span.extend(data_span) # [1, 2, 3, 4, 5, 6]

    data_vec = SIMD[DType.uint8, 4](4, 5, 6, 7)
    ext_data_vec_full = base_data.copy()
    ext_data_vec_full.extend(data_vec) # [1, 2, 3, 4, 5, 6, 7]

    ext_data_vec_partial = base_data.copy()
    ext_data_vec_partial.extend(data_vec, count=3) # [1, 2, 3, 4, 5, 6]

    Slicing and extending a list efficiently:

    base_data = List[Byte](1, 2, 3, 4, 5, 6)
    n4_n5 = Span(base_data)[3:5]
    extra_data = Span(List[Byte](8, 10))
    end_result = List[Byte](capacity=len(n4_n5) + len(extra_data))
    end_result.extend(n4_n5)
    end_result.extend(extra_data) # [4, 5, 8, 10]
  • InlinedFixedVector and InlineList have been removed. Instead, use InlineArray when the upper bound is known at compile time. If the upper bound is not known until runtime, use List with the capacity constructor to minimize allocations.

🛠️ Fixed

Special thanks

Special thanks to our community contributors: @bgreni, @fnands, @illiasheshyn, @izo0x90, @lydiandy, @martinvuyk, @msaelices, @owenhilyard, @rd4com, @yinonburgansky