IMPORTANT: To view this page as Markdown, append `.md` to the URL (e.g. /docs/manual/basics.md). For the complete Mojo documentation index, see llms.txt.
Version: Nightly

DeviceGraphBuilder

struct DeviceGraphBuilder

Builder for explicit device graph construction.

A DeviceGraphBuilder is obtained from DeviceContext.create_graph_builder(). Callers add kernel nodes via add_function() and then call instantiate() to produce a reusable DeviceGraph.

Example:

from std.gpu.host import DeviceContext

def kernel(x: Int):
    print("Value:", x)

with DeviceContext() as ctx:
    var compiled_fn = ctx.compile_function[kernel]()
    var builder = ctx.create_graph_builder()
    _ = builder.add_function(compiled_fn, 42, grid_dim=1, block_dim=1, dependencies=[])
    var graph = builder^.instantiate()
    graph.replay()
    ctx.synchronize()

Implemented traits

AnyType, ImplicitlyDestructible, Movable

Methods

__init__

def __init__(out self, *, copy: Self)

Creates a copy of an existing graph builder by incrementing its reference count.

Args:

  • copy (Self): The graph builder to copy.

__del__

def __del__(deinit self)

Releases resources associated with this graph builder.

add_function

def add_function[*Ts: DevicePassable](self, f: DeviceFunction[target=f.target, compile_options=f.compile_options, link_options=f.link_options, _ptxas_info_verbose=f._ptxas_info_verbose], *args: *Ts.values, *, grid_dim: Dim, block_dim: Dim, var dependencies: List[DeviceGraphNode], cluster_dim: OptionalReg[Dim] = None, shared_mem_bytes: OptionalReg[Int] = None, var attributes: List[LaunchAttribute] = List(__list_literal__=NoneType(None)), var constant_memory: List[ConstantMemoryMapping] = List(__list_literal__=NoneType(None))) -> DeviceGraphNode

Adds a type-checked compiled kernel function as a node in this graph.

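Parameters:

  • Ts (*DevicePassable): Types of the arguments passed to the kernel.

Args:

  • f (DeviceFunction): The compiled kernel function to add as a graph node.
  • *args (*Ts): Arguments to pass to the kernel.
  • grid_dim (Dim): Dimensions of the compute grid.
  • block_dim (Dim): Dimensions of each thread block.
  • dependencies (List[DeviceGraphNode]): Explicit list of predecessor node handles. An empty list makes the new node a graph root with no predecessors; a non-empty list uses those exact handles as predecessors.
  • cluster_dim (OptionalReg[Dim]): Cluster dimensions (optional).
  • shared_mem_bytes (OptionalReg[Int]): Amount of dynamic shared memory per block.
  • attributes (List[LaunchAttribute]): Launch attributes.
  • constant_memory (List[ConstantMemoryMapping]): Constant memory mappings.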

Returns:

DeviceGraphNode: A handle to the newly added kernel-dispatch node.

Raises:

If adding the node fails.

def add_function[FuncType: def() -> None, //, dump_asm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], def() capturing -> Path] = False, dump_llvm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], def() capturing -> Path] = False, _dump_sass: Variant[Bool, Path, StringSlice[StaticConstantOrigin], def() capturing -> Path] = False, _ptxas_info_verbose: Bool = False](self, func: FuncType, grid_dim: Dim, block_dim: Dim, *, var dependencies: List[DeviceGraphNode], cluster_dim: OptionalReg[Dim] = None, shared_mem_bytes: OptionalReg[Int] = None, var attributes: List[LaunchAttribute] = List(__list_literal__=NoneType(None)), var constant_memory: List[ConstantMemoryMapping] = List(__list_literal__=NoneType(None))) -> DeviceGraphNode

Compiles and adds a capturing kernel closure as a node in this graph.

This overload is for kernels that capture variables from their enclosing scope using the {var} capture syntax. Compilation is performed automatically using the DeviceContext that created this builder, so no separate compile step is needed.

Example:

from std.gpu import global_idx
from std.gpu.host import DeviceContext

with DeviceContext() as ctx:
    var scale: Float32 = 2.0
    var buf = ctx.enqueue_create_buffer[DType.float32](256)
    var ptr = buf.unsafe_ptr()

    def scale_kernel() {var}:
        var i = global_idx.x
        ptr[i] = Float32(i) * scale

    var builder = ctx.create_graph_builder()
    _ = builder.add_function(
        scale_kernel, grid_dim=1, block_dim=256, dependencies=[]
    )
    var graph = builder^.instantiate()
    graph.replay()
    ctx.synchronize()

Parameters:

Args:

  • func (FuncType): The capturing kernel closure to compile and add as a graph node.
  • grid_dim (Dim): Dimensions of the compute grid.
  • block_dim (Dim): Dimensions of each thread block.
  • dependencies (List[DeviceGraphNode]): Explicit list of predecessor node handles. An empty list makes the new node a graph root with no predecessors; a non-empty list uses those exact handles as predecessors.
  • cluster_dim (OptionalReg[Dim]): Cluster dimensions (optional).
  • shared_mem_bytes (OptionalReg[Int]): Amount of dynamic shared memory per block.
  • attributes (List[LaunchAttribute]): Launch attributes.
  • constant_memory (List[ConstantMemoryMapping]): Constant memory mappings.
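The dependencies rule above (an empty list makes a root, a non-empty list names the exact predecessors) can be illustrated with a small host-side model. This is a plain-Python sketch, not Mojo and not the library's implementation: GraphModel, add, roots, and one_valid_order are hypothetical names introduced only for illustration, and integer handles stand in for DeviceGraphNode.

```python
from collections import deque


class GraphModel:
    """Toy host-side model of a dependency graph (hypothetical, not Mojo API)."""

    def __init__(self):
        self.names = []         # names[i] is the label of node i
        self.predecessors = []  # predecessors[i] lists the handles node i waits on

    def add(self, name, dependencies):
        """Record a node and return its integer handle."""
        self.names.append(name)
        self.predecessors.append(list(dependencies))
        return len(self.names) - 1

    def roots(self):
        """Names of nodes that were added with an empty dependency list."""
        return [self.names[i] for i, deps in enumerate(self.predecessors) if not deps]

    def one_valid_order(self):
        """Return one execution order that respects every edge (Kahn's algorithm)."""
        remaining = [len(deps) for deps in self.predecessors]
        successors = [[] for _ in self.names]
        for node, deps in enumerate(self.predecessors):
            for dep in deps:
                successors[dep].append(node)
        ready = deque(i for i, count in enumerate(remaining) if count == 0)
        order = []
        while ready:
            node = ready.popleft()
            order.append(self.names[node])
            for succ in successors[node]:
                remaining[succ] -= 1
                if remaining[succ] == 0:
                    ready.append(succ)
        return order


g = GraphModel()
fill = g.add("memset", dependencies=[])             # root: no predecessors
upload = g.add("h2d_copy", dependencies=[])         # second root
kernel = g.add("kernel", dependencies=[fill, upload])
download = g.add("d2h_copy", dependencies=[kernel])

print(g.roots())
print(g.one_valid_order())
```

In this model the two roots have no ordering constraint between them, so either may run first; only the edges listed in dependencies constrain the order.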

Returns:

DeviceGraphNode: A handle to the newly added kernel-dispatch node.

Raises:

If adding the node fails.

add_copy

def add_copy[dtype: DType](self, dst_buf: DeviceBuffer[dtype], src_buf: HostBuffer[dtype], *, var dependencies: List[DeviceGraphNode]) -> DeviceGraphNode

Adds a host-to-device memcpy node to the graph.

The number of bytes copied is determined by the size of the device buffer.

Parameters:

  • dtype (DType): Type of the data being copied.

Args:

  • dst_buf (DeviceBuffer[dtype]): Device buffer to copy to.
  • src_buf (HostBuffer[dtype]): Host buffer to copy from.
  • dependencies (List[DeviceGraphNode]): Explicit list of predecessor node handles. An empty list makes the new node a graph root with no predecessors; a non-empty list uses those exact handles as predecessors.

Returns:

DeviceGraphNode: A handle to the newly added memcpy node.

Raises:

If adding the node fails.

def add_copy[dtype: DType](self, dst_buf: HostBuffer[dtype], src_buf: DeviceBuffer[dtype], *, var dependencies: List[DeviceGraphNode]) -> DeviceGraphNode

Adds a device-to-host memcpy node to the graph.

The number of bytes copied is determined by the size of the device buffer.

Parameters:

  • dtype (DType): Type of the data being copied.

Args:

  • dst_buf (HostBuffer[dtype]): Host buffer to copy to.
  • src_buf (DeviceBuffer[dtype]): Device buffer to copy from.
  • dependencies (List[DeviceGraphNode]): Explicit list of predecessor node handles. An empty list makes the new node a graph root with no predecessors; a non-empty list uses those exact handles as predecessors.

Returns:

DeviceGraphNode: A handle to the newly added memcpy node.

Raises:

If adding the node fails.

def add_copy[dtype: DType](self, dst_buf: DeviceBuffer[dtype], src_buf: DeviceBuffer[dtype], *, var dependencies: List[DeviceGraphNode]) -> DeviceGraphNode

Adds a device-to-device memcpy node to the graph.

Both buffers must belong to the same context as this builder; cross-context copies are not supported in graphs. The number of bytes copied is determined by the size of the source buffer.

Parameters:

  • dtype (DType): Type of the data being copied.

Args:

  • dst_buf (DeviceBuffer[dtype]): Device buffer to copy to.
  • src_buf (DeviceBuffer[dtype]): Device buffer to copy from. Must be the same size as dst_buf.
  • dependencies (List[DeviceGraphNode]): Explicit list of predecessor node handles. An empty list makes the new node a graph root with no predecessors; a non-empty list uses those exact handles as predecessors.

Returns:

DeviceGraphNode: A handle to the newly added memcpy node.

Raises:

If adding the node fails.

add_memset

def add_memset[dtype: DType](self, dst: DeviceBuffer[dtype], val: Scalar[dtype], *, var dependencies: List[DeviceGraphNode]) -> DeviceGraphNode

Adds a memset node to the graph that sets all elements of dst to val.

Parameters:

  • dtype (DType): Type of the data stored in the buffer.

Args:

  • dst (DeviceBuffer[dtype]): Destination buffer.
  • val (Scalar[dtype]): Value to set all elements of dst to.
  • dependencies (List[DeviceGraphNode]): Explicit list of predecessor node handles. An empty list makes the new node a graph root with no predecessors; a non-empty list uses those exact handles as predecessors.

Returns:

DeviceGraphNode: A handle to the newly added memset node.

Raises:

If adding the node fails. The underlying graph APIs cannot express, as a single node, an 8-byte memset whose high and low 32-bit halves differ, so such values raise an error.
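To see which 8-byte values fall under the restriction described above, the following plain-Python sketch splits a 64-bit pattern into its two 32-bit halves and compares them. It is only a model of the stated condition, not the library's check; halves_match and float64_bits are hypothetical helper names.

```python
import struct


def halves_match(pattern: int) -> bool:
    """True when the low and high 32-bit halves of a 64-bit pattern are equal."""
    low = pattern & 0xFFFFFFFF
    high = (pattern >> 32) & 0xFFFFFFFF
    return low == high


def float64_bits(value: float) -> int:
    """Reinterpret a float64 as its 64-bit unsigned bit pattern."""
    return struct.unpack("<Q", struct.pack("<d", value))[0]


print(halves_match(float64_bits(0.0)))   # all-zero pattern: halves are equal
print(halves_match(float64_bits(1.0)))   # 0x3FF0000000000000: halves differ
print(halves_match(0xFFFFFFFFFFFFFFFF))  # all-ones pattern, e.g. int64 -1
```

Under this reading, a float64 value of 0.0 or an all-ones integer would be expressible, while a float64 value of 1.0 would be among the patterns that raise.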

add_empty

def add_empty(self, *, var dependencies: List[DeviceGraphNode]) -> DeviceGraphNode

Adds an empty (no-op) node to the graph.

Empty nodes perform no work at execution time. They are used purely for transitive ordering: a single empty node fanned in from m predecessors and out to n successors expresses an m-to-n barrier using m + n edges instead of m * n, and serves as a stable handle for "the completion of this phase" when the producer set is not visible to the consumer.
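The edge-count saving described above can be checked with a short plain-Python sketch. It only builds edge lists on the host; direct_edges, barrier_edges, and the node labels are hypothetical names for illustration and do not correspond to any Mojo API.

```python
def direct_edges(producers, consumers):
    """Edge list when each consumer depends on every producer directly."""
    return [(p, c) for p in producers for c in consumers]


def barrier_edges(producers, consumers, join="join"):
    """Edge list when one empty node sits between producers and consumers."""
    fan_in = [(p, join) for p in producers]
    fan_out = [(join, c) for c in consumers]
    return fan_in + fan_out


producers = ["p0", "p1", "p2", "p3"]  # m = 4
consumers = ["c0", "c1", "c2"]        # n = 3

print(len(direct_edges(producers, consumers)))   # m * n edges
print(len(barrier_edges(producers, consumers)))  # m + n edges
```

With m = 4 and n = 3 the direct wiring needs 12 edges and the barrier needs 7. Both forms impose the same ordering, since every consumer still runs after every producer.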

Args:

  • dependencies (List[DeviceGraphNode]): Explicit list of predecessor node handles. An empty list makes the new node a graph root with no predecessors; a non-empty list uses those exact handles as predecessors.

Returns:

DeviceGraphNode: A handle to the newly added empty node.

Raises:

If adding the node fails.

collect_dependencies

def collect_dependencies(self, work: T) -> DeviceGraphNode

Runs work and returns a single empty node that joins every node added to this builder during its execution.

The returned handle is suitable for use as a one-element dependencies= entry on a downstream add_* call. The empty node performs no work at execution time; it exists purely as a fan-in barrier so the caller does not need to thread the producer set's individual handles to every consumer.

Example:

from std.gpu.host import DeviceContext, DeviceGraphBuilder

with DeviceContext() as ctx:
    # buf_a, buf_b, and buf_c are device buffers and host_src is a host
    # buffer; all are assumed to have been created on ctx beforehand.
    var builder = ctx.create_graph_builder()

    def add_producers(b: DeviceGraphBuilder) raises {read} -> None:
        _ = b.add_memset(buf_a, UInt8(1), dependencies=[])
        _ = b.add_memset(buf_b, UInt8(2), dependencies=[])

    var producers_join = builder.collect_dependencies(add_producers)
    _ = builder.add_copy(
        buf_c, host_src, dependencies=[producers_join]
    )
    var graph = builder^.instantiate()
    graph.replay()

Args:

  • work (T): Closure whose effects on this builder are captured. The builder is passed as work's sole argument; the closure must not capture the same builder, since doing so would alias with this method's receiver. The closure may add any number of nodes (zero or more) via any of the add_* methods.

Returns:

DeviceGraphNode: Handle of the empty node that joins every node added by work.

Raises:

Anything work itself raises, or anything raised while adding the join node.
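The fan-in behavior of collect_dependencies can be modeled on the host in a few lines of plain Python. This is a sketch under stated assumptions, not the library's implementation: BuilderModel and its methods are hypothetical, integer handles stand in for DeviceGraphNode, and the join node is assumed to depend on every node added during work, as the description above states.

```python
class BuilderModel:
    """Toy stand-in for a graph builder (hypothetical; not the Mojo type)."""

    def __init__(self):
        self.predecessors = []  # predecessors[i] lists the handles node i waits on

    def add(self, dependencies):
        """Record a node and return its integer handle."""
        self.predecessors.append(list(dependencies))
        return len(self.predecessors) - 1

    def collect_dependencies(self, work):
        """Run work(self), then add a no-op node that waits on everything it added."""
        first_new = len(self.predecessors)
        work(self)
        added = range(first_new, len(self.predecessors))
        return self.add(added)


def add_producers(b):
    # The builder arrives as the sole argument, mirroring the documented contract.
    b.add(dependencies=[])
    b.add(dependencies=[])


builder = BuilderModel()
join = builder.collect_dependencies(add_producers)
consumer = builder.add(dependencies=[join])

print(builder.predecessors[join])      # handles of the two producers
print(builder.predecessors[consumer])  # only the join handle
```

In this model, a work closure that adds no nodes yields a join node with no predecessors, which makes it a graph root.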

instantiate

def instantiate(var self) -> DeviceGraph

Instantiates the constructed graph into an executable device graph.

Finalizes the graph construction and produces a DeviceGraph that can be replayed multiple times.

Returns:

DeviceGraph: The instantiated device graph.

Raises:

If instantiation fails.