DeviceGraphBuilder
struct DeviceGraphBuilder
Builder for explicit device graph construction.
A DeviceGraphBuilder is obtained from
DeviceContext.create_graph_builder().
Callers add kernel nodes via add_function() and then call
instantiate() to produce a reusable DeviceGraph.
Example:
from std.gpu.host import DeviceContext

def kernel(x: Int):
    print("Value:", x)

with DeviceContext() as ctx:
    var compiled_fn = ctx.compile_function[kernel]()
    var builder = ctx.create_graph_builder()
    _ = builder.add_function(compiled_fn, 42, grid_dim=1, block_dim=1, dependencies=[])
    var graph = builder^.instantiate()
    graph.replay()
    ctx.synchronize()
Implemented traits
AnyType,
ImplicitlyDestructible,
Movable
Methods
__init__
def __init__(out self, *, copy: Self)
Creates a copy of an existing graph builder by incrementing its reference count.
Args:
- copy (Self): The graph builder to copy.
__del__
def __del__(deinit self)
Releases resources associated with this graph builder.
add_function
def add_function[*Ts: DevicePassable](self, f: DeviceFunction[target=f.target, compile_options=f.compile_options, link_options=f.link_options, _ptxas_info_verbose=f._ptxas_info_verbose], *args: *Ts.values, *, grid_dim: Dim, block_dim: Dim, var dependencies: List[DeviceGraphNode], cluster_dim: OptionalReg[Dim] = None, shared_mem_bytes: OptionalReg[Int] = None, var attributes: List[LaunchAttribute] = List(__list_literal__=NoneType(None)), var constant_memory: List[ConstantMemoryMapping] = List(__list_literal__=NoneType(None))) -> DeviceGraphNode
Adds a type-checked compiled kernel function as a node in this graph.
Parameters:
- *Ts (DevicePassable): Argument types (must be DevicePassable).
Args:
- f (DeviceFunction[target=f.target, compile_options=f.compile_options, link_options=f.link_options, _ptxas_info_verbose=f._ptxas_info_verbose]): The type-checked compiled function to add. Must have been compiled via DeviceContext.compile_function().
- *args (*Ts.values): Arguments to pass to the kernel.
- grid_dim (Dim): Dimensions of the compute grid.
- block_dim (Dim): Dimensions of each thread block.
- dependencies (List[DeviceGraphNode]): Explicit list of predecessor node handles. An empty list makes the new node a graph root with no predecessors; a non-empty list uses those exact handles as predecessors.
- cluster_dim (OptionalReg[Dim]): Cluster dimensions (optional).
- shared_mem_bytes (OptionalReg[Int]): Amount of dynamic shared memory per block.
- attributes (List[LaunchAttribute]): Launch attributes.
- constant_memory (List[ConstantMemoryMapping]): Constant memory mappings.
Returns:
DeviceGraphNode: A handle to the newly added kernel-dispatch node.
Raises:
If adding the node fails.
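The dependencies argument is what orders nodes inside the graph. The following is a minimal sketch, not a definitive pattern, that reuses the kernel from the struct-level example and chains two launches by passing the handle returned from the first call to the second. Wrapping the code in def main() raises is an assumption made so the snippet forms a complete program.

```mojo
from std.gpu.host import DeviceContext


def kernel(x: Int):
    print("Value:", x)


# Assumption: a raising `main` entry point, added so the sketch is a full program.
def main() raises:
    with DeviceContext() as ctx:
        var compiled_fn = ctx.compile_function[kernel]()
        var builder = ctx.create_graph_builder()

        # Root node: an empty dependency list means no predecessors.
        var first = builder.add_function(
            compiled_fn, 1, grid_dim=1, block_dim=1, dependencies=[]
        )
        # This node lists `first` as its only predecessor, so it runs after it.
        _ = builder.add_function(
            compiled_fn, 2, grid_dim=1, block_dim=1, dependencies=[first]
        )

        var graph = builder^.instantiate()
        graph.replay()
        ctx.synchronize()
```

The same compiled function is added twice with different arguments; only the dependency list decides the order in which the two nodes execute.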
def add_function[FuncType: def() -> None, //, dump_asm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], def() capturing -> Path] = False, dump_llvm: Variant[Bool, Path, StringSlice[StaticConstantOrigin], def() capturing -> Path] = False, _dump_sass: Variant[Bool, Path, StringSlice[StaticConstantOrigin], def() capturing -> Path] = False, _ptxas_info_verbose: Bool = False](self, func: FuncType, grid_dim: Dim, block_dim: Dim, *, var dependencies: List[DeviceGraphNode], cluster_dim: OptionalReg[Dim] = None, shared_mem_bytes: OptionalReg[Int] = None, var attributes: List[LaunchAttribute] = List(__list_literal__=NoneType(None)), var constant_memory: List[ConstantMemoryMapping] = List(__list_literal__=NoneType(None))) -> DeviceGraphNode
Compiles and adds a capturing kernel closure as a node in this graph.
This overload is for kernels that capture variables from their
enclosing scope using the {var} capture syntax. Compilation is
performed automatically using the DeviceContext that created this
builder, so no separate compile step is needed.
Example:
from std.gpu import global_idx
from std.gpu.host import DeviceContext

with DeviceContext() as ctx:
    var scale: Float32 = 2.0
    var buf = ctx.enqueue_create_buffer[DType.float32](256)
    var ptr = buf.unsafe_ptr()

    def scale_kernel() {var}:
        var i = global_idx.x
        ptr[i] = Float32(i) * scale

    var builder = ctx.create_graph_builder()
    _ = builder.add_function(
        scale_kernel, grid_dim=1, block_dim=256, dependencies=[]
    )
    var graph = builder^.instantiate()
    graph.replay()
    ctx.synchronize()
Parameters:
- FuncType (def() -> None): The type of the closure function (usually inferred).
- dump_asm (Variant[Bool, Path, StringSlice[StaticConstantOrigin], def() capturing -> Path]): To dump the compiled assembly, pass True, or a file path to dump to, or a function returning a file path.
- dump_llvm (Variant[Bool, Path, StringSlice[StaticConstantOrigin], def() capturing -> Path]): To dump the generated LLVM code, pass True, or a file path to dump to, or a function returning a file path.
- _dump_sass (Variant[Bool, Path, StringSlice[StaticConstantOrigin], def() capturing -> Path]): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Pass True, or a file path to dump to, or a function returning a file path.
- _ptxas_info_verbose (Bool): Only runs on NVIDIA targets, and requires CUDA Toolkit to be installed. Changes dump_asm to output verbose PTX assembly (default False).
Args:
- func (FuncType): The capturing kernel closure to compile and add as a graph node.
- grid_dim (Dim): Dimensions of the compute grid.
- block_dim (Dim): Dimensions of each thread block.
- dependencies (List[DeviceGraphNode]): Explicit list of predecessor node handles. An empty list makes the new node a graph root with no predecessors; a non-empty list uses those exact handles as predecessors.
- cluster_dim (OptionalReg[Dim]): Cluster dimensions (optional).
- shared_mem_bytes (OptionalReg[Int]): Amount of dynamic shared memory per block.
- attributes (List[LaunchAttribute]): Launch attributes.
- constant_memory (List[ConstantMemoryMapping]): Constant memory mappings.
Returns:
DeviceGraphNode: A handle to the newly added kernel-dispatch node.
Raises:
If adding the node fails.
add_copy
def add_copy[dtype: DType](self, dst_buf: DeviceBuffer[dtype], src_buf: HostBuffer[dtype], *, var dependencies: List[DeviceGraphNode]) -> DeviceGraphNode
Adds a host-to-device memcpy node to the graph.
The number of bytes copied is determined by the size of the device buffer.
Parameters:
- dtype (DType): Type of the data being copied.
Args:
- dst_buf (DeviceBuffer[dtype]): Device buffer to copy to.
- src_buf (HostBuffer[dtype]): Host buffer to copy from.
- dependencies (List[DeviceGraphNode]): Explicit list of predecessor node handles. An empty list makes the new node a graph root with no predecessors; a non-empty list uses those exact handles as predecessors.
Returns:
DeviceGraphNode: A handle to the newly added memcpy node.
Raises:
If adding the node fails.
def add_copy[dtype: DType](self, dst_buf: HostBuffer[dtype], src_buf: DeviceBuffer[dtype], *, var dependencies: List[DeviceGraphNode]) -> DeviceGraphNode
Adds a device-to-host memcpy node to the graph.
The number of bytes copied is determined by the size of the device buffer.
Parameters:
- dtype (DType): Type of the data being copied.
Args:
- dst_buf (HostBuffer[dtype]): Host buffer to copy to.
- src_buf (DeviceBuffer[dtype]): Device buffer to copy from.
- dependencies (List[DeviceGraphNode]): Explicit list of predecessor node handles. An empty list makes the new node a graph root with no predecessors; a non-empty list uses those exact handles as predecessors.
Returns:
DeviceGraphNode: A handle to the newly added memcpy node.
Raises:
If adding the node fails.
def add_copy[dtype: DType](self, dst_buf: DeviceBuffer[dtype], src_buf: DeviceBuffer[dtype], *, var dependencies: List[DeviceGraphNode]) -> DeviceGraphNode
Adds a device-to-device memcpy node to the graph.
Both buffers must belong to the same context as this builder; cross-context copies are not supported in graphs. The number of bytes copied is determined by the size of the source buffer.
Parameters:
- dtype (DType): Type of the data being copied.
Args:
- dst_buf (DeviceBuffer[dtype]): Device buffer to copy to.
- src_buf (DeviceBuffer[dtype]): Device buffer to copy from. Must be the same size as dst_buf.
- dependencies (List[DeviceGraphNode]): Explicit list of predecessor node handles. An empty list makes the new node a graph root with no predecessors; a non-empty list uses those exact handles as predecessors.
Returns:
DeviceGraphNode: A handle to the newly added memcpy node.
Raises:
If adding the node fails.
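The three add_copy overloads can be combined into one upload, device-to-device, download chain. The sketch below is illustrative only. It assumes that DeviceContext provides enqueue_create_host_buffer with the same shape as enqueue_create_buffer, that HostBuffer supports indexed assignment, and that a def main() raises entry point is acceptable; none of these is stated on this page.

```mojo
from std.gpu.host import DeviceContext


# Assumption: a raising `main` entry point, added so the sketch is a full program.
def main() raises:
    with DeviceContext() as ctx:
        # Assumption: enqueue_create_host_buffer exists with this signature.
        var host_in = ctx.enqueue_create_host_buffer[DType.float32](256)
        var host_out = ctx.enqueue_create_host_buffer[DType.float32](256)
        var dev_a = ctx.enqueue_create_buffer[DType.float32](256)
        var dev_b = ctx.enqueue_create_buffer[DType.float32](256)
        # Wait for the enqueued allocations before touching host memory.
        ctx.synchronize()

        # Assumption: HostBuffer supports indexed assignment.
        for i in range(256):
            host_in[i] = Float32(i)

        var builder = ctx.create_graph_builder()
        # Host-to-device upload is the graph root.
        var upload = builder.add_copy(dev_a, host_in, dependencies=[])
        # Device-to-device copy waits for the upload; both buffers are the same size.
        var d2d = builder.add_copy(dev_b, dev_a, dependencies=[upload])
        # Device-to-host download waits for the device-to-device copy.
        _ = builder.add_copy(host_out, dev_b, dependencies=[d2d])

        var graph = builder^.instantiate()
        graph.replay()
        ctx.synchronize()
```

Every buffer has the same element count, which satisfies the same-size requirement of the device-to-device overload.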
add_memset
def add_memset[dtype: DType](self, dst: DeviceBuffer[dtype], val: Scalar[dtype], *, var dependencies: List[DeviceGraphNode]) -> DeviceGraphNode
Adds a memset node to the graph that sets all elements of dst to val.
Parameters:
- dtype (DType): Type of the data stored in the buffer.
Args:
- dst (DeviceBuffer[dtype]): Destination buffer.
- val (Scalar[dtype]): Value to set all elements of dst to.
- dependencies (List[DeviceGraphNode]): Explicit list of predecessor node handles. An empty list makes the new node a graph root with no predecessors; a non-empty list uses those exact handles as predecessors.
Returns:
DeviceGraphNode: A handle to the newly added memset node.
Raises:
If adding the node fails. The underlying graph APIs cannot express an 8-byte memset whose high and low 32-bit halves differ as a single node, so such patterns raise an error.
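To make the 8-byte restriction concrete, the hedged sketch below contrasts two Float64 values. The bit patterns follow from IEEE 754: 0.0 is 0x0000000000000000, so both 32-bit halves match, while 1.0 is 0x3FF0000000000000, so the halves differ. That the second call raises is inferred from the note above rather than verified, and the def main() raises wrapper is an assumption.

```mojo
from std.gpu.host import DeviceContext


# Assumption: a raising `main` entry point, added so the sketch is a full program.
def main() raises:
    with DeviceContext() as ctx:
        var buf = ctx.enqueue_create_buffer[DType.float64](256)
        var builder = ctx.create_graph_builder()

        # 0.0 has identical high and low 32-bit halves, so one node can express it.
        var zero = builder.add_memset(buf, Float64(0.0), dependencies=[])

        # 1.0 has high half 0x3FF00000 and low half 0x00000000.
        # Per the Raises note, this pattern is expected to be rejected.
        try:
            _ = builder.add_memset(buf, Float64(1.0), dependencies=[zero])
        except e:
            print("add_memset rejected the pattern:", e)
```

A pattern like this would need a different mechanism, for example a small kernel node that writes the value, which is a suggestion and not something this page prescribes.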
add_empty
def add_empty(self, *, var dependencies: List[DeviceGraphNode]) -> DeviceGraphNode
Adds an empty (no-op) node to the graph.
Empty nodes perform no work at execution time. They are used purely
for transitive ordering: a single empty node fanned in from m
predecessors and out to n successors expresses an m-to-n
barrier using m + n edges instead of m * n, and serves as a
stable handle for "the completion of this phase" when the producer
set is not visible to the consumer.
Args:
- dependencies (List[DeviceGraphNode]): Explicit list of predecessor node handles. An empty list makes the new node a graph root with no predecessors; a non-empty list uses those exact handles as predecessors.
Returns:
DeviceGraphNode: A handle to the newly added empty node.
Raises:
If adding the node fails.
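The barrier pattern described above can be sketched with three producers and two consumers, which needs 3 + 2 = 5 edges through the empty node instead of 3 * 2 = 6 direct edges. This is an illustration under assumptions: the def main() raises wrapper is added for completeness, and the sketch assumes a DeviceGraphNode handle can be placed in more than one dependency list.

```mojo
from std.gpu.host import DeviceContext


# Assumption: a raising `main` entry point, added so the sketch is a full program.
def main() raises:
    with DeviceContext() as ctx:
        var a = ctx.enqueue_create_buffer[DType.uint8](64)
        var b = ctx.enqueue_create_buffer[DType.uint8](64)
        var c = ctx.enqueue_create_buffer[DType.uint8](64)
        var out_a = ctx.enqueue_create_buffer[DType.uint8](64)
        var out_b = ctx.enqueue_create_buffer[DType.uint8](64)

        var builder = ctx.create_graph_builder()

        # Phase 1: three independent producers (m = 3), each a graph root.
        var p0 = builder.add_memset(a, UInt8(1), dependencies=[])
        var p1 = builder.add_memset(b, UInt8(2), dependencies=[])
        var p2 = builder.add_memset(c, UInt8(3), dependencies=[])

        # Barrier: fan in from all producers using three edges.
        var barrier = builder.add_empty(dependencies=[p0, p1, p2])

        # Phase 2: two consumers (n = 2), each with a single edge from the barrier.
        # Assumption: the handle can be copied into more than one list.
        _ = builder.add_copy(out_a, a, dependencies=[barrier])
        _ = builder.add_copy(out_b, b, dependencies=[barrier])

        var graph = builder^.instantiate()
        graph.replay()
        ctx.synchronize()
```

The consumers never see p0, p1, or p2 directly, which is the "stable handle" use case: later code only needs the barrier handle.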
collect_dependencies
def collect_dependencies(self, work: T) -> DeviceGraphNode
Runs work and returns a single empty node that joins every node added to this builder during its execution.
The returned handle is suitable for use as a one-element
dependencies= entry on a downstream add_* call. The empty
node performs no work at execution time; it exists purely as a
fan-in barrier so the caller does not need to thread the
producer set's individual handles to every consumer.
Example:
from std.gpu.host import DeviceContext, DeviceGraphBuilder

with DeviceContext() as ctx:
    # buf_a, buf_b, and buf_c are assumed to be existing DeviceBuffer[DType.uint8]
    # values, and host_src an existing HostBuffer[DType.uint8].
    var builder = ctx.create_graph_builder()

    def add_producers(b: DeviceGraphBuilder) raises {read} -> None:
        _ = b.add_memset(buf_a, UInt8(1), dependencies=[])
        _ = b.add_memset(buf_b, UInt8(2), dependencies=[])

    var producers_join = builder.collect_dependencies(add_producers)
    _ = builder.add_copy(
        buf_c, host_src, dependencies=[producers_join]
    )
    var graph = builder^.instantiate()
    graph.replay()
Args:
- work (T): Closure whose effects on this builder are captured. The builder is passed as work's sole argument; the closure must not capture the same builder, since doing so would alias with this method's receiver. The closure may add any number of nodes (zero or more) via any of the add_* methods.
Returns:
DeviceGraphNode: Handle of the empty node that joins every node added by work.
Raises:
Anything work itself raises, or anything raised while
adding the join node.
instantiate
def instantiate(var self) -> DeviceGraph
Instantiates the constructed graph into an executable device graph.
Finalizes the graph construction and produces a DeviceGraph that
can be replayed multiple times.
Returns:
DeviceGraph: The instantiated device graph.
Raises:
If instantiation fails.
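Because instantiate() takes self as var, the builder is consumed and only the resulting DeviceGraph remains usable. A minimal sketch of build-once, replay-many usage follows; the def main() raises wrapper and the replay count of three are illustrative assumptions.

```mojo
from std.gpu.host import DeviceContext


# Assumption: a raising `main` entry point, added so the sketch is a full program.
def main() raises:
    with DeviceContext() as ctx:
        var buf = ctx.enqueue_create_buffer[DType.uint8](64)

        var builder = ctx.create_graph_builder()
        _ = builder.add_memset(buf, UInt8(0), dependencies=[])

        # `builder^` transfers ownership, so the builder is not used after this line.
        var graph = builder^.instantiate()

        # The instantiated graph can be replayed without rebuilding it.
        for _ in range(3):
            graph.replay()

        # Wait for the replays to finish before the context exits.
        ctx.synchronize()
```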