> For the complete Mojo documentation index, see [llms.txt](/llms.txt).
> Markdown versions of all pages are available by appending .md to any URL (e.g. /docs/manual/basics.md).

# Get started with GPU programming

This tutorial introduces you to GPU programming with Mojo. You'll learn how to
write a simple program that performs vector addition on a GPU, exploring
fundamental concepts of GPU programming along the way.

By the end of this tutorial, you will:

- Understand basic GPU programming concepts like grids and thread blocks.
- Learn how to move data between CPU and GPU memory.
- Write and compile a simple GPU kernel function.
- Execute parallel computations on the GPU.
- Understand the asynchronous nature of GPU programming.

We'll build everything step-by-step, starting with the basics and gradually
adding more complexity. The concepts you learn here will serve as a foundation
for more advanced GPU programming with Mojo. If you just want to see the
finished code, you can [get it on
GitHub](https://github.com/modular/modular/tree/main/mojo/examples/gpu-intro).
See [GPU compatibility](/docs/requirements/#gpu-compatibility) for a list of
compatible GPUs and software requirements.

:::tip

**Tip:** To use AI coding assistants with Mojo, see our guide for
[using Mojo AI skills](/docs/tools/skills/).

:::

## 1. Create a Mojo project

To install Mojo, we recommend using [`pixi`](https://pixi.sh/latest/) (for
other options, see the [install guide](/install/)).

1. If you don't have `pixi`, you can install it with this command:

    ```sh
    curl -fsSL https://pixi.sh/install.sh | sh
    ```

2. Navigate to the directory in which you want to create the project and
   execute:

    ```bash
    pixi init gpu-intro \
      -c https://conda.modular.com/max/ -c conda-forge \
      && cd gpu-intro
    ```

    This creates a project directory named `gpu-intro`, adds the Modular conda
    package channel, and enters the directory.

3. Install the `mojo` package:

    ```bash
    pixi add mojo
    ```

4. Verify the project is configured correctly by checking the version of Mojo
   that's installed within our project's virtual environment:

    ```bash
    pixi run mojo --version
    ```

    You should see a version string indicating the version of Mojo installed.
    By default, this should be the latest nightly version.

5. Activate the project's virtual environment:

    ```bash
    pixi shell
    ```

    Later on, when you want to exit the virtual environment, just type `exit`.

## 2. Get a reference to the GPU device

The [`DeviceContext`](/docs/std/gpu/host/device_context/DeviceContext/) type
represents a logical instance of a GPU device. It provides methods for
allocating memory on the device, copying data between the host CPU and the GPU,
and compiling and running functions (also known as *kernels*) on the device.

Use the
[`DeviceContext()`](/docs/std/gpu/host/device_context/DeviceContext/#__init__)
constructor to get a reference to the GPU device. The constructor raises an
error if no compatible GPU is available. You can use the
[`has_accelerator()`](/docs/std/sys/info/has_accelerator/) function to check
if a compatible GPU is available.

Let's start by writing a program that checks if a GPU is available and then
obtains a reference to the GPU device. Using any editor, create a file named
`vector_addition.mojo` with the following code:

```mojo title="vector_addition.mojo"
from std.gpu.host import DeviceContext
from std.sys import has_accelerator

def main() raises:
    comptime if not has_accelerator():
        print("No compatible GPU found")
    else:
        ctx = DeviceContext()
        print("Found GPU:", ctx.name())
```

Save the file and run it using the `mojo` CLI:

```bash
mojo vector_addition.mojo
```

You should see output like the following (depending on the type of GPU you
have):

```output
Found GPU: NVIDIA A10G
```

:::note

Mojo requires a
[compatible GPU development environment](/docs/requirements/#gpu-compatibility)
to compile kernel functions, otherwise it raises a compile-time error. In our
code, we're using the
[`comptime if`](/docs/manual/metaprogramming/comptime-evaluation/#comptime-if)
statement to evaluate the `has_accelerator()` function at compile time and
compile only the corresponding branch of the `if` statement. As a result, if you
don't have a compatible GPU development environment, you'll see the following
message when you run the program:

```output
No compatible GPU found
```

In that case, you need to find a system that has a supported GPU to continue
with this tutorial.

:::

## 3. Define a simple kernel

A GPU *kernel* is simply a function that runs on a GPU, executing a specific
computation on a large dataset in parallel across thousands or millions of
*threads*. You might already be familiar with threads when programming for a
CPU, but GPU threads are different. On a CPU, threads are managed by the
operating system and can perform completely independent tasks, such as managing
a user interface, fetching data from a database, and so on. But on a GPU,
threads are managed by the GPU itself. All the threads on a GPU execute the same
kernel function, but they each work on a different part of the data.

When you run a kernel, you need to specify the number of threads you want to
use. The number of threads you specify depends on the size of the data you want
to process and the amount of parallelism you want to achieve. A common strategy
is to use one thread per element of data in the result. So if you're performing
an element-wise addition of two 1,024-element vectors, you'd use 1,024 threads.

A *grid* is the top-level organizational structure for the threads executing a
kernel function. A grid consists of multiple *thread blocks*, which are further
divided into individual threads that execute the kernel function concurrently.
The GPU assigns a unique block index to each thread block, and a unique thread
index to each thread within a block. Threads within the same thread block can
share data through shared memory and synchronize using built-in mechanisms, but
they cannot directly communicate with threads in other blocks. For this
tutorial, we won't get into the details of why or how to do this, but it's an
important concept to keep in mind when you're writing more complex kernels.

To better understand how grids, thread blocks, and threads are organized, let's
write a simple kernel function that prints the thread block and thread indices.
Add the following code to your `vector_addition.mojo` file:

```mojo title="vector_addition.mojo"
from std.gpu import block_idx, thread_idx

def print_threads():
    """Print thread IDs."""

    print(block_idx.x, thread_idx.x, sep="\t")
```

## 4. Compile and run the kernel

Next, we need to update the `main()` function to compile the kernel function for
our GPU and then run it, specifying the number of thread blocks in the grid and
the number of threads per thread block. For this initial example, let's define a
grid consisting of 2 thread blocks, each with 64 threads.

Modify the `main()` function so that your program looks like this:

```mojo title="vector_addition.mojo"
from std.sys import has_accelerator

from std.gpu.host import DeviceContext
from std.gpu import block_idx, thread_idx

def print_threads():
    """Print thread IDs."""

    print(block_idx.x, thread_idx.x, sep="\t")

def main() raises:
    comptime if not has_accelerator():
        print("No compatible GPU found")
    else:
        ctx = DeviceContext()
        print("block\tthread")
        ctx.enqueue_function[print_threads](https://mojolang.org/docs/manual/gpu/grid_dim=2, block_dim=64.md)
        ctx.synchronize()
        print("Program finished")
```

Save the file and run it:

```bash
mojo vector_addition.mojo
```

You should see something like the following output (which is abbreviated here).
The kernel prints the block and thread index for each thread as tab-separated
numeric values, with the `main()` function printing a header row before
launching the kernel. The output order is indeterminate because the threads
execute concurrently.

```output
block	thread
1	32
1	33
1	34
...
0	30
0	31
Program finished
```

:::note Known limitation

On Apple silicon GPUs, each `print()` call inside a GPU kernel currently
supports at most one string literal argument. To ensure readable output across
all GPU platforms, the examples here print numeric values only, with the host
printing column headers before launching the kernel. This restriction doesn't
apply to NVIDIA or AMD GPUs.

:::

Typical CPU-GPU interaction is asynchronous, allowing the GPU to process tasks
while the CPU is busy with other work. Each `DeviceContext` has an associated
stream of queued operations to execute on the GPU. Operations within a stream
execute in the order they are issued.

The
[`enqueue_function()`](/docs/std/gpu/host/device_context/DeviceContext/#enqueue_function)
method compiles a kernel function and enqueues it to run on the given device.
You must provide the name of the kernel function as a compile-time Mojo
parameter, and the following arguments:

- Any additional arguments specified by the kernel function definition (none, in
  this case).
- The grid dimensions using the `grid_dim` keyword argument.
- The thread block dimensions using the `block_dim` keyword argument.

(See the [Functions](/docs/manual/functions/) section of the Mojo Manual for
more information on Mojo function arguments and the
[Parameters](/docs/manual/parameters/) section for more information on Mojo
compile-time parameters and metaprogramming.)

:::note Compile-time type checking

The `compile_function()` and `enqueue_function()` methods type-check the
kernel function arguments at compile time. If an argument's type doesn't match
the kernel function's parameter list, you'll get a compile-time error instead
of a run-time failure.

:::

In this example, we're invoking the compiled kernel function with `grid_dim=2`
and `block_dim=64`, which means we're using a grid of 2 thread blocks with 64
threads each, for a total of 128 threads in the grid.

When you run a kernel, the GPU assigns each thread block within the grid to a
*streaming multiprocessor* for execution. A streaming multiprocessor (SM) is the
fundamental processing unit of a GPU, designed to execute multiple parallel
workloads efficiently. Each SM contains several cores, which perform the actual
computations of the threads executing on the SM, along with shared resources
like registers, shared memory, and control mechanisms to coordinate the
execution of threads. The number of SMs and the number of cores on a GPU depends
on its architecture. For example, the NVIDIA H100 PCIe contains 114 SMs, with
128 32-bit floating point cores per SM.

<figure>

![](../images/gpu/sm-architecture.png#light)
![](../images/gpu/sm-architecture-dark.png#dark)

<figcaption>**Figure 1.** High-level architecture of a streaming
multiprocessor (SM). (Click to enlarge.)</figcaption>

</figure>

Additionally, when an SM is assigned a thread block, it divides the block into
multiple *warps*, which are groups of 32 or 64 threads, depending on the GPU
architecture. These threads execute the same instruction simultaneously in a
*single instruction, multiple threads* (SIMT) model. The SM's *warp scheduler*
coordinates the execution of warps on the SM's cores.

<figure>

![](../images/gpu/grid-hierarchy.png#light)
![](../images/gpu/grid-hierarchy-dark.png#dark)

<figcaption>**Figure 2.** Hierarchy of threads running on a GPU, showing the
relationship of the grid, thread blocks, warps, and individual threads, based
on <cite><a
href="https://rocm.docs.amd.com/projects/HIP/en/latest/understand/programming_model.html">HIP
Programming Guide</a></cite></figcaption>

</figure>

Warps are used to efficiently utilize GPU hardware by maximizing throughput and
minimizing control overhead. Since GPUs are designed for high-performance
parallel processing, grouping threads into warps allows for streamlined
instruction scheduling and execution, reducing the complexity of managing
individual threads. Multiple warps from multiple thread blocks can be active
within an SM at any given time, enabling the GPU to keep execution units busy.
For example, if the threads of a particular warp are blocked waiting for data
from memory, the warp scheduler can immediately switch execution to another warp
that's ready to run.

After enqueuing the kernel function, we want to ensure that the CPU waits for it
to finish execution before exiting the program. We do this by calling the
[`synchronize()`](/docs/std/gpu/host/device_context/DeviceContext/#synchronize)
method of the `DeviceContext` object, which blocks until the device completes
all operations in its queue.

## 5. Manage grid dimensions

The previous step used a one-dimensional grid of 2 thread
blocks with 64 threads in each block. However, you can also organize the thread
blocks in a two- or three-dimensional grid. Similarly, you can arrange
the threads in a thread block across one, two, or three dimensions. Typically,
you determine the dimensions of the grid and thread blocks based on the
dimensionality of the data to process. For example, you might choose a
1-dimensional grid for processing large vectors, a 2-dimensional grid for
processing matrices, and a 3-dimensional grid for processing the frames of a
video.

<figure>

![](../images/gpu/multidimensional-grid.png#light)
![](../images/gpu/multidimensional-grid-dark.png#dark)

<figcaption>**Figure 3.** Organization of thread blocks and threads within a
grid. </figcaption>

</figure>

To better understand how grids, thread blocks, and threads work together, let's
modify our `print_threads()` kernel function to print the `x`, `y`, and `z`
components of the thread block and thread indices for each thread.

```mojo title="vector_addition.mojo"
def print_threads():
    """Print thread IDs."""

    print(
        block_idx.x,
        block_idx.y,
        block_idx.z,
        thread_idx.x,
        thread_idx.y,
        thread_idx.z,
        sep="\t",
    )
```

Then, update `main()` to print column headers and enqueue the kernel function
with a 2x2x1 grid of thread blocks and a 16x4x2 arrangement of threads within
each thread block:

```mojo title="vector_addition.mojo"
        print("block_idx\t\tthread_idx")
        print("x\ty\tz", "x\ty\tz", sep="\t")
        print("-" * 20, "-" * 20, sep="\t")
        ctx.enqueue_function[print_threads](https://mojolang.org/docs/manual/gpu/grid_dim=(2, 2, 1.md),
            block_dim=(16, 4, 2)
        )
```

Save the file and run it again:

```bash
mojo vector_addition.mojo
```

You should see something like the following output (which is abbreviated here):

```output
block_idx		thread_idx
x	y	z	x	y	z
--------------------	--------------------
1	1	0	0	2	0
1	1	0	1	2	0
1	1	0	2	2	0
...
0	0	0	14	1	0
0	0	0	15	1	0
Program finished
```

Try changing the grid and thread block dimensions to see how the output changes.

:::note

The maximum number of threads per thread block and threads per SM is
GPU-specific. For example, the NVIDIA A100 GPU has a maximum of 1,024 threads
per thread block and 2,048 threads per SM.

Choosing the size and shape of the grid and thread blocks is a balancing act
between maximizing the number of threads that can execute concurrently and
minimizing the amount of time spent waiting for data to be loaded from memory.
Factors such as the size of the data to process, the number of SMs on the GPU,
and the memory bandwidth of the GPU can all play a role in determining the
optimal grid and thread block dimensions. One general guideline is to choose a
thread block size that is a multiple of the warp size. This helps to maximize
the utilization of the GPU's resources and minimizes the overhead of managing
multiple warps.

:::

Now that you understand how to manage grid dimensions, let's get ready to create
a kernel that performs a simple element-wise addition of two vectors of floating
point numbers.

## 6. Allocate host memory for the input vectors

Before creating the two input vectors for our kernel function, we need to
understand the distinction between *host memory* and *device memory*. Host
memory is dynamic random-access memory (DRAM) accessible by the CPU, whereas
device memory is DRAM accessible by the GPU. If you have data in host memory,
you must explicitly copy it to device memory before you can use it in a kernel
function. Similarly, if your kernel function produces data that you want the CPU
to use later, you must explicitly copy it back to host memory.

For this tutorial, we'll use the
[`HostBuffer`](/docs/std/gpu/host/device_context/HostBuffer/) type to
represent our vectors on the host. A `HostBuffer` is a block of host memory
associated with a particular `DeviceContext`. It supports methods for
transferring data between host and device memory, as well as a basic set of
methods for accessing data elements by index and for printing the buffer.

Let's update `main()` to create two `HostBuffer`s for our input vectors and
initialize them with values. You won't need the `print_threads()` kernel
function anymore, so you can remove it and the code to compile and invoke it.

After making these changes, your `vector_addition.mojo` file should look like
this:

```mojo title="vector_addition.mojo"
from std.gpu.host import DeviceContext
from std.gpu import block_idx, thread_idx
from std.sys import has_accelerator

# Vector data type and size
comptime float_dtype = DType.float32
comptime vector_size = 1000

def main() raises:
    comptime if not has_accelerator():
        print("No compatible GPU found")
    else:
        # Get the context for the attached GPU
        ctx = DeviceContext()

        # Create HostBuffers for input vectors
        lhs_host_buffer = ctx.enqueue_create_host_buffer[float_dtype](https://mojolang.org/docs/manual/gpu/vector_size.md)
        rhs_host_buffer = ctx.enqueue_create_host_buffer[float_dtype](https://mojolang.org/docs/manual/gpu/vector_size.md)
        ctx.synchronize()

        # Initialize the input vectors
        for i in range(vector_size):
            lhs_host_buffer[i] = Float32(i)
            rhs_host_buffer[i] = Float32(Float64(i) * 0.5)

        print("LHS buffer: ", lhs_host_buffer)
        print("RHS buffer: ", rhs_host_buffer)
```

The
[`enqueue_create_host_buffer()`](/docs/std/gpu/host/device_context/DeviceContext/#enqueue_create_host_buffer)
method accepts the data type as a compile-time parameter and the size of the
buffer as a run-time argument and returns a `HostBuffer`. As with all
`DeviceContext` methods whose name starts with `enqueue_`, the method is
asynchronous and returns immediately, adding the operation to the queue to be
executed by the `DeviceContext`. Therefore, we need to call the `synchronize()`
method to ensure that the operation has completed before we use the `HostBuffer`
object. Then we can initialize the input vectors with values and print them.

Now let's run the program to verify that everything is working so far.

```bash
mojo vector_addition.mojo
```

You should see the following output:

```output
LHS buffer:  HostBuffer([0.0, 1.0, 2.0, ..., 997.0, 998.0, 999.0])
RHS buffer:  HostBuffer([0.0, 0.5, 1.0, ..., 498.5, 499.0, 499.5])
```

:::note

You might notice that we don't explicitly call any methods to free the host
memory allocated by our `HostBuffer`s. That's because a `HostBuffer` is subject
to Mojo's standard ownership and lifecycle mechanisms. The Mojo compiler
analyzes our program to determine the last point that the owner of or a
reference to an object is used and automatically adds a call to the object's
destructor. In our program, we last reference the buffers at the end of our
program's `main()` method. However, in a more complex program, the `HostBuffer`
could persist across calls to multiple kernel functions if it is referenced at
later points in the program. See the [Ownership](/docs/manual/values/ownership/)
and [Intro to value lifecycle](/docs/manual/lifecycle/) sections of the Mojo
Manual for more information on Mojo value ownership and value lifecycle
management.

:::

## 7. Copy the input vectors to GPU memory and allocate an output vector

Now that we have our input vectors allocated and initialized on the CPU, let's
copy them to the GPU so they'll be available for the kernel function to use.
While we're at it, we'll also allocate memory on the GPU for the output
vector that will hold the result of the kernel function.

Add the following code to the end of the `main()` function:

```mojo title="vector_addition.mojo"
        # Create DeviceBuffers for the input vectors
        lhs_device_buffer = ctx.enqueue_create_buffer[float_dtype](https://mojolang.org/docs/manual/gpu/vector_size.md)
        rhs_device_buffer = ctx.enqueue_create_buffer[float_dtype](https://mojolang.org/docs/manual/gpu/vector_size.md)

        # Copy the input vectors from the HostBuffers to the DeviceBuffers
        ctx.enqueue_copy(dst_buf=lhs_device_buffer, src_buf=lhs_host_buffer)
        ctx.enqueue_copy(dst_buf=rhs_device_buffer, src_buf=rhs_host_buffer)

        # Create a DeviceBuffer for the result vector
        result_device_buffer = ctx.enqueue_create_buffer[float_dtype](https://mojolang.org/docs/manual/gpu/vector_size.md)
```

The [`DeviceBuffer`](/docs/std/gpu/host/device_context/DeviceBuffer/) type is
analogous to the `HostBuffer` type, but represents a block of device memory
associated with a particular `DeviceContext`. Specifically, the buffer is
located in the device's *global memory* space, which is accessible by all
threads executing on the device. As with a `HostBuffer`, a `DeviceBuffer` is
subject to Mojo's standard ownership and lifecycle mechanisms. It persists until
it is no longer referenced in the program or until the `DeviceContext` itself
is destroyed.

The
[`enqueue_create_buffer()`](/docs/std/gpu/host/device_context/DeviceContext/#enqueue_create_buffer)
method accepts the data type as a compile-time parameter and the size of the
buffer as a run-time argument and returns a `DeviceBuffer`. The operation is
asynchronous, but we don't need to call the `synchronize()` method yet because
we have more operations to add to the queue.

The
[`enqueue_copy()`](/docs/std/gpu/host/device_context/DeviceContext/#enqueue_copy)
method is overloaded to support copying from host to device, device to host, or
even device to device for systems that have multiple GPUs. In this example, we
use it to copy the data in our `HostBuffer`s to the `DeviceBuffer`s.

:::note

Both `DeviceBuffer` and `HostBuffer` also include
[`enqueue_copy_to()`](/docs/std/gpu/host/device_context/DeviceBuffer/#enqueue_copy_to)
and
[`enqueue_copy_from()`](/docs/std/gpu/host/device_context/DeviceBuffer/#enqueue_copy_from)
methods. These are simply convenience methods that call the `enqueue_copy()`
method on their corresponding `DeviceContext`. Therefore, we could have written
the copy operations in the previous example with the following equivalent code:

```mojo
    lhs_host_buffer.enqueue_copy_to(dst=lhs_device_buffer)
    rhs_host_buffer.enqueue_copy_to(dst=rhs_device_buffer)
```

:::

## 8. Create `TileTensor` views

One last step before writing the kernel function is that we're going to create a
[`TileTensor`](/docs/layout/tile_tensor/TileTensor/) view for each
of the vectors. `TileTensor` provides a powerful abstraction for
multi-dimensional data with precise control over memory organization. It
supports various memory layouts (row-major, column-major, tiled),
hardware-specific optimizations, and efficient parallel access patterns.
We don't need all of these features for this tutorial, but in more
complex kernels, it's a useful tool for manipulating data. Even though it
isn't strictly necessary for this example, we'll use `TileTensor` so that
you can get familiar with it.

First, add the following import to the top of the file:

```mojo title="vector_addition.mojo"
from layout import TileTensor, row_major
```

A [`Layout`](/docs/layout/tile_layout/Layout/) is a representation of memory
layouts using shape and stride information, and it maps between logical
coordinates and linear memory indices. We'll need to use the same `Layout`
definition multiple times, so add the following `comptime` value to the top of
the file after the other `comptime` declarations:

```mojo title="vector_addition.mojo"
comptime layout = row_major[vector_size]()
```

Finally, add the following code to the end of the `main()` function
to create `TileTensor` views for each of the vectors:

```mojo title="vector_addition.mojo"
        # Wrap the DeviceBuffers in TileTensors
        lhs_tensor = TileTensor(lhs_device_buffer, layout)
        rhs_tensor = TileTensor(rhs_device_buffer, layout)
        result_tensor = TileTensor(result_device_buffer, layout)
```

## 9. Define the vector addition kernel function

Now we're ready to write the kernel function. First, add the following imports
(note that we've added `block_dim` to the list of imports from `gpu`):

```mojo title="vector_addition.mojo"
from std.gpu import block_dim, block_idx, thread_idx
from std.math import ceildiv
```

Then, add the following code to `vector_addition.mojo` just before the
`main()` function:

```mojo title="vector_addition.mojo"
# Calculate the number of thread blocks needed by dividing the vector size
# by the block size and rounding up.
comptime block_size = 256
comptime num_blocks = ceildiv(vector_size, block_size)

def vector_addition(
    lhs_tensor: TileTensor[float_dtype, type_of(layout), MutAnyOrigin],
    rhs_tensor: TileTensor[float_dtype, type_of(layout), MutAnyOrigin],
    out_tensor: TileTensor[float_dtype, type_of(layout), MutAnyOrigin],
):
    """Calculate the element-wise sum of two vectors on the GPU."""

    # Calculate the index of the vector element for the thread to process
    var tid = block_idx.x * block_dim.x + thread_idx.x

    # Don't process out of bounds elements
    if tid < vector_size:
        out_tensor[tid] = lhs_tensor[tid] + rhs_tensor[tid]
```

Our `vector_addition()` kernel function accepts the two input tensors and the
output tensor as arguments. We also need to know the size of the vector (which
we've defined with the `comptime` value `vector_size`) because it might not be a
multiple of the block size. In fact, in this example, the size of the vector is
1,000, which is not a multiple of our block size of 256. So as we assign our
threads to read elements from the tensor, we need to make sure we don't overrun
the bounds of the tensor.

The body of the kernel function starts by calculating the linear index of the
tensor element that a particular thread is responsible for. The `block_dim`
object (which we added to the list of imports) contains the dimensions of the
thread blocks as `x`, `y`, and `z` values. Because we're going to use a
one-dimensional grid of thread blocks, we need only the `x` dimension. We can
then calculate `tid`, the unique "global" index of the thread within the output
tensor, as `block_dim.x * block_idx.x + thread_idx.x`. For example, the `tid`
values for the threads in the first thread block range from 0 to 255, the `tid`
values for the threads in the second thread block range from 256 to 511, and so
on.

:::note

As a convenience, the [`gpu`](/docs/std/gpu/) package includes a `global_idx`
`comptime` value that contains the unique "global" `x`, `y`, and `z` indices of
the thread within the grid of thread blocks. So for our one-dimensional grid of
one-dimensional thread blocks, `global_idx.x` is equivalent to the value of
`tid` that we calculated above. However, for this tutorial, it's best that you
learn how to calculate `tid` manually so that you understand how the grid and
thread block dimensions work.

:::

The function then checks if the calculated `tid` is less than the size of the
output tensor. If it is, the thread reads the corresponding elements from the
`lhs_tensor` and `rhs_tensor` tensors, adds them together, and stores the result
in the corresponding element of the `out_tensor` tensor.

## 10. Invoke the kernel function and copy the output back to the CPU

The last step is to compile and invoke the kernel function, then copy the output
back to the CPU. Add the following code to the end of the `main()` function:

```mojo title="vector_addition.mojo"
        # Compile and enqueue the kernel
        ctx.enqueue_function[vector_addition](https://mojolang.org/docs/manual/gpu/lhs_tensor,
            rhs_tensor,
            result_tensor,
            grid_dim=num_blocks,
            block_dim=block_size,.md)

        # Create a HostBuffer for the result vector
        result_host_buffer = ctx.enqueue_create_host_buffer[float_dtype](https://mojolang.org/docs/manual/gpu/vector_size.md)

        # Copy the result vector from the DeviceBuffer to the HostBuffer
        ctx.enqueue_copy(
            dst_buf=result_host_buffer, src_buf=result_device_buffer
        )

        # Finally, synchronize the DeviceContext to run all enqueued operations
        ctx.synchronize()

        print("Result vector:", result_host_buffer)
```

### Click here to see the complete version of `vector_addition.mojo`.

```mojo title="vector_addition.mojo"
from std.math import ceildiv
from std.sys import has_accelerator
from std.gpu.host import DeviceContext
from std.gpu import block_dim, block_idx, thread_idx
from layout import TileTensor, row_major
# Vector data type and size
comptime float_dtype = DType.float32
comptime vector_size = 1000
comptime layout = row_major[vector_size]()
# Calculate the number of thread blocks needed by dividing the vector size
# by the block size and rounding up.
comptime block_size = 256
comptime num_blocks = ceildiv(vector_size, block_size)
def vector_addition(
lhs_tensor: TileTensor[float_dtype, type_of(layout), MutAnyOrigin],
rhs_tensor: TileTensor[float_dtype, type_of(layout), MutAnyOrigin],
out_tensor: TileTensor[float_dtype, type_of(layout), MutAnyOrigin],
):
"""Calculate the element-wise sum of two vectors on the GPU."""
# Calculate the index of the vector element for the thread to process
var tid = block_idx.x * block_dim.x + thread_idx.x
# Don't process out of bounds elements
if tid < vector_size:
out_tensor[tid] = lhs_tensor[tid] + rhs_tensor[tid]
def main() raises:
comptime if not has_accelerator():
print("No compatible GPU found")
else:
# Get the context for the attached GPU
ctx = DeviceContext()
# Create HostBuffers for input vectors
lhs_host_buffer = ctx.enqueue_create_host_buffer[float_dtype](https://mojolang.org/docs/manual/gpu/vector_size.md)
rhs_host_buffer = ctx.enqueue_create_host_buffer[float_dtype](https://mojolang.org/docs/manual/gpu/vector_size.md)
ctx.synchronize()
# Initialize the input vectors
for i in range(vector_size):
lhs_host_buffer[i] = Float32(i)
rhs_host_buffer[i] = Float32(Float64(i) * 0.5)
print("LHS buffer: ", lhs_host_buffer)
print("RHS buffer: ", rhs_host_buffer)
# Create DeviceBuffers for the input vectors
lhs_device_buffer = ctx.enqueue_create_buffer[float_dtype](https://mojolang.org/docs/manual/gpu/vector_size.md)
rhs_device_buffer = ctx.enqueue_create_buffer[float_dtype](https://mojolang.org/docs/manual/gpu/vector_size.md)
# Copy the input vectors from the HostBuffers to the DeviceBuffers
ctx.enqueue_copy(dst_buf=lhs_device_buffer, src_buf=lhs_host_buffer)
ctx.enqueue_copy(dst_buf=rhs_device_buffer, src_buf=rhs_host_buffer)
# Create a DeviceBuffer for the result vector
result_device_buffer = ctx.enqueue_create_buffer[float_dtype](https://mojolang.org/docs/manual/gpu/vector_size.md)
# Wrap the DeviceBuffers in TileTensors
lhs_tensor = TileTensor(lhs_device_buffer, layout)
rhs_tensor = TileTensor(rhs_device_buffer, layout)
result_tensor = TileTensor(result_device_buffer, layout)
# Compile and enqueue the kernel
ctx.enqueue_function[vector_addition](https://mojolang.org/docs/manual/gpu/lhs_tensor,
rhs_tensor,
result_tensor,
grid_dim=num_blocks,
block_dim=block_size,.md)
# Create a HostBuffer for the result vector
result_host_buffer = ctx.enqueue_create_host_buffer[float_dtype](https://mojolang.org/docs/manual/gpu/vector_size.md)
# Copy the result vector from the DeviceBuffer to the HostBuffer
ctx.enqueue_copy(
dst_buf=result_host_buffer, src_buf=result_device_buffer
)
# Finally, synchronize the DeviceContext to run all enqueued operations
ctx.synchronize()
print("Result vector:", result_host_buffer)
```

The `enqueue_function()` method enqueues the compilation and invocation
of the `vector_addition()` kernel function, passing the input and output tensors
as arguments. The `grid_dim` and `block_dim` arguments use the `num_blocks` and
`block_size` `comptime` values we defined in the previous step.

After the kernel function has been compiled and enqueued, we create a
`HostBuffer` to hold the result vector. Then we copy the result vector from the
`DeviceBuffer` to the `HostBuffer`. Finally, we synchronize the `DeviceContext`
to run all enqueued operations. After synchronizing, we can print the result
vector to the console.

At this point, the Mojo compiler determines that the `DeviceContext`, the
`DeviceBuffer`s, the `HostBuffer`s, and the `TileTensor`s are no longer used
and so it automatically invokes their destructors to free their allocated
memory. (For a detailed explanation of object lifetime and destruction in Mojo,
see the [Death of a value](/docs/manual/lifecycle/death/) section of the Mojo
Manual.)

Now it's time to run the program to see the results of our work.

```bash
mojo vector_addition.mojo
```

You should see the following output:

```output
LHS buffer:  HostBuffer([0.0, 1.0, 2.0, ..., 997.0, 998.0, 999.0])
RHS buffer:  HostBuffer([0.0, 0.5, 1.0, ..., 498.5, 499.0, 499.5])
Result vector: HostBuffer([0.0, 1.5, 3.0, ..., 1495.5, 1497.0, 1498.5])
```

And now that you're done with the tutorial, exit your project's virtual
environment:

```bash
exit
```

## Summary

In this tutorial, we've learned how to use Mojo's `gpu.host` package to write a
simple kernel function that performs an element-wise addition of two vectors. We
covered:

- Understanding basic GPU concepts like devices, grids, and thread blocks.
- Moving data between CPU and GPU memory.
- Writing and compiling a GPU kernel function.
- Executing parallel computations on the GPU.

## Next steps

Now that you understand the basics of GPU programming with Mojo, here are some
suggested next steps:

  
  
</ListingCards>
