Building a GPU Kernel and a Collective in Rust
Softmax, ring allreduce, and what they taught me about systems

Motivation

I’ve been wanting to understand the systems side of machine learning more deeply.

A lot of modern ML feels magical from the outside. You call a high-level API, tensors move around, kernels launch, collectives synchronize, and everything somehow works. But underneath that convenience is a lot of low-level coordination: memory movement, numerical stability, synchronization, communication patterns, and hardware-aware programming.

This project was my attempt to get closer to that layer.

I built two small but important pieces:

  1. a fused row-wise softmax kernel in CUDA C
  2. a ring allreduce in pure Rust using threads and message passing

Neither of these is large enough to be production infrastructure. That wasn’t the point. I wanted to understand the shape of the problems that show up in inference systems: how work is split, how data moves, and where the tricky details live.

Overview

Softmax and allreduce show up everywhere in modern ML systems.

Softmax turns logits into probabilities. In attention, it is part of the core inner loop. It also has a subtle requirement: it needs to be numerically stable. If you compute exp(x) directly on large values, the result overflows to infinity, and the division by the (also infinite) sum turns the output into NaNs.

AllReduce is a collective communication pattern. Multiple workers each hold some local data, and after the operation, every worker has the same reduced result. In practice, this is used heavily in distributed training and inference systems.
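To pin down the semantics before worrying about how data moves, here is a minimal sketch of what allreduce computes, written as plain Rust with no communication at all (the input vectors are the same ones I later used for testing):

```rust
// Element-wise sum across workers: after an allreduce,
// every worker holds the same reduced vector.
fn allreduce_sum(workers: &mut Vec<Vec<f32>>) {
    let len = workers[0].len();
    // Compute the element-wise sum of all local vectors.
    let mut reduced = vec![0.0f32; len];
    for w in workers.iter() {
        for (r, x) in reduced.iter_mut().zip(w) {
            *r += x;
        }
    }
    // Every worker ends up with the same result.
    for w in workers.iter_mut() {
        w.copy_from_slice(&reduced);
    }
}

fn main() {
    let mut workers = vec![
        vec![10.0, 20.0, 30.0],
        vec![1.0, 2.0, 3.0],
        vec![4.0, 5.0, 6.0],
    ];
    allreduce_sum(&mut workers);
    for w in &workers {
        println!("{:?}", w); // every worker: [15.0, 27.0, 39.0]
    }
}
```

The ring algorithm described later computes exactly this result; the interesting part is that it does so without any worker ever seeing all the inputs at once.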

This post is not about building a full distributed runtime or a highly optimized attention kernel. It’s about understanding the core mechanics through two focused implementations.

Why these two pieces?

I picked these because they sit on opposite sides of systems work.

The softmax kernel is about compute inside one device. You care about threads, warps, reductions, and memory hierarchy.

The ring allreduce is about coordination across workers. You care about topology, chunk movement, and synchronization between participants.

Together, they make a nice introduction to the kind of thinking that kernel and infrastructure work demands.

Setup

I used Rust for the host-side code and CUDA C for the kernel.

The CUDA kernel lives in kernels/softmax.cu. Rust calls it through cudarc, which is gated behind a cuda feature so the project still builds on machines without CUDA.

For the collective, I stayed entirely on the CPU. I simulated workers with Rust threads and connected them with std::sync::mpsc channels in a ring.

That split was intentional. I wanted the GPU part to focus on kernels, and the distributed part to focus on communication.

Ring AllReduce

The ring allreduce was the part that first made the communication pattern click for me.

At a high level, each worker starts with its own vector. The goal is for everyone to end up with the element-wise sum of all vectors.

I implemented the ring algorithm in two phases:

  1. scatter-reduce
  2. allgather

In scatter-reduce, workers pass chunks around the ring and add what they receive into their local buffer. After N - 1 steps, each worker owns one fully reduced chunk.

In allgather, workers circulate those finished chunks so everyone reconstructs the full reduced vector.
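The two phases can be sketched sequentially, which makes the chunk arithmetic visible without the noise of threads. This is a simulation using one common indexing convention, `(r - s) mod N`, not necessarily the exact formula in my threaded version:

```rust
// Sequential sketch of ring allreduce over N workers.
// Each vector is split into N chunks; (r - s) mod N style
// index arithmetic decides which chunk moves at each step.
fn ring_allreduce(bufs: &mut [Vec<f32>]) {
    let n = bufs.len();
    let chunk = bufs[0].len() / n; // assumes len divisible by n
    let idx = |i: i64| i.rem_euclid(n as i64) as usize;

    // Phase 1: scatter-reduce. At step s, worker r sends chunk
    // (r - s) mod n to its right neighbor, which adds it in.
    for s in 0..n as i64 - 1 {
        // Snapshot all outgoing chunks first, as if sends were simultaneous.
        let incoming: Vec<Vec<f32>> = (0..n as i64)
            .map(|r| {
                let c = idx(r - s);
                bufs[r as usize][c * chunk..(c + 1) * chunk].to_vec()
            })
            .collect();
        for r in 0..n as i64 {
            let from = (r - 1).rem_euclid(n as i64); // left neighbor
            let c = idx(from - s); // chunk that neighbor sent
            for (dst, src) in bufs[r as usize][c * chunk..(c + 1) * chunk]
                .iter_mut()
                .zip(&incoming[from as usize])
            {
                *dst += src;
            }
        }
    }

    // Phase 2: allgather. Worker r now owns fully reduced chunk
    // (r + 1) mod n and circulates finished chunks unchanged.
    for s in 0..n as i64 - 1 {
        let incoming: Vec<Vec<f32>> = (0..n as i64)
            .map(|r| {
                let c = idx(r + 1 - s);
                bufs[r as usize][c * chunk..(c + 1) * chunk].to_vec()
            })
            .collect();
        for r in 0..n as i64 {
            let from = (r - 1).rem_euclid(n as i64);
            let c = idx(from + 1 - s);
            bufs[r as usize][c * chunk..(c + 1) * chunk]
                .copy_from_slice(&incoming[from as usize]);
        }
    }
}

fn main() {
    let mut bufs = vec![
        vec![10.0f32, 20.0, 30.0],
        vec![1.0, 2.0, 3.0],
        vec![4.0, 5.0, 6.0],
    ];
    ring_allreduce(&mut bufs);
    println!("{:?}", bufs); // all three: [15.0, 27.0, 39.0]
}
```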

What I like about this algorithm is that it forces you to think in terms of ownership and movement. No worker needs the full answer immediately. The final result emerges from local interactions.

Why use threads and channels instead of processes?

I used threads because I wanted a simple single-node setup and easy local testing.

The important idea here isn’t process isolation. It’s message passing. Channels make the communication pattern explicit, which maps nicely to how real distributed collectives work across machines.

The function signature ended up looking like this:

  • data: &mut Vec<f32> for the worker’s local buffer
  • rank and world_size to identify the worker in the ring
  • send and recv channel endpoints for right and left neighbors

That was a useful design constraint. Once I committed to the ring topology, the implementation became much easier to reason about.
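The wiring itself looks roughly like this. It's a sketch of the topology rather than my actual implementation: to keep it small, each worker circulates scalar ranks instead of vector chunks, and `worker` is an illustrative stand-in for the real reduce function:

```rust
use std::sync::mpsc::{channel, Receiver, Sender};
use std::thread;

// Ring wiring sketch: worker r sends to (r + 1) % n and receives
// from (r - 1 + n) % n. Each worker forwards what it receives,
// so after n - 1 steps everyone has seen every rank.
fn worker(rank: usize, world_size: usize, send: Sender<usize>, recv: Receiver<usize>) -> usize {
    let mut acc = rank;
    let mut outgoing = rank;
    for _ in 0..world_size - 1 {
        send.send(outgoing).unwrap();
        let incoming = recv.recv().unwrap();
        acc += incoming;
        outgoing = incoming; // forward what we received
    }
    acc
}

fn main() {
    let n = 4;
    // One channel per ring edge: channel r is written by worker r
    // and read by worker (r + 1) % n.
    let (senders, receivers): (Vec<_>, Vec<_>) =
        (0..n).map(|_| channel::<usize>()).unzip();
    let mut receivers: Vec<Option<Receiver<usize>>> =
        receivers.into_iter().map(Some).collect();

    let mut handles = Vec::new();
    for (rank, send) in senders.into_iter().enumerate() {
        // Worker `rank` receives on the channel its left neighbor writes.
        let recv = receivers[(rank + n - 1) % n].take().unwrap();
        handles.push(thread::spawn(move || worker(rank, n, send, recv)));
    }
    for h in handles {
        let total = h.join().unwrap();
        assert_eq!(total, (0..n).sum::<usize>()); // every worker: 0 + 1 + 2 + 3
    }
}
```

Because mpsc channels are buffered, each worker can send before it receives, so the ring never deadlocks in this send-then-receive pattern.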

The trickiest part of the ring

The hard part wasn’t spawning threads. It was indexing the right chunk at the right step.

At each step, a worker sends one chunk and receives another. Those chunk indices rotate around the ring, and if you get the arithmetic wrong, the whole algorithm quietly falls apart.

What helped me most was reducing the problem to a tiny example with 3 workers and 3-element vectors. Once I manually tracked which worker should send chunk 0, chunk 1, and chunk 2 at each step, the modulo arithmetic finally made sense.

That was a recurring pattern in this project: when the abstraction felt slippery, small concrete examples made it solid.

Softmax on the GPU

The softmax kernel taught me a different lesson: GPUs reward structure.

The operation itself is straightforward:

  1. find the max value in a row
  2. subtract it for numerical stability
  3. compute exponentials
  4. sum them
  5. divide each value by the sum
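The same five steps make a compact CPU reference in Rust. This mirrors the math, not the CUDA kernel, and it's the kind of oracle I checked the GPU output against:

```rust
// Row-wise numerically stable softmax, CPU reference version.
// `data` holds a row-major matrix with `cols` columns.
fn softmax_rows(data: &mut [f32], cols: usize) {
    for row in data.chunks_mut(cols) {
        // 1. find the row max
        let max = row.iter().copied().fold(f32::NEG_INFINITY, f32::max);
        // 2-3. subtract it for stability, then exponentiate
        for x in row.iter_mut() {
            *x = (*x - max).exp();
        }
        // 4. sum the exponentials
        let sum: f32 = row.iter().sum();
        // 5. divide each value by the sum
        for x in row.iter_mut() {
            *x /= sum;
        }
    }
}

fn main() {
    // Large logits that would overflow exp() without the max subtraction.
    let mut m = vec![1000.0f32, 1001.0, 1002.0, 3.0, 1.0, 0.2];
    softmax_rows(&mut m, 3);
    for row in m.chunks(3) {
        let sum: f32 = row.iter().sum();
        println!("{:?} (sum = {})", row, sum); // each row sums to ~1
    }
}
```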

The tricky part is that the max and the sum are both reductions across the row. If one thread block handles one row, the threads inside that block need to cooperate.

So the kernel does three passes over the row:

  • pass 1: compute the row max
  • pass 2: compute the sum of exp(x - max)
  • pass 3: write normalized outputs

This is the stable form of softmax. Without subtracting the max first, large logits can overflow during exponentiation.

Warp-level and block-level reductions

This was the most GPU-specific part of the project.

Each thread processes a strided subset of columns. That gives a local partial max or partial sum. Then those thread-local values need to be reduced across the block.

I used a two-level strategy:

  1. warp-level reduction with __shfl_down_sync
  2. block-level reduction with shared memory

Within a warp, threads can exchange register values directly. That’s fast and avoids shared memory for the first stage of the reduction.

Then each warp writes one partial result into shared memory, and the first warp reduces those partials down to the final block-wide value.

This pattern showed up twice: once for max, once for sum.
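The structure is easier to see stripped of CUDA syntax. Here is a CPU analogue in Rust for the max case; it only mimics the shape of the reduction (real warp shuffles exchange registers in lockstep, which a sequential loop can't reproduce), and the warp size of 32 matches NVIDIA hardware:

```rust
// CPU analogue of a two-level block reduction: "threads" produce
// strided partial maxima, each "warp" tree-reduces its lanes, and
// the per-warp results are reduced to one block-wide value.
const WARP: usize = 32;

fn block_max(row: &[f32], block_threads: usize) -> f32 {
    // Stage 0: thread t reads columns t, t + T, t + 2T, ...
    let mut lane_vals: Vec<f32> = (0..block_threads)
        .map(|t| {
            row.iter()
                .skip(t)
                .step_by(block_threads)
                .copied()
                .fold(f32::NEG_INFINITY, f32::max)
        })
        .collect();

    // Stage 1: per-warp tree reduction (the __shfl_down_sync stage).
    let mut warp_results = Vec::new();
    for warp in lane_vals.chunks_mut(WARP) {
        let mut offset = WARP / 2;
        while offset > 0 {
            for i in 0..offset.min(warp.len().saturating_sub(offset)) {
                warp[i] = warp[i].max(warp[i + offset]);
            }
            offset /= 2;
        }
        warp_results.push(warp[0]);
    }

    // Stage 2: reduce the per-warp partials (the shared-memory stage).
    warp_results.into_iter().fold(f32::NEG_INFINITY, f32::max)
}

fn main() {
    let row: Vec<f32> = (0..1000).map(|i| ((i * 37) % 997) as f32).collect();
    println!("block max = {}", block_max(&row, 256));
}
```

Swapping `max` for `+` (with an identity of 0 instead of negative infinity) gives the sum variant, which is why the same pattern served both passes.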

Why not just use shared memory for everything?

You can, and it works.

But warp shuffle instructions are a good fit for reductions because they let threads inside a warp communicate directly through registers. It’s a cleaner and often faster first stage before combining results across warps.

Calling CUDA from Rust

I wanted the Rust side to stay simple.

The project uses build.rs to compile softmax.cu into PTX with nvcc if CUDA is available. Then Rust loads that PTX through cudarc, allocates device buffers, launches the kernel, and copies the result back.

One small detail that mattered: I gated the CUDA path behind a Cargo feature so cargo check still works on machines without NVIDIA hardware.

That made development much smoother, especially since I was working locally on a MacBook and only using a remote 3090 box for CUDA testing.

What broke

A few issues were especially instructive.

The first was toolchain compatibility. The remote machine had CUDA 13 installed, which meant the older cudarc version I started with didn’t work. Upgrading that dependency was necessary before the CUDA path would even compile.

The second was kernel symbol lookup. The kernel needed to be exported with extern "C" so the Rust side could load it by name from the PTX.

The third was more subtle: for small row sizes, my launch configuration launched fewer threads than one full warp. The reduction logic assumed warp-sized execution, so that produced bad outputs until I fixed the thread count.
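The fix amounts to rounding the block size up to a multiple of the warp size. On the host side that choice looks something like this (the cap of 256 threads is an illustrative default, not a universal rule):

```rust
const WARP_SIZE: u32 = 32;
const MAX_THREADS: u32 = 256; // illustrative cap per block

// Pick a block size for a row of `cols` elements: enough threads
// to cover the row (up to the cap), rounded up to a full warp so
// the warp-level reduction never runs with a partial warp.
fn block_threads(cols: u32) -> u32 {
    let wanted = cols.min(MAX_THREADS);
    (wanted + WARP_SIZE - 1) / WARP_SIZE * WARP_SIZE
}

fn main() {
    for cols in [8u32, 32, 100, 5000] {
        println!("cols = {cols} -> {} threads", block_threads(cols));
    }
}
```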

I like these bugs because they weren’t random. Each one pointed directly at a systems boundary: toolchain, ABI, or execution model.

Verification

I tested the ring allreduce locally and the CUDA path on a remote RTX 3090.

For allreduce, I started with a small correctness case:

  • worker 0: [10, 20, 30]
  • worker 1: [1, 2, 3]
  • worker 2: [4, 5, 6]

The expected reduced result is [15, 27, 39], and every worker should end up with that vector.

For softmax, I used simple checks that are easy to reason about:

  • each row should sum to 1
  • uniform input should produce uniform probabilities
  • multiple rows should each normalize independently

Those tests were much more useful than starting with performance numbers. First I needed to trust the implementations.

Benchmarks

I measured two things:

  1. isolated ring allreduce latency on my machine
  2. isolated softmax kernel latency on a remote RTX 3090

This is much better than timing the entire binary from process start to exit. For the ring benchmark, I timed repeated allreduce runs inside one process. For the CUDA benchmark, I reused one CUDA context and one loaded module, then timed repeated kernel launches with synchronization per iteration.
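A harness for that kind of measurement is small. This is a sketch of the shape I used, not the exact code; the warmup and iteration counts are arbitrary choices:

```rust
use std::time::Instant;

// Time a closure `iters` times and report (median, p95) in microseconds.
fn bench<F: FnMut()>(mut f: F, iters: usize) -> (f64, f64) {
    for _ in 0..10 {
        f(); // warmup: amortize one-time setup and cold caches
    }
    let mut samples: Vec<f64> = (0..iters)
        .map(|_| {
            let t0 = Instant::now();
            f();
            t0.elapsed().as_secs_f64() * 1e6
        })
        .collect();
    samples.sort_by(|a, b| a.partial_cmp(b).unwrap());
    (samples[iters / 2], samples[(iters * 95) / 100])
}

fn main() {
    let data: Vec<f32> = (0..16_384).map(|i| i as f32).collect();
    let (median, p95) = bench(
        || {
            let s: f32 = data.iter().sum();
            std::hint::black_box(s); // keep the work from being optimized away
        },
        1000,
    );
    println!("median = {median:.2} us, p95 = {p95:.2} us");
}
```

For the CUDA path, the closure would launch the kernel and synchronize the device per iteration, with the context and module created once outside the loop.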

Ring allreduce (M2 Pro, macOS)

World Size   Vector Length   Median       P95          Min         Max
4            1,024           110.00 us    177.38 us    88.92 us    182.67 us
4            16,384          206.29 us    267.71 us    151.50 us   295.50 us
8            16,384          356.17 us    433.12 us    304.00 us   456.04 us
8            262,144         1109.83 us   1551.67 us   945.46 us   1766.54 us

These numbers look how I would expect a thread-and-channel simulation to look: more workers and larger vectors both push the latency up, but the growth is smooth and easy to reason about.

Softmax kernel (RTX 3090)

Rows   Cols   Median     P95        Min        Max
128    128    6.80 us    7.46 us    6.63 us    13.36 us
1024   1024   18.77 us   19.53 us   18.08 us   25.87 us
4096   1024   48.93 us   49.73 us   48.47 us   52.13 us

This result is much more satisfying than the earlier end-to-end timings because it actually reflects the kernel path itself. As the matrix grows, latency rises, but it stays in the low-microsecond to tens-of-microseconds range for the shapes I tested.

I still wouldn’t present this as a production-grade performance study. There is no baseline against a naive GPU implementation, no comparison against a framework kernel, and no throughput or bandwidth analysis yet. But as a focused systems project, these measurements finally match the thing I wanted to understand.

What this project taught me

The biggest lesson here is that systems work becomes much easier once you stop treating it like magic.

A collective is just a communication pattern with local rules.

A GPU kernel is just a lot of threads cooperating under hardware constraints.

Of course, the details matter. Very quickly, you run into numerical stability, reduction structure, launch configuration, symbol loading, and compatibility issues. But those details start feeling much less mysterious once you’ve built a small version yourself.

That was the real value of this project for me. Not that I built the world’s fastest softmax or a production-grade collective, but that I now have a much more concrete mental model of how these pieces work.

And that’s exactly the kind of understanding I was hoping to build.

If you want to explore the code, the project is on GitHub.
