Building a GPU Scheduler
Building a GPU job scheduler with queues and batching

Motivation

GPUs are weirdly easy to underutilize.

You can have expensive hardware sitting there, perfectly healthy, while requests pile up in front of it because the system feeding it is too naive. Or the opposite happens: you throw too much work at a GPU at once, memory blows up, latency becomes unpredictable, and the whole system turns fragile.

That trade-off is what made this project interesting to me.

For this project, I wanted to build a small scheduler in Rust that models a very real systems problem: how do you schedule GPU jobs efficiently when jobs arrive over time, have different memory requirements, and benefit from batching?

This post walks through the design I landed on: a scheduler with queues, batching, VRAM-aware dispatch, and worker simulation.

At a high level, the scheduler answers three questions:

  1. Which job should run next?
  2. When should we wait to form a better batch?
  3. Which GPU can actually fit that work?

Overview

The project lives here: gpu-scheduler (or locally, in my case, as a small Rust crate).

The setup is intentionally simple. I model three GPU workers:

  • two A100-like workers with 40 GB of VRAM
  • one L4-like worker with 24 GB of VRAM

Each incoming job has four main pieces of metadata:

  • a job ID
  • a queue name
  • a VRAM requirement
  • an expected runtime
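In Rust, that metadata maps naturally onto a small struct. This is a sketch with assumed field names, not necessarily the crate's exact definition:

```rust
use std::time::Duration;

// Sketch of the job metadata above; field names are assumptions,
// not necessarily the crate's exact definition.
#[derive(Debug, Clone)]
struct GpuJob {
    id: u64,
    queue: String,              // batch-compatibility key, e.g. a model family
    vram_gb: u32,               // VRAM requirement in gigabytes
    expected_runtime: Duration, // used by the simulated workers
}

fn main() {
    let job = GpuJob {
        id: 1,
        queue: "llama3-8b".to_string(),
        vram_gb: 16,
        expected_runtime: Duration::from_millis(250),
    };
    println!("{job:?}");
}
```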

The queue name represents work that is batch-compatible. For example, requests for the same model family can often be grouped into one execution batch, while unrelated requests cannot.

That means the core abstraction is not just “a list of jobs.” It’s a set of per-queue waiting lines.

Why batching matters

If you run one inference request at a time, the GPU spends too much time paying fixed overheads: kernel launches, model setup, memory movement, and framework bookkeeping.

Batching improves throughput because multiple requests share that fixed cost.

But there is a catch.

If you wait forever, latency becomes terrible. If you dispatch immediately, you leave throughput on the table. A scheduler has to sit right in the middle of that trade-off.

The policy I used is simple:

  • dispatch a batch once it reaches max_batch_size
  • or dispatch it once the oldest job has waited longer than max_batch_wait
  • or flush whatever remains when the queue is shutting down

This gives the scheduler two useful modes at once:

  • throughput mode when traffic is high and batches fill naturally
  • latency protection when traffic is sparse and the system should stop waiting
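Taken together, the three flush conditions reduce to a single predicate. The function name and signature here are my own sketch, not the crate's API:

```rust
use std::time::{Duration, Instant};

// The three flush conditions from the policy above, as one predicate.
// `max_batch_size` and `max_batch_wait` mirror the post's terms; the
// exact signature is an assumption.
fn should_dispatch(
    queue_len: usize,
    oldest_enqueued_at: Option<Instant>,
    max_batch_size: usize,
    max_batch_wait: Duration,
    shutting_down: bool,
) -> bool {
    if queue_len == 0 {
        return false;
    }
    // 1. the batch is full
    if queue_len >= max_batch_size {
        return true;
    }
    // 2. the oldest job has waited too long
    if let Some(t) = oldest_enqueued_at {
        if t.elapsed() >= max_batch_wait {
            return true;
        }
    }
    // 3. flush whatever remains on shutdown
    shutting_down
}

fn main() {
    let now = Instant::now();
    // a full batch dispatches immediately
    assert!(should_dispatch(8, Some(now), 8, Duration::from_millis(50), false));
    // a fresh partial batch keeps waiting
    assert!(!should_dispatch(3, Some(now), 8, Duration::from_millis(50), false));
    // shutdown flushes partial batches
    assert!(should_dispatch(3, Some(now), 8, Duration::from_millis(50), true));
    println!("policy checks passed");
}
```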

The queueing model

Internally, the scheduler stores pending work as a HashMap<String, VecDeque<GpuJob>>.

That gives each queue its own FIFO ordering while still letting the scheduler rotate across queues.

I also keep a queue_order deque, which acts like a round-robin list of non-empty queues. When a worker becomes idle, the scheduler walks this deque, checks which queue is ready, and tries to build a batch from it.

That detail matters because I did not want batches to mix unrelated queues together. If one queue represents llama3-8b requests and another represents sdxl, combining them would be unrealistic.

So batching is queue-local.

Every batch comes from exactly one queue.
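A minimal sketch of that rotation, with jobs reduced to bare IDs (the struct and method names are illustrative, not the crate's exact API):

```rust
use std::collections::{HashMap, VecDeque};

// Queue-local batch building with round-robin rotation, matching the
// model above. Jobs are reduced to bare IDs for brevity.
type JobId = u64;

struct Queues {
    pending: HashMap<String, VecDeque<JobId>>,
    queue_order: VecDeque<String>, // round-robin over non-empty queues
}

impl Queues {
    // Pop up to `max_batch` jobs from the next non-empty queue.
    // Every batch comes from exactly one queue.
    fn next_batch(&mut self, max_batch: usize) -> Option<(String, Vec<JobId>)> {
        for _ in 0..self.queue_order.len() {
            let name = self.queue_order.pop_front()?;
            let Some(q) = self.pending.get_mut(&name) else { continue };
            if q.is_empty() {
                continue; // stale entry; drop it from the rotation
            }
            let n = max_batch.min(q.len());
            let batch: Vec<JobId> = q.drain(..n).collect();
            if !q.is_empty() {
                self.queue_order.push_back(name.clone()); // keep rotating
            }
            return Some((name, batch));
        }
        None
    }
}

fn main() {
    let mut pending = HashMap::new();
    pending.insert("llama3-8b".to_string(), VecDeque::from(vec![1, 2, 3]));
    pending.insert("sdxl".to_string(), VecDeque::from(vec![4]));
    let mut qs = Queues {
        pending,
        queue_order: VecDeque::from(vec!["llama3-8b".to_string(), "sdxl".to_string()]),
    };

    // Batches alternate across queues but never mix them.
    assert_eq!(qs.next_batch(2), Some(("llama3-8b".to_string(), vec![1, 2])));
    assert_eq!(qs.next_batch(2), Some(("sdxl".to_string(), vec![4])));
    assert_eq!(qs.next_batch(2), Some(("llama3-8b".to_string(), vec![3])));
    assert_eq!(qs.next_batch(2), None);
    println!("round-robin batching ok");
}
```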

VRAM-aware dispatch

Not every worker can run every job.

If a job requires 24 GB of VRAM, it cannot run on a small worker. If a batch contains multiple jobs, their combined memory footprint also has to fit on the target worker.

So before dispatching a batch, the scheduler checks two constraints:

  1. each job in the batch must fit on the worker
  2. the total VRAM of the batch must stay under that worker’s capacity

Jobs whose VRAM requirement exceeds the largest worker in the cluster are rejected immediately.

That rejection path is important. A scheduler should not let impossible work sit in the queue forever.
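Both placement constraints fit in a few lines. This is an illustrative check with VRAM in whole gigabytes, not the crate's exact code:

```rust
// VRAM-aware placement: the two constraints above, as one check.
fn batch_fits(job_vram_gb: &[u32], worker_capacity_gb: u32) -> bool {
    // 1. each job in the batch must fit on the worker on its own
    let each_fits = job_vram_gb.iter().all(|&v| v <= worker_capacity_gb);
    // 2. the total VRAM of the batch must stay under capacity
    let total: u32 = job_vram_gb.iter().sum();
    each_fits && total <= worker_capacity_gb
}

fn main() {
    // two 16 GB jobs fit a 40 GB A100-like worker...
    assert!(batch_fits(&[16, 16], 40));
    // ...but not a 24 GB L4-like worker
    assert!(!batch_fits(&[16, 16], 24));
    // a 48 GB job exceeds every worker and gets rejected up front
    assert!(!batch_fits(&[48], 40));
    println!("placement checks passed");
}
```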

Worker simulation

For this proof of concept, the workers are simulated rather than connected to a real inference backend.

Each worker gets its own async task and receives JobBatch values over a Tokio mpsc channel. When a batch arrives, the worker sleeps for an estimated runtime and then sends a completion event back to the scheduler.

Why simulate workers instead of calling a real GPU runtime?

I wanted to focus first on the scheduling logic itself: queueing, batching, dispatch, and fairness.

Once those mechanics are stable, swapping the sleep-based worker for a real HTTP/gRPC call is mostly an integration step. The hard systems question here is not “how do I call a model server?” It is “how do I decide what work should run, where, and when?”

This design creates a clean feedback loop:

  • clients submit jobs into the scheduler
  • the scheduler groups jobs into batches and assigns them to workers
  • workers report completion
  • workers that finish go back into the idle pool
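The actual workers are async Tokio tasks; as a dependency-free stand-in, the same loop can be sketched with OS threads and std channels:

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Thread-based stand-in for the Tokio worker task described above: the
// worker receives batches, "runs" them by sleeping for the estimated
// runtime, and reports completion back. Field names are illustrative.
struct JobBatch {
    worker_id: usize,
    job_ids: Vec<u64>,
    estimated_runtime: Duration,
}

fn main() {
    let (batch_tx, batch_rx) = mpsc::channel::<JobBatch>();
    let (done_tx, done_rx) = mpsc::channel::<(usize, Vec<u64>)>();

    let worker = thread::spawn(move || {
        for batch in batch_rx {
            thread::sleep(batch.estimated_runtime); // simulated execution
            done_tx.send((batch.worker_id, batch.job_ids)).unwrap();
        }
    });

    batch_tx
        .send(JobBatch {
            worker_id: 0,
            job_ids: vec![1, 2],
            estimated_runtime: Duration::from_millis(10),
        })
        .unwrap();
    drop(batch_tx); // close the channel so the worker loop ends

    let (worker_id, jobs) = done_rx.recv().unwrap();
    assert_eq!((worker_id, jobs.len()), (0, 2));
    worker.join().unwrap();
    println!("worker completed batch");
}
```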

The main scheduling loop

The scheduler itself is one async event loop driven by three kinds of events:

  • new job arrivals
  • worker completions
  • periodic dispatch ticks

That means the scheduler is always reacting to fresh state. It doesn’t need a heavyweight orchestration layer. It just needs to keep enough metadata around to make the next good decision.

In pseudocode, the loop looks like this:

loop {
    tokio::select! {
        Some(job) = job_rx.recv() => enqueue(job),
        Some(done) = completion_rx.recv() => mark_worker_idle(done),
        _ = tick.tick() => {}
    }

    dispatch_ready_batches();
}

What I like about this pattern is that it stays small even as the logic gets richer.

The complexity lives in the scheduling policy, not in the control flow.

Throughput vs latency

This project made one thing very obvious: schedulers are mostly about trade-offs.

If you increase batch size, throughput usually improves, but some jobs wait longer.

If you decrease max_batch_wait, latency improves, but you dispatch more partial batches and lose some efficiency.
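A back-of-the-envelope model makes the throughput side concrete. If every dispatch pays a fixed overhead and each job adds a per-job cost, batch throughput is b / (fixed + b × per_job). The 5 ms and 2 ms figures below are made up for illustration:

```rust
// Toy model of batching economics: larger batches amortize the fixed
// per-dispatch overhead. The cost figures are illustrative, not measured.
fn throughput_jobs_per_s(batch: f64, fixed_ms: f64, per_job_ms: f64) -> f64 {
    batch / (fixed_ms + batch * per_job_ms) * 1000.0
}

fn main() {
    let (fixed, per_job) = (5.0, 2.0);
    let t1 = throughput_jobs_per_s(1.0, fixed, per_job); // one job per dispatch
    let t8 = throughput_jobs_per_s(8.0, fixed, per_job); // eight jobs per dispatch
    assert!(t8 > 2.0 * t1); // batching more than doubles throughput here
    println!("b=1: {t1:.0} jobs/s, b=8: {t8:.0} jobs/s");
}
```

The same model also shows the cost: at batch size 8, the first job to arrive waits for seven more before anything runs.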

If you have heterogeneous workers, placement also matters. A larger GPU can absorb more jobs in a batch, which improves throughput, but it may also be the only place that can run certain jobs at all.

Even in this small simulation, those tensions show up immediately.

With one sample run:

  • 32 jobs were enqueued
  • all 32 were completed
  • work was grouped into 14 batches
  • average wait time came in around 122 ms
  • throughput landed around 50.87 jobs/s

Those numbers are simulation results, not production benchmarks, but they still tell the right story: batching helps, but only if the scheduler knows when to stop waiting.

Testing the scheduler

I added tests for the core scheduling behaviors:

  • jobs from the same queue batch together
  • partial batches flush once max_batch_wait is exceeded
  • queues stay isolated from one another during batching
  • oversized jobs are rejected instead of blocking the system

That test suite matters because scheduler bugs are rarely obvious. A system can “work” while quietly violating fairness, starving a queue, or building invalid batches.

Testing the policy directly is much better than just staring at logs and hoping the runtime behavior looks reasonable.
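As one example of the shape these tests take, the oversized-job case reduces to a standalone check. The error type and function names here are illustrative assumptions, not the crate's API:

```rust
// Submit-time validation: a job that cannot fit on the largest worker
// in the cluster is rejected instead of queuing forever.
#[derive(Debug, PartialEq)]
enum SubmitError {
    TooLarge { required_gb: u32, max_gb: u32 },
}

fn validate(required_gb: u32, worker_vram_gb: &[u32]) -> Result<(), SubmitError> {
    let max_gb = worker_vram_gb.iter().copied().max().unwrap_or(0);
    if required_gb > max_gb {
        return Err(SubmitError::TooLarge { required_gb, max_gb });
    }
    Ok(())
}

fn main() {
    let cluster = [40, 40, 24]; // two A100-like workers, one L4-like
    assert!(validate(24, &cluster).is_ok());
    assert_eq!(
        validate(48, &cluster),
        Err(SubmitError::TooLarge { required_gb: 48, max_gb: 40 })
    );
    println!("oversized jobs rejected");
}
```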

Why Rust felt good here

Rust is a very natural fit for this kind of project.

The data model is explicit, ownership makes state transitions easier to reason about, and Tokio gives a clean way to build event loops, channels, and worker tasks without making the code feel magical.

I also like that Rust nudges you toward making invalid states harder to represent. That becomes useful quickly in schedulers, where there are lots of moving parts: idle workers, active workers, queue state, batch state, and completion events.

Reflections

What I enjoyed most about this project is that it sits right at the intersection of systems and product thinking.

On the systems side, it is about queues, backpressure, resource constraints, and event-driven control flow.

On the product side, it is really about user experience. Do requests complete quickly? Do they queue fairly? Does the platform use hardware efficiently? Can it handle bursts gracefully?

That is what makes scheduling fun. It looks like infrastructure, but it directly shapes the experience that users feel.

This proof of concept is still a simulation, but I like it because it captures the real heart of the problem. Once jobs arrive continuously and resources are constrained, the question is no longer just “can I run this?” It becomes “what should run next, and what am I optimizing for?”

And that, to me, is where systems start to get interesting.
