Image Processing without OOM
Understanding stream processing in Rust

Inspired by Daft’s “Processing 300K Images Without OOM” by Colin Ho: https://www.daft.ai/blog/processing-300k-images-without-oom

Motivation

I’ve always found distributed systems fascinating–the idea that you can coordinate work across machines, handle failures gracefully, and scale to massive datasets. This blog post walks through the first step by building a system capable of processing millions of images efficiently.

What surprised me most is how quickly memory spikes show up once you let multiple decoded images exist at the same time. You can’t just throw concurrency at a problem and hope it works. You have to actively cap it to keep memory stable.

The roots of distributed systems live in these same concerns, scaled across machines over a network. The patterns I explore here–streaming pipelines, semaphores, backpressure–show up in real-world systems like MapReduce, Apache Spark, and any system that moves data through a pipeline. Starting with concurrency provides a great foundation for understanding them.

I hope I can share this fascination with you today.

👋 Click me!

I wanted this post to be as accessible as possible. If you’re newer to systems or async programming, the collapsible sections cover background concepts and definitions to bring you up to speed.

Overview

A digital image is a noisy sample of the real world. To make it useful for tasks like machine learning, we process it into inputs that models can consume. Image processing can include resizing, color conversion, and intensity normalization.

In this blog post, when I say image processing, I mean resizing images. The goal isn’t to understand the specific details, but to see how it scales with many inputs and what bottlenecks show up.

This project is called Flux because the core idea is to keep data flowing through a pipeline instead of batching everything into a single step. The pipeline stays in motion, and so does memory.

I compare three approaches: naive, batched, and streamed processing.

OOM and Memory Inflation

Before I explain the multi-tier approach, let’s define OOM: out of memory. Daft’s article explains it well, and here’s a summary:

When you process a single image, memory usage grows like this: URL (1 KB) -> JPEG file (500 KB) -> decoded pixels (3-12 MB) -> resizing (temporary buffers double usage) -> final encoding (back to ~200 KB).

Why decode and encode?

JPEG is compressed. Operations like resizing work only on raw pixels, so we decode, resize, then encode (compress) back to JPEG. The finer details of such operations have their roots in information theory, which I’ll save for another day.

Each image goes through four steps: download, decode, resize, save. The tiers differ in how they trade off throughput and peak memory. Now let's go through each of them.
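Here's roughly what those four steps look like for a single image, as a minimal sketch assuming the reqwest and image crates (the function name, output size, and error handling are illustrative, not Flux's actual code):

```rust
use image::imageops::FilterType;

// Hypothetical end-to-end helper for one image.
async fn process_image(url: &str, out_path: &str) -> Result<(), Box<dyn std::error::Error + Send + Sync>> {
    // 1. Download: fetch the compressed JPEG bytes (~hundreds of KB).
    let bytes = reqwest::get(url).await?.bytes().await?;

    // 2. Decode: expand into raw pixels (this is where memory inflates to several MB).
    let img = image::load_from_memory(&bytes)?;

    // 3. Resize: allocates a second, smaller pixel buffer alongside the original.
    let resized = img.resize(256, 256, FilterType::Lanczos3);

    // 4. Save: re-encode (compress) and write to disk; format is inferred from the extension.
    resized.save(out_path)?;
    Ok(())
}
```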

Setup

Here’s the setup I’m using to demonstrate the tiered approach:

  1. Placeholder images from Lorem Picsum
  2. Rust + Tokio for processing. Tokio is an async Rust runtime that uses green threads and has its own scheduler to manage spawned tasks.

Concurrency vs Parallelism

These terms are used interchangeably but there’s a very important difference between the two.

A computer program runs as a process. A process can spawn multiple threads for different tasks, and those threads share the process's resources: memory, file handles, CPU time.

Concurrency means structuring a program so that multiple tasks make progress by interleaving in time. An ideal concurrent program keeps some task busy at every moment, with no time wasted idling while another task waits.

Parallelism means tasks literally run at the same instant, on multiple CPU cores. Concurrency is about structure (interleaving work); parallelism is about execution (doing work simultaneously).

Naive Processing

This tier is straightforward: loop over URLs, download, decode, resize, save. It’s simple and slow. It also makes it easy to see the raw memory inflation problem because each image is processed in isolation. The downside is that all I/O and CPU work is serialized (one after another, never overlapping), so throughput is terrible.
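In code, the naive tier is just a sequential loop over the per-image helper sketched earlier (a minimal sketch; output paths are illustrative):

```rust
// Naive tier: every step of every image runs one after another.
// Nothing overlaps, so total time is roughly the sum of all downloads and resizes.
async fn run_naive(urls: Vec<String>) {
    for (i, url) in urls.iter().enumerate() {
        if let Err(e) = process_image(url, &format!("out/naive_{i}.jpg")).await {
            eprintln!("image {i} failed: {e}");
        }
    }
}
```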

Batched Processing

This tier runs naive processing concurrently for a fixed batch size. We build a vector of URLs, chunk into batches, and spawn one task per URL. We collect the tasks’ handles, then await the batch with join_all.
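A sketch of this tier, reusing the hypothetical process_image helper from earlier (the batch size and paths are illustrative):

```rust
use futures::future::join_all;

// Batched tier: spawn one task per URL within a batch, wait for the whole
// batch with join_all, then move on to the next batch.
async fn run_batched(urls: Vec<String>, batch_size: usize) {
    for (b, batch) in urls.chunks(batch_size).enumerate() {
        let handles: Vec<_> = batch
            .iter()
            .enumerate()
            .map(|(i, url)| {
                let url = url.clone();
                let out = format!("out/batch_{b}_{i}.jpg");
                tokio::spawn(async move {
                    if let Err(e) = process_image(&url, &out).await {
                        eprintln!("{url} failed: {e}");
                    }
                })
            })
            .collect();

        // Wait for every task in this batch before starting the next one.
        join_all(handles).await;
    }
}
```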

What are handles?

When we spawn a new task using tokio::spawn, we get back a JoinHandle<()>. A JoinHandle is a future: awaiting it suspends the current task until the spawned task behind that handle has finished, and then hands back its result.

In Flux, since I've spawned many tasks, a more ergonomic way to wait for them is join_all, which takes a vector of handles and awaits them all together, completing once every task in the batch has finished instead of awaiting each handle one by one in a loop.

Meaning of await

In an async function, await yields control back to the runtime until the awaited task is ready. This lets other tasks run instead of blocking the thread.

This repeats for each batch, which boosts throughput because more images are processed per unit time.

Are the tasks in a batch concurrent or parallel?

Batched processing is concurrent. Tokio schedules async tasks on a worker threadpool. If you have multiple CPU cores, some work can run in parallel (since tasks can be executed on multiple OS threads, which run on multiple cores). But the core idea here is concurrency: tasks interleave and share time on the same pool.
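For reference, the parallelism comes from opting into Tokio's multi-threaded runtime when the program starts; a minimal sketch (the worker count is illustrative):

```rust
// The multi_thread flavor runs tasks on a pool of OS worker threads,
// so independent tasks can execute in parallel when cores are available.
#[tokio::main(flavor = "multi_thread", worker_threads = 4)]
async fn main() {
    let handles: Vec<_> = (0..8)
        .map(|i| tokio::spawn(async move { format!("task {i} done") }))
        .collect();

    for handle in handles {
        println!("{}", handle.await.unwrap());
    }
}
```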

Streamed Processing

This tier breaks processing into stages that use different resources. The key idea is to overlap I/O and CPU work while keeping memory bounded.

Flux streaming pipeline

Download: this sends a GET request to Lorem Picsum. It is I/O-bound and can be awaited.

Process (resize): this is CPU-bound. It can’t be async because there’s no waiting involved.

CPU-bound vs I/O-bound Tasks

CPU-bound tasks utilize the CPU and block the thread. They cannot be awaited because they’re constantly using resources and don’t wait on anything.

I/O-bound tasks are usually waiting on input or output. The CPU isn’t being used constantly. Downloading an image is an I/O-bound task that relies on the network card to do the heavy lifting while the program can do something else in the meantime.
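Because the resize step has nothing to await, running it directly on the async worker threads would tie them up. One common pattern (a sketch of the general technique, not necessarily exactly what Flux does) is to hand CPU-bound work to Tokio's blocking threadpool with spawn_blocking. The resize_jpeg helper below is hypothetical:

```rust
use image::imageops::FilterType;
use std::io::Cursor;

// Hypothetical synchronous helper: decode, resize, re-encode to JPEG bytes.
fn resize_jpeg(bytes: &[u8]) -> Result<Vec<u8>, image::ImageError> {
    let img = image::load_from_memory(bytes)?;
    let resized = img.resize(256, 256, FilterType::Lanczos3);
    let mut out = Vec::new();
    resized.write_to(&mut Cursor::new(&mut out), image::ImageFormat::Jpeg)?;
    Ok(out)
}

// Run the CPU-bound helper on Tokio's blocking threadpool so the async
// worker threads stay free for I/O-bound tasks.
async fn resize_off_thread(bytes: Vec<u8>) -> Result<Vec<u8>, image::ImageError> {
    tokio::task::spawn_blocking(move || resize_jpeg(&bytes))
        .await
        .expect("resize task panicked")
}
```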

Here’s how it works. We have two bounded channels (supporting backpressure): one for downloads, one for processing. The download stage produces into the first channel. The process stage consumes from it, resizes, and produces into the second channel. The save stage consumes that and writes to disk.

Why not just save during the process stage?

We keep saving in a separate stage so download/resize can continue without waiting on disk I/O.

Why use mpsc channels?

mpsc stands for multiple-producer-single-consumer. It’s used because multiple download tasks feed the downloads channel, and only the process stage consumes from it. Similarly, multiple process tasks feed the process channel, and only the save stage consumes from it one by one.
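Putting the stages together, here's a minimal sketch of the channel wiring, reusing the hypothetical resize_off_thread helper from earlier (channel capacities, item types, and paths are illustrative, and the real pipeline fans out more producers per stage):

```rust
use tokio::sync::mpsc;

async fn run_streamed(urls: Vec<String>) {
    // Bounded channels: send() waits when a channel is full, which is the backpressure.
    let (dl_tx, mut dl_rx) = mpsc::channel::<(usize, Vec<u8>)>(16);
    let (proc_tx, mut proc_rx) = mpsc::channel::<(usize, Vec<u8>)>(16);

    // Download stage: fetch raw JPEG bytes and push them downstream.
    tokio::spawn(async move {
        for (i, url) in urls.into_iter().enumerate() {
            if let Ok(resp) = reqwest::get(&url).await {
                if let Ok(bytes) = resp.bytes().await {
                    let _ = dl_tx.send((i, bytes.to_vec())).await;
                }
            }
        }
        // dl_tx drops here, closing the channel so the process stage's loop ends.
    });

    // Process stage: decode, resize, re-encode, push downstream.
    tokio::spawn(async move {
        while let Some((i, bytes)) = dl_rx.recv().await {
            if let Ok(resized) = resize_off_thread(bytes).await {
                let _ = proc_tx.send((i, resized)).await;
            }
        }
    });

    // Save stage: consume finished images one by one and write them to disk.
    while let Some((i, bytes)) = proc_rx.recv().await {
        let _ = tokio::fs::write(format!("out/streamed_{i}.jpg"), &bytes).await;
    }
}
```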

We also don’t want to spawn a download or resize task for every incoming item. That can explode memory usage. The fix is simple: use semaphores to cap concurrent work.

What is a semaphore?

A semaphore is a gate with a fixed number of permits. Each task must acquire a permit before it can run. If no permits are available, the task waits. This caps in-flight work (active tasks in the pipeline) and keeps memory stable.

Why not just use the threadpool? Tokio's threadpool controls how many threads run simultaneously, not how many tasks exist. You can still spawn an unbounded number of tasks and switch between them, and every one of those tasks holds memory while it waits its turn. That doesn't solve the memory explosion problem.

Semaphores are not to be confused with locks. A lock protects shared data so that only one thread can access it at a time. A semaphore limits how many tasks can be in flight at once.
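A minimal sketch of the capping pattern with Tokio's Semaphore, reusing the hypothetical process_image helper (the permit count is illustrative):

```rust
use std::sync::Arc;
use tokio::sync::Semaphore;

async fn run_capped(urls: Vec<String>) {
    // At most 8 images may be in flight (downloading, decoding, resizing) at once.
    let permits = Arc::new(Semaphore::new(8));
    let mut handles = Vec::new();

    for (i, url) in urls.into_iter().enumerate() {
        // acquire_owned() waits here whenever all permits are taken,
        // so we never spawn more work than the cap allows.
        let permit = permits.clone().acquire_owned().await.unwrap();
        handles.push(tokio::spawn(async move {
            let _ = process_image(&url, &format!("out/capped_{i}.jpg")).await;
            drop(permit); // releasing the permit lets the next image start
        }));
    }

    futures::future::join_all(handles).await;
}
```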

Benchmarks

Approach    Images   Time       Peak Mem   Avg Download   Avg Resize   Throughput
naive       1000     14.17 min  91 MB      0.302 s        0.415 s      1.18 img/s
batched     1000     2.44 min   210 MB     0.548 s        0.500 s      6.82 img/s
streaming   1000     1.30 min   263 MB     0.289 s        0.769 s      12.85 img/s

Summary

Benchmark results

Streaming finished about 11x faster than the naive loop and nearly twice as fast as batching, at the cost of the highest peak memory; the naive loop used the least memory but had by far the worst throughput.

Memory Usage Tracking

This went through multiple iterations. I started by sampling total system memory, which wasn't useful. Next I sampled the memory used by the program itself via its process ID (PID), measuring before and after each stage. That wasn't useful either, since the true peak happens during decode/resize, not at the stage boundaries.

The current model is much more accurate. It spawns a monitor at the start of each tier and polls memory every 100 ms. The shared peak value lives in an Arc so it can be updated safely across threads.

Why Arc is required

The memory monitor runs in a spawned task that can outlive the function that created it, so the shared peak value must be owned and thread-safe. Arc (atomically reference counted) provides shared ownership across threads, and the atomic value allows concurrent updates without data races.

In plain terms, a data race is when two threads touch the same memory at the same time and at least one of them is writing. This can lead to unpredictable results.
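Here's a sketch of the monitoring pattern described above (current_rss_bytes is a hypothetical helper that would read this process's resident memory, e.g. via a crate like sysinfo):

```rust
use std::sync::Arc;
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::Duration;

// Spawn a monitor that polls the process's memory every 100 ms and records the peak.
fn spawn_memory_monitor() -> (Arc<AtomicU64>, tokio::task::JoinHandle<()>) {
    let peak = Arc::new(AtomicU64::new(0));
    let shared = Arc::clone(&peak);

    let handle = tokio::spawn(async move {
        let mut ticker = tokio::time::interval(Duration::from_millis(100));
        loop {
            ticker.tick().await;
            let rss = current_rss_bytes(); // hypothetical: read this PID's resident set size
            shared.fetch_max(rss, Ordering::Relaxed); // lock-free update of the shared peak
        }
    });

    (peak, handle)
}

// Usage: start the monitor before a tier, stop it after, then read the peak.
// let (peak, monitor) = spawn_memory_monitor();
// run_streamed(urls).await;
// monitor.abort();
// println!("peak: {} MB", peak.load(Ordering::Relaxed) / 1_000_000);
```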

Reflections

Streaming really does win on throughput, but only if you actively cap concurrency to keep memory stable. The key insight is that you can’t just throw more tasks at a problem–you need backpressure and bounded queues to prevent memory from exploding.

I hope this post inspires you to build your own experiments, benchmark them, and understand where your bottlenecks really come from. If you want to explore the code, you can find Flux on GitHub.