Playing with an Intel Optane on a Raspberry Pi 5

Checking out an Intel Optane module was always on my bucket list, but I never really had a solid use case for it. Recently, instead of buying a new Raspberry Pi with more physical RAM, I thought: why not give high-speed, ultra-low-latency swap a try?

To give you an idea of how everything is set up, here is an overview of the hardware and connectivity on my Raspberry Pi 5 (4 GB RAM). Keep in mind, this Pi is a multi-purpose machine and not a dedicated, beefy LLM server:

  • 64 GB MicroSD Card (System OS)
  • Intel SSDSC2BB150G7 (Intel SSD DC S3520 Enterprise SATA)
  • Intel SSDPEKKW256G8 (Intel SSD 760p Series M.2 NVMe – I run 4 of these in my storage pool)
  • Intel MEMPEK1W016GA (Intel Optane Memory 16 GB M.2)

Connectivity and Bus Mathematics

The way these drives are wired up matters immensely:

  1. The Intel DC SSD is housed in an external SATA-to-USB case, connected directly to the Pi’s USB 3.0 Port 1.
  2. The Intel 760p M.2 SSDs sit inside an M.2 USB enclosure connected via a USB hub to USB 3.0 Port 2.
  3. The Intel Optane is connected directly via a Geekworm X1005 Dual-HAT on the Pi’s PCIe interface (limited to PCIe Gen 2.0 speeds due to the onboard ASMedia PCIe switch).

Because of the USB 3.0 specification, each USB port is capped at a raw signaling rate of 5 Gbit/s. If you do the math:

$$(5 \times 1000) \div 8 = 625 \text{ MB/s}$$

However, due to the mandatory 8b/10b line encoding, you automatically lose 20% of that bandwidth to encoding overhead (it is there for clock recovery and DC balance), cutting the theoretical limit down to:

$$625 \times 0.8 = 500 \text{ MB/s}$$

Once you factor in protocol headers, the xHCI software stack, and driver overhead, you will rarely see more than roughly 430 to 450 MB/s in the real world over USB 3.0.
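
To make the arithmetic above easy to replay, here is a minimal Python sketch. The function name payload_mb_per_s is made up for this post, and it only models the line encoding, not protocol headers or driver overhead. The same calculation applies to any 5 Gbit/s link that uses 8b/10b encoding.

```python
def payload_mb_per_s(signaling_gbit_per_s, data_bits=8, line_bits=10):
    """Usable bandwidth in MB/s after line encoding, before protocol overhead."""
    raw_mb_per_s = signaling_gbit_per_s * 1000 / 8   # Gbit/s -> MB/s
    return raw_mb_per_s * data_bits / line_bits       # 8b/10b keeps 8 of every 10 bits


usb3_ceiling = payload_mb_per_s(5.0)
print(f"5 Gbit/s link with 8b/10b: {usb3_ceiling:.0f} MB/s before protocol overhead")
```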

While the maximum sequential bandwidth of the PCIe Gen 2.0 x1 bus is mathematically identical (it also runs at 5 GT/s with 8b/10b encoding and therefore also caps out at 500 MB/s), PCIe completely shifts the playing field when it comes to latency and parallelism:

  • Bus Transit Latency: ~20 µs for USB 3.0 vs. ~1.5 µs for PCIe (direct hardware MMIO access without a bridge-chip translator).
  • Queue Depth (IOPS): At a strict Queue Depth of 1, PCIe can handle up to 13 times more individual commands per second than USB 3.0 before the protocol stack chokes.
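
The "13 times" figure lines up with the ratio of the two transit latencies. The sketch below assumes that this is where the number comes from: at Queue Depth 1 a new command can only start once the previous one has returned, so the round-trip time sets the ceiling. It only looks at the bus transit part and ignores the time the drive itself needs, so the absolute numbers are upper bounds, not predictions.

```python
def qd1_iops_ceiling(latency_us):
    """Upper bound on commands per second when only one command is in flight."""
    return 1_000_000 / latency_us


usb_ceiling = qd1_iops_ceiling(20.0)    # ~20 µs bus transit over USB 3.0
pcie_ceiling = qd1_iops_ceiling(1.5)    # ~1.5 µs bus transit over PCIe
ratio = pcie_ceiling / usb_ceiling

print(f"USB 3.0: {usb_ceiling:,.0f} commands/s at most")
print(f"PCIe:    {pcie_ceiling:,.0f} commands/s at most")
print(f"Ratio:   {ratio:.1f}x")
```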

Phase 1: Sequential Bandwidth (hdparm)

Let’s look at a simple baseline test using hdparm to measure raw sequential read speeds:

root@mayu:~# hdparm -tT --direct /dev/mmcblk0
/dev/mmcblk0:
 Timing O_DIRECT cached reads:   136 MB in 2.01 seconds =  67.52 MB/sec
 Timing O_DIRECT disk reads:     266 MB in 3.02 seconds =  88.15 MB/sec

root@mayu:~# hdparm -tT --direct /dev/sd[abcde]
/dev/sda: (Intel DC SSD)
 Timing O_DIRECT cached reads:   436 MB in 2.01 seconds = 216.99 MB/sec
 Timing O_DIRECT disk reads:     514 MB in 3.00 seconds = 171.30 MB/sec

/dev/sdb: (Consumer M.2 NVMe via USB)
 Timing O_DIRECT cached reads:   682 MB in 2.00 seconds = 340.60 MB/sec
 Timing O_DIRECT disk reads:    1040 MB in 3.00 seconds = 346.53 MB/sec
# (...sdc, sdd, sde show virtually identical values around 346 MB/s due to the USB limits)

root@mayu:~# hdparm -tT --direct /dev/nvme0n1
/dev/nvme0n1: (Intel Optane via PCIe HAT)
 Timing O_DIRECT cached reads:   820 MB in 2.00 seconds = 409.13 MB/sec
 Timing O_DIRECT disk reads:    1232 MB in 3.00 seconds = 410.22 MB/sec

While this simple sequential test shows that the Optane leads the pack at 410 MB/s, it doesn’t look like a total revolution compared to the 346 MB/s of the USB drives. But sequential testing is completely irrelevant for virtual memory. Swap workloads are random, chaotic, and heavily block-based.

In case you always wanted to know why people keep telling you to go for more memory instead:

root@mayu:~# hdparm -tT --direct /dev/ram0

/dev/ram0:
Timing O_DIRECT cached reads: 35642 MB in 1.99 seconds = 17890.54 MB/sec
Timing O_DIRECT disk reads: 4 MB in 0.00 seconds = 2906.98 MB/sec

# remember this is the lz4 zram swap device
root@mayu:~# hdparm -tT --direct /dev/zram0

/dev/zram0:
Timing O_DIRECT cached reads: 8618 MB in 2.00 seconds = 4312.43 MB/sec
Timing O_DIRECT disk reads: 2022 MB in 0.75 seconds = 2692.08 MB/sec

Phase 2: The Real-World Swap Benchmark (fio)

To simulate actual swap behavior, we need to look at a 16K Random Mixed Workload (70% Read / 30% Write). Why 16K? While x86 systems typically use 4K memory pages, the Raspberry Pi 5 kernel operates with a native 16K page size. You can verify this on your system using the getconf PAGESIZE command.
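
If you would rather read the page size from a script than from the shell, Python exposes the same value through os.sysconf. This is just a small convenience sketch; the variable fio_bs is a name I made up to show how the result maps onto fio's --bs option.

```python
import os

# Same value that `getconf PAGESIZE` prints, in bytes.
page_size = os.sysconf("SC_PAGE_SIZE")

# Block size string in the form fio's --bs option accepts, e.g. "16k" or "4k".
fio_bs = f"{page_size // 1024}k"

print(f"Kernel page size: {page_size} bytes -> use --bs={fio_bs}")
```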

Here is the exact fio command used across all drives (executed inside an active filesystem mount point to protect existing data):

fio --name=swap_simulation \
    --ioengine=libaio \
    --direct=1 \
    --rw=randrw \
    --rwmixread=70 \
    --bs=16k \
    --numjobs=1 \
    --iodepth=4 \
    --size=512M \
    --runtime=60 \
    --time_based \
    --filename=/YOUR/MOUNT/POINT/test.fio

(Note: --direct=1 is critical here to bypass the Linux page cache, forcing the I/O operations directly onto the physical device).
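
The command above prints a human-readable report. If you append --output-format=json and --output=result.json (both are regular fio options; the file name is only an example), the numbers can be pulled out with a few lines of Python. This is a sketch that assumes the JSON layout of recent fio 3.x releases, where each job carries "read" and "write" sections with "iops", "bw_bytes" and a "clat_ns" block. Check the field names against your own fio version before relying on it.

```python
import json


def summarize_fio(report):
    """Extract IOPS, MiB/s and completion latency (µs) for the first job."""
    job = report["jobs"][0]
    summary = {}
    for direction in ("read", "write"):
        stats = job[direction]
        clat = stats["clat_ns"]                      # completion latency in ns
        summary[direction] = {
            "iops": stats["iops"],
            "mib_per_s": stats["bw_bytes"] / 2**20,  # bytes/s -> MiB/s
            "avg_lat_us": clat["mean"] / 1000,
            "p99_lat_us": clat["percentile"]["99.000000"] / 1000,
        }
    return summary


def load_report(path):
    """Read a fio JSON report from disk and summarize it."""
    with open(path) as fh:
        return summarize_fio(json.load(fh))
```

Calling load_report("result.json") then returns one small dictionary per direction, which is enough to fill in a table like the one in this post.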

| Storage Medium | Read IOPS / Bandwidth | Write IOPS / Bandwidth | Read Latency (Avg / 99th) | Write Latency (Avg / 99th) |
|---|---|---|---|---|
| Samsung EVO SD Card | 496 / 7.75 MiB/s | 211 / 3.31 MiB/s | 3.43 ms / 21.1 ms | 6.52 ms / 60.0 ms |
| Consumer M.2 NVMe (USB) | 3,618 / 56.5 MiB/s | 1,556 / 24.3 MiB/s | 591 µs / 750 µs | 553 µs / 979 µs |
| Intel DC Enterprise (USB) | 5,006 / 78.2 MiB/s | 2,150 / 33.6 MiB/s | 704 µs / 2.41 ms | 171 µs / 302 µs |
| Intel Optane (PCIe Gen2 x1) | 11,700 / 182.0 MiB/s | 5,005 / 78.2 MiB/s | 217 µs / 429 µs | 278 µs / 490 µs |
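
IOPS and bandwidth in these results are two views of the same measurement, because every request moves exactly one 16 KiB block. A quick sketch to cross-check the read column (the helper name is mine; small deviations are expected where fio rounds large IOPS values, as with the Optane's 11,700):

```python
BLOCK_KIB = 16  # matches --bs=16k in the fio command


def iops_to_mib_per_s(iops, block_kib=BLOCK_KIB):
    """Bandwidth in MiB/s for a given IOPS figure at a fixed block size."""
    return iops * block_kib / 1024


read_iops = {
    "Samsung EVO SD Card": 496,
    "Consumer M.2 NVMe (USB)": 3618,
    "Intel DC Enterprise (USB)": 5006,
    "Intel Optane (PCIe Gen2 x1)": 11700,
}

for name, iops in read_iops.items():
    print(f"{name}: {iops_to_mib_per_s(iops):.2f} MiB/s")
```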

1. The Total Capitulation of the SD Card (mmcblk0)

Now we have measured evidence of why systems utterly stall when swapping onto SD cards. We are looking at a meager 496 Read IOPS and a 99th percentile read latency of 21.1 milliseconds (and worse, 60 ms for writes!). Every time the kernel has to fetch a memory page from swap here, the faulting process hits a brick wall and the CPU stands idle for an eternity. This is the definitive path to a total system freeze.

2. The USB Bottleneck and the Intel DC Phenomenon

Both the consumer M.2 and the Intel DC drive are visibly choking on the USB latency bottleneck described earlier. However, take a closer look at the write latency of the Intel DC: it sits at a phenomenal average of just 171 µs (compared to 553 µs on the consumer M.2), with a 99th percentile of only 302 µs. This is where enterprise architecture relentlessly flexes its muscles. The most likely explanation is the combination of large, highly optimized controller cache structures and Power-Loss Protection (PLP), which lets the DC’s enterprise controller acknowledge write operations almost instantly, while the consumer drive is still scrambling internally. Conversely, the DC’s higher read latency (704 µs) suggests that its older SATA legacy design lacks the raw, parallel read capabilities of modern NVMe drives when forced through a USB bridge.

3. The Intel Optane Utterly Pulverizes the Field

The performance of the Optane module is nothing short of spectacular.

  • The IOPS King: Pounding out 11,700 Read IOPS, it delivers more than double the operations of our enterprise SSD over USB and a staggering 23x performance boost over the SD card.
  • Bus Saturation: If we combine the total throughput, we get 182.0 MiB/s (Read) + 78.2 MiB/s (Write) = 260.2 MiB/s (about 273 MB/s) of net bandwidth. Keep in mind, this is under a brutal, mixed random 16K workload! The theoretical ceiling of a PCIe Gen2 x1 link is 500 MB/s, and in practice it tops out at around 400–430 MB/s for clean, sequential data. Reaching roughly 260 MiB/s under random mixed traffic therefore suggests that we are hitting the processing bounds of the Optane’s memory controller or the Pi’s interrupt handling capabilities at low queue depths, well before the raw link bandwidth behind the ASMedia switch is saturated.
  • Ice-Cold Latency Consistency: The single most critical metric for handling LLM swaps is the 99th percentile read latency. With the Optane, 99% of all read requests complete within 429 microseconds, and 99% of all writes within 490 microseconds. Up to the 99th percentile, millisecond-range latency spikes simply do not show up.
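
One way to sanity-check the "latency-bound, not bandwidth-bound" reading is Little's law: with a fixed number of requests in flight, throughput is roughly the queue depth divided by the average latency. That law is my addition here, not something fio reports, and the sketch simply plugs in the Optane row of the table together with --iodepth=4.

```python
IODEPTH = 4                                # from the fio command line
read_iops, write_iops = 11_700, 5_005      # Optane row of the results table
read_lat_us, write_lat_us = 217.0, 278.0   # average latencies from the same row

total_iops = read_iops + write_iops
# Average latency over all requests, weighted by how many of each kind ran.
mean_lat_us = (read_iops * read_lat_us + write_iops * write_lat_us) / total_iops
predicted_iops = IODEPTH / (mean_lat_us * 1e-6)

combined_mb_per_s = (182.0 + 78.2) * 2**20 / 1e6   # MiB/s -> MB/s
link_share = combined_mb_per_s / 500.0             # vs. the 500 MB/s encoded ceiling

print(f"Measured:  {total_iops:,} IOPS")
print(f"Predicted: {predicted_iops:,.0f} IOPS from queue depth / latency")
print(f"Link load: {link_share:.0%} of the PCIe Gen2 x1 ceiling")
```

The prediction lands within about 2% of the measured total, while the link is only a little over half full. That fits the picture of a run limited by per-request latency at queue depth 4 and not by the bus.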

And if you’re wondering how RAM and ZRAM would have performed here on the Raspberry Pi 5:

| Storage Medium | Read IOPS | Read Bandwidth | Write IOPS | Write Bandwidth | Latency (Average / 99th Percentile) |
|---|---|---|---|---|---|
| Pure RAM (Baseline) | 212,000 | 3316 MiB/s (~3477 MB/s) | 91,000 | 1421 MiB/s (~1490 MB/s) | 3.07 µs / 4.32 µs |
| ZRAM (Simulation) | 66,300 | 1036 MiB/s (~1087 MB/s) | 28,400 | 444 MiB/s (~466 MB/s) | 32.07 µs / 61.00 µs |

Raspberry Pi 5 LLM Swap Benchmark: Intel Optane (PCIe) vs. Legacy USB Array

For the following benchmark, 2.8 GB of physical RAM was locked using mlockall, forcing Ollama to swap during token generation. “Legacy USB Array” refers to swap placed on the USB-attached SSDs, “Intel Optane (PCIe)” to swap placed on the Optane module.

Benchmark Results

| Model | Metric | Legacy USB Array | Intel Optane (PCIe) | Performance Impact / Speedup |
|---|---|---|---|---|
| DeepSeek-R1 (1.5B), Size: ~1.1 GB | Total Time | 161.13 sec | 130.24 sec | Optane is 24% faster on initial load. Identical execution speed as the model still fits into the remaining physical RAM. |
| | Tokens/sec | 8.33 tps | 8.28 tps | |
| | Avg CPU | 89.91% | 90.53% | |
| | Avg I/O | 3.63% | 4.56% | |
| Gemma2 (2B), Size: ~1.6 GB | Total Time | 3284.85 sec (~55 min) | 678.92 sec (~11 min) | Optane achieves a 4.8x total runtime speedup and delivers 4.6x more tokens per second under moderate swap pressure. |
| | Tokens/sec | 0.078 tps | 0.355 tps | |
| | Avg CPU | 14.56% | 23.38% | |
| | Avg I/O | 71.33% | 55.70% | |
| Phi4-Mini (3.8B), Size: ~2.2 GB | Total Time | 8164.01 sec (~2.27 hours) | 1433.03 sec (~24 min) | Optane achieves a massive 5.7x total runtime speedup. USB enters severe thrashing, rendering the system practically dead. |
| | Tokens/sec | 0.038 tps | 0.055 tps | |
| | Avg CPU | 10.57% | 10.29% | |
| | Avg I/O | 77.73% | 73.16% | |
| Qwen2.5 (3B), Size: ~1.9 GB | Total Time | 5985.16 sec (~1.66 hours) | 1744.43 sec (~29 min) | Optane is 3.4x faster in runtime and yields 3.6x higher token throughput during sustained matrix calculations. |
| | Tokens/sec | 0.055 tps | 0.199 tps | |
| | Avg CPU | 14.77% | 20.48% | |
| | Avg I/O | 75.62% | 58.56% | |

Key Takeaways

For the smallest model (DeepSeek-R1 1.5B), the token generation throughput (tps) is virtually identical between both setups (~8.3 tps). This happens because the model still fits comfortably into the remaining unallocated physical RAM. The only metric where the Intel Optane flexes its muscles here is the Total Time, loading the entire model from cold storage into memory 24% faster than the USB array.

The moment the model size scales past the available physical memory (Gemma2, Qwen2.5, and Phi4-Mini), the Legacy USB architecture hits a concrete wall. At a miserable 0.038 to 0.078 tokens per second, the USB array requires up to 26 seconds to generate a single token.
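
Tokens per second are hard to picture at these rates, so it helps to flip the number around. A tiny sketch using the USB values from the results table (the helper name is made up):

```python
def seconds_per_token(tps):
    """Wall-clock seconds needed to produce one token at a given rate."""
    return 1 / tps


usb_tps = {
    "Gemma2 (2B)": 0.078,
    "Qwen2.5 (3B)": 0.055,
    "Phi4-Mini (3.8B)": 0.038,
}

for model, tps in usb_tps.items():
    print(f"{model}: {seconds_per_token(tps):.1f} s per token on the USB array")
```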

The data reveals a fascinating performance curve when looking at the larger models: For Gemma2 (2B) and Qwen2.5 (3B), the Intel Optane achieves a massive 4.6x and 3.6x speedup in token throughput. This is because LLM token generation (the Decode Phase) requires the CPU to loop through the entire model weights for every single token. Since parts of the model reside in swap, the Pi is forced into a continuous rolling read-loop. The Optane’s low latency keeps the pipeline moving, while the USB bridge stalls instantly.

However, when scaling up to the largest model, Phi4-Mini (3.8B), the token generation speedup on the Optane drops to a modest 1.4x (0.055 tps vs 0.038 tps), even though the Total Runtime is still 5.7x faster. This indicates that we are hitting a hard architectural bottleneck. Phi4-Mini leaves more than 1 GB of data inside the swap partition. Since the Optane’s mixed 16K random read speed tops out at 182 MiB/s, streaming a full gigabyte of random blocks for every single token creates an I/O wall on both storage media. The massive 5.7x gain in total time appears to be carried largely by the initial Prefill Phase (processing the prompt while paging through swap), where the Optane simply obliterates the USB array.
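
The speedup figures and the I/O wall can be replayed from the numbers in the results table. The sketch below makes two simplifying assumptions of mine: "more than 1 GB" in swap is treated as exactly 1 GiB, and every swapped page is read once per token at the 182 MiB/s the Optane reached in the fio run. Treat the result as a floor for the time per token, not as a model of what Ollama actually does.

```python
# (total seconds, tokens per second) per setup, taken from the results table
results = {
    "DeepSeek-R1 (1.5B)": {"usb": (161.13, 8.33), "optane": (130.24, 8.28)},
    "Gemma2 (2B)": {"usb": (3284.85, 0.078), "optane": (678.92, 0.355)},
    "Phi4-Mini (3.8B)": {"usb": (8164.01, 0.038), "optane": (1433.03, 0.055)},
    "Qwen2.5 (3B)": {"usb": (5985.16, 0.055), "optane": (1744.43, 0.199)},
}


def speedups(usb, optane):
    """Return (runtime speedup, token throughput speedup) of Optane over USB."""
    usb_seconds, usb_tps = usb
    optane_seconds, optane_tps = optane
    return usb_seconds / optane_seconds, optane_tps / usb_tps


for model, runs in results.items():
    runtime_x, tps_x = speedups(runs["usb"], runs["optane"])
    print(f"{model}: runtime {runtime_x:.1f}x, tokens/sec {tps_x:.1f}x")

# Floor for Phi4-Mini on the Optane: 1 GiB from swap per token (assumption).
SWAP_RESIDENT_MIB = 1024
RANDOM_READ_MIB_PER_S = 182.0
floor_s_per_token = SWAP_RESIDENT_MIB / RANDOM_READ_MIB_PER_S
observed_s_per_token = 1 / results["Phi4-Mini (3.8B)"]["optane"][1]

print(f"I/O floor: {floor_s_per_token:.1f} s per token")
print(f"Observed:  {observed_s_per_token:.1f} s per token")
```

The observed time per token sits well above that floor. One plausible reason is that page faults are served one at a time, so the effective queue depth during inference is lower than the 4 used in the fio run.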

While the single-board computer handles memory in 16 KB pages, executing LLM inference across a swap barrier triggers a continuous stream of synchronous, non-linear page faults. Because LLM weights are static and merely read from disk, these memory pages remain “clean” in RAM. When the kernel needs to free up memory for the next layer, it can simply drop these clean pages instead of writing them back to disk. Thus, the observed 77.7% I/O-wait on the USB array is not caused by heavy write-amplification cycles, but rather by the sheer protocol overhead of the xHCI host controller layer and the SCSI command translation performed by the USB bridge (UAS/BOT). When bombarded with thousands of synchronous 16 KB read requests, the bridge controller’s single hardware queue stalls, effectively starving the Pi’s CPU cores of data.

This architectural choke-point is clearly visible in the CPU metrics:

  • On the USB Array, CPU utilization for Phi4-Mini plummets to a mere 10.6%, while I/O wait (avg_io) spikes to 77.7%. The four fast ARM Cortex-A76 cores of the Pi 5 are literally starving to death, spending nearly 80% of their time waiting for block completions over the USB hub.
  • On the Intel Optane, CPU utilization is noticeably higher for Gemma2 (23.4% vs. 14.6%) and Qwen2.5 (20.5% vs. 14.8%), and I/O wait drops to 55–59%. Only Phi4-Mini stays at around 10% CPU on both setups, which matches the I/O wall described above. Thanks to its unique 3D XPoint architecture, the Optane offers device-level access latencies on the order of 10 µs and signals completions through direct hardware interrupts. It feeds data segments into the PCIe lane fast enough to keep the processor actually processing, rather than waiting on protocol translation overhead.
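
This post does not go into how the Avg CPU and Avg I/O columns were collected. If you want comparable numbers on your own Pi, one common approach is to diff two snapshots of the aggregate "cpu" line in /proc/stat. The sketch below assumes the usual Linux field order (user, nice, system, idle, iowait, irq, softirq, steal); the function names are made up.

```python
def cpu_shares(before, after):
    """Busy and iowait percentages between two 'cpu ...' lines from /proc/stat."""

    def fields(line):
        parts = line.split()
        if parts[0] != "cpu":
            raise ValueError("expected the aggregate 'cpu' line")
        # user nice system idle iowait irq softirq steal (guest is part of user)
        return [int(value) for value in parts[1:9]]

    delta = [b - a for a, b in zip(fields(before), fields(after))]
    total = sum(delta)
    if total <= 0:
        raise ValueError("snapshots must be taken at different times")
    idle, iowait = delta[3], delta[4]
    busy = total - idle - iowait
    return {"busy_pct": 100 * busy / total, "iowait_pct": 100 * iowait / total}


def read_cpu_line(path="/proc/stat"):
    """Return the first line of /proc/stat, which holds the aggregate counters."""
    with open(path) as fh:
        return fh.readline()
```

Take one snapshot with read_cpu_line() before the run and one after it, then pass both to cpu_shares().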

Summary

You cannot solve a local LLM hardware bottleneck by simply throwing “enough generic storage” at a small single-board computer.

If your model sizes exceed your onboard RAM, raw sequential megabytes per second on paper mean very little. Instead, ultra-low access latency, hardware-level queue concurrency, and high IOPS under mixed read/write workloads may be the key to success.

By utilizing the native PCIe interface of the Raspberry Pi 5 combined with a low-latency architecture like Intel’s Optane, you bypass the USB protocol overhead and avoid the severe queue starvation and high random-read latencies of standard NAND flash under swap pressure. This architectural advantage allows you to virtualize system memory well beyond 4 GB while maintaining a benchmark-stable environment that outperforms generic USB setups by up to 5.7x in total runtime.

And yes, using the SD card for swap is a bad idea – but you knew that already, didn’t you?

Sources / See Also

  • axboe/fio. Flexible I/O Tester official repository and documentation. https://github.com/axboe/fio
  • Wikipedia. 3D XPoint non-volatile memory technology architecture. https://en.wikipedia.org/wiki/3D_XPoint
  • Wikipedia. USB Attached SCSI Protocol (UASP) specs and limitations. https://en.wikipedia.org/wiki/USB_Attached_SCSI
