Squeezing performance out of a Raspberry Pi 5 with an Intel Optane for LLM usage

Playing further with the Intel Optane module on my Raspberry Pi 5, I was able to squeeze out even more performance. The recipe? Changing the NVMe HAT, forcing PCIe Gen 3, overclocking the CPU, and …

1. The Baseline

To establish a starting point, the first test was executed with the CPU at 2.4 GHz and the PCIe bus at its default PCIe Gen 2 speed.

Measured with hdparm, sequential reads topped out at 416 MB/s. Forcing the Qwen2.5 (7B) model through this setup resulted in severe page thrashing. The model had to be reread from the swap partition roughly 19.5 times, transferring a total of 87.04 GB.

1.1 Pushing the controller: PCIe Gen 2 to 3

By upgrading to a compatible PCIe Gen 3 Top-HAT (Geekworm X1001) and adding the required lines to /boot/firmware/config.txt (dtparam=pciex1_gen=3), the raw hdparm throughput jumped to 779.55 MB/s — the physical saturation point of a single Gen 3 lane on the Pi 5.
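For context, here is a small sketch that compares the two hdparm results with the raw line rate of a single PCIe lane. The signalling rates and encodings (5 GT/s with 8b/10b for Gen 2, 8 GT/s with 128b/130b for Gen 3) come from the PCIe specification. Treating hdparm's MB/s as decimal megabytes is an assumption on my side, and protocol overhead is ignored, so read the percentages as rough.

```python
# Raw payload rate of one PCIe lane, before packet/protocol overhead.
# Gen 2: 5 GT/s with 8b/10b encoding; Gen 3: 8 GT/s with 128b/130b encoding.
LANES = {
    "Gen 2": (5.0e9, 8 / 10),
    "Gen 3": (8.0e9, 128 / 130),
}
# hdparm sequential read results quoted in this post
MEASURED_MB_S = {"Gen 2": 416.0, "Gen 3": 779.55}


def lane_rate_mb_s(gen: str) -> float:
    """Line rate of a single lane in MB/s (assuming 1 MB = 10**6 bytes)."""
    transfers_per_s, encoding_efficiency = LANES[gen]
    return transfers_per_s * encoding_efficiency / 8 / 1e6


for gen, measured in MEASURED_MB_S.items():
    ceiling = lane_rate_mb_s(gen)
    print(f"{gen}: {ceiling:.0f} MB/s line rate, "
          f"{measured:.0f} MB/s measured ({measured / ceiling:.0%})")
```

Under these assumptions both generations land at roughly 80% of the raw lane rate, which would fit the picture of the link, not the Optane, being the limit for sequential reads.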

| Metric | Run A | Run B |
|---|---|---|
| CPU Clock | 2.4 GHz | 2.4 GHz |
| PCIe Generation | Gen 2 | Gen 3 |
| Total Duration | 320.74 s | 224.71 s |
| Token Throughput | 0.10 tps | 0.16 tps |
| Time to First Token (TTFT) | 65.81 s | 62.79 s |
| Effective I/O Bandwidth | 277.87 MB/s | 397.04 MB/s |
| NVMe Read Activity | 9,708 IOPS | 12,103 IOPS |
| Hardware Reaction Time | 0.93 ms | 0.55 ms |

Result: Unlocking the PCIe Gen 3 bus reduced the overall execution time by roughly 30%.
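The 30% figure follows directly from the two durations. The short sketch below redoes that arithmetic and also compares the speed-up with the gain in effective I/O bandwidth, using only the Run A and Run B numbers from the table. The helper name is just for illustration.

```python
def reduction_pct(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, in percent."""
    return (before - after) / before * 100


run_a_duration_s, run_b_duration_s = 320.74, 224.71
run_a_bandwidth, run_b_bandwidth = 277.87, 397.04  # MB/s

speedup = run_a_duration_s / run_b_duration_s
bandwidth_gain = run_b_bandwidth / run_a_bandwidth

print(f"duration reduced by {reduction_pct(run_a_duration_s, run_b_duration_s):.1f}%")
print(f"speed-up {speedup:.2f}x, bandwidth gain {bandwidth_gain:.2f}x")
```

The speed-up and the bandwidth gain come out almost identical (both about 1.43x), which is what you would expect if the run time is dominated by I/O.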

1.2 Pushing Compute: CPU Overclocking

With adequate active cooling in place, the Broadcom SoC was overclocked by 16.67% to 2.8 GHz without increasing the voltage. This run defines the baseline for all subsequent software optimization steps.

| Metric | Run B | Run C |
|---|---|---|
| CPU Clock | 2.4 GHz | 2.8 GHz |
| PCIe Generation | Gen 3 | Gen 3 |
| Total Duration | 224.71 s | 215.25 s |
| Token Throughput | 0.16 tps | 0.17 tps |
| Time to First Token (TTFT) | 62.79 s | 54.99 s |
| Effective I/O Bandwidth | 397.04 MB/s | 414.73 MB/s |
| NVMe Read Activity | 12,103 IOPS | 14,546 IOPS |
| Hardware Reaction Time | 0.55 ms | 0.54 ms |

Result: The higher CPU clock lowered the execution time by another 4% and reduced the TTFT to 54.99 seconds. The effective I/O bandwidth stabilized at 414.73 MB/s. Note that this ~415 MB/s is not a limitation of the PCIe Gen 3 bus itself, but rather the ceiling of the specific random-read pattern generated by the page faults under this workload.

2. Further Tuning: The dead end

With the hardware running at maximum, the next step involved trying to tune the kernel’s virtual memory subsystem.

2.1 Challenges

The most challenging aspect of this project was engineering a reliable method to test the swap subsystem and extract stable, reproducible metrics. Benchmarking tools like fio will not help much in this scenario. When you use fio with --direct=1, it utilizes O_DIRECT to bypass the Linux page cache completely. While this is great for measuring raw hardware performance, it completely fails to simulate how an application triggers page faults under heavy memory pressure. To capture the true behavior of an LLM thrashing a tight memory space, you have to force actual kernel page evictions under a strict physical memory lock.

So… I used a mix of Bash and Python scripts which…

  • clear the page cache,
  • re-create and re-mount the swap on the Optane,
  • re-start Ollama,
  • lock 2.5 GB of RAM,
  • run an LLM (Qwen2.5:7b).
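To give an idea of what the "lock RAM" step can look like, here is a minimal sketch in Python. It is not the actual script from my test harness: the function name `hold_memory` is made up for this example, and it assumes a Linux system where the process is allowed to `mlock` that much memory (root, or a raised `RLIMIT_MEMLOCK`).

```python
import ctypes
import ctypes.util
import mmap
import os


def hold_memory(n_bytes: int, lock: bool = True) -> mmap.mmap:
    """Allocate n_bytes of anonymous memory, touch every page, optionally mlock it."""
    buf = mmap.mmap(-1, n_bytes)  # anonymous mapping, not backed by a file
    for offset in range(0, n_bytes, mmap.PAGESIZE):
        buf[offset] = 1  # writing forces the kernel to actually allocate the page
    if lock:
        libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
        address = ctypes.addressof(ctypes.c_char.from_buffer(buf))
        # mlock() pins the range in physical RAM so it cannot be swapped out
        if libc.mlock(ctypes.c_void_p(address), ctypes.c_size_t(n_bytes)) != 0:
            err = ctypes.get_errno()
            raise OSError(err, os.strerror(err))
    return buf


# Hypothetical usage: pin 2.5 GiB and keep the process alive during the run.
# ballast = hold_memory(int(2.5 * 1024**3))
```

As long as the returned object stays alive, that memory is unavailable to everything else, so the model has to fit into whatever RAM is left and the kernel is forced to evict pages.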

For the initial tuning steps, I forced Ollama to use `use_mmap: False` to ensure the model weights were explicitly hitting the anonymous memory space and forcing swap activity on the Optane. However, kernel-level swap tuning (zswap, zram, or tweaking vm.page-cluster) yielded no significant performance gains. In fact, increasing `vm.page-cluster` to 5 inflated the hardware reaction time to 0.7 ms due to unnecessary overhead, while values from 0 to 3 performed virtually identically.
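For reference, this is roughly how the mmap flag can be passed per request through Ollama's REST API. Take it as a sketch under assumptions: it uses the default local endpoint, the prompt is up to you, and the helper names are invented for this example. `use_mmap` and `num_predict` are regular entries in the `options` object, and `eval_count` / `eval_duration` (in nanoseconds) are fields of the non-streaming response.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_request(prompt: str, use_mmap: bool, num_predict: int = 128) -> dict:
    """Build a non-streaming generate request with the mmap flag set explicitly."""
    return {
        "model": "qwen2.5:7b",
        "prompt": prompt,
        "stream": False,
        "options": {"use_mmap": use_mmap, "num_predict": num_predict},
    }


def run(payload: dict) -> dict:
    """Send the request to a running Ollama instance and return the parsed JSON."""
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def tokens_per_second(result: dict) -> float:
    """Generation speed: eval_count tokens over eval_duration nanoseconds."""
    return result["eval_count"] / result["eval_duration"] * 1e9
```

A call like `tokens_per_second(run(build_request("...", use_mmap=False)))` then gives the generation speed for one run. The same option can also be set in a Modelfile, so this is only one of several ways to do it.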

| Metric | Run C | Run D | Run G |
|---|---|---|---|
| CPU Clock | 2.8 GHz | 2.8 GHz | 2.8 GHz |
| PCIe Generation | Gen 3 | Gen 3 | Gen 3 |
| Ollama mmap-Flag | False | False | True |
| Tweaks | Standard | vm.page-cluster = 3 | Standard |
| Total Duration | 215.25 s | 213.66 s | 211.05 s |
| Token Throughput | 0.17 tps | 0.17 tps | 0.17 tps |
| Time to First Token (TTFT) | 54.99 s | 57.36 s | 54.91 s |
| Effective I/O Bandwidth | 414.73 MB/s | 410.42 MB/s | 414.93 MB/s |
| NVMe Read Activity | 14,546 IOPS | 12,436 IOPS | 26,556 IOPS |
| Hardware Reaction Time | 0.54 ms | 0.42 ms | 0.07 ms |

Result: The real breakthrough occurred in Run G by switching back to `use_mmap: True`. The hardware response time collapsed to an incredible 0.07 ms. Architecturally, this makes perfect sense: memory mapping files makes the pages file-backed rather than anonymous. Under memory pressure, the kernel simply discards clean mmap-pages and streams them back via native filesystem readahead, bypassing the entire swap subsystem (and rendering parameters like vm.page-cluster obsolete for the model weights). Concurrently, this unhindered asynchronous queue pushed NVMe read activity to 26,556 IOPS.
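The file-backed behaviour is easy to reproduce in miniature. The sketch below maps a file read-only and reads one byte per page, which is the same access pattern in principle: the first touch of a non-resident page triggers a page fault that the kernel serves from the file, and because the pages are clean it can drop them again at any time without writing them to swap. The temporary file is only a stand-in for a model file, and the function names are made up for this example.

```python
import mmap
import os
import tempfile


def write_stand_in(n_pages: int) -> str:
    """Create a temporary file of n_pages pages; a stand-in for a model file."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(os.urandom(n_pages * mmap.PAGESIZE))
    return tmp.name


def touch_pages(path: str) -> int:
    """Map a file read-only and read one byte per page; returns pages touched."""
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as view:
        touched = 0
        for offset in range(0, len(view), mmap.PAGESIZE):
            _ = view[offset]  # first access to a non-resident page faults it in from the file
            touched += 1
        return touched


path = write_stand_in(8)
print(touch_pages(path))  # 8
os.unlink(path)
```

Nothing in this path involves the swap device, which is why swap-related sysctls stop mattering for the weights once mmap is enabled.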

I am somewhat disappointed that, even though I tested several sysctl and kernel parameters, I did not find much remaining potential to improve this on a Raspberry Pi 5. I also evaluated NVMe block-layer polling with promising initial results, but it introduces a sharp trade-off: you have to isolate an entire CPU core for polling, which forces Ollama down to 3 cores and ultimately yields the same performance as running Ollama on 4 cores without polling. At this point I am clearly fighting the hardware limits (the single PCIe lane).

3. Benchmark with alternative Models

To see how this setup performs, the final phase moved away from the baseline script (which used a strict 128-token cap to isolate hardware limits) to a real-world workload.

For this benchmark, the models were fed a complex text-analysis and structured reasoning prompt. The task required the models to synthesize unstructured news data and output a structured analysis including a confidence score and a telegraphic reasoning block. Because these runs involve dynamic token generation lengths and multi-step reasoning, the processing times and tokens-per-second naturally shift compared to the hardware-bound baseline.

Here is how different LLM architectures handle this heavy reasoning pipeline on the optimized Pi 5 + Optane stack:

| Model | Total Time | Tokens (In/Out) | Speed (t/s) |
|---|---|---|---|
| deepseek-r1:1.5b | 70.01s (1:10 min) | 626 / 377 | 9.57 |
| gemma2:2b | 62.34s (1:02 min) | 642 / 55 | 5.16 |
| qwen2.5:3b | 78.19s (1:18 min) | 652 / 53 | 4.49 |
| qwen2.5:7b-instruct-q3_K_M | 208.01s (3:28 min) | 652 / 54 | 1.76 |
| phi4-mini-reasoning:3.8b | 103.01s (1:43 min) | 611 / 30 | 1.04 |
| llama3.1:8b-instruct-q3_K_M | 362.90s (6:03 min) | 608 / 63 | 0.37 |
| qwen2.5:7b (Base) | 367.77s (6:08 min) | 652 / 49 | 0.26 |
| llama3.1:8b (Base) | 551.92s (9:12 min) | 608 / 57 | 0.15 |
| gemma2:9b | 1884.27s (31:24 min) | 642 / 61 | 0.04 |
| qwen2.5:14b | 6060.96s (101:01 min) | 652 / 69 | 0.01 |
| phi4 | 8150.56s (135:51 min) | 608 / 89 | 0.01 |
| ministral-3:14b | 24416.55s (406:57 min) | 1199 / 313 | 0.01 |

4. Conclusion

Optimizing the hardware and kernel parameters reduced the total benchmark duration from 320.74 seconds to 211.05 seconds — a 34% reduction in execution time.

The token throughput of the 7B model flatlines at 0.17 tps because of a clear physical bottleneck: processing a single token requires the CPU to cycle through the entire ~4.3 GB of weights. At a real-world effective throughput limit of ~415 MB/s under page-fault conditions, the pipeline cannot be fed any faster.
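A quick back-of-envelope check of that bottleneck, using only the numbers from this post (about 4.3 GB of weights, about 415 MB/s effective bandwidth, 0.17 tps measured). The interpretation in the comments is my reading of the arithmetic and not something I measured separately.

```python
WEIGHTS_MB = 4300.0     # ~4.3 GB of model weights
BANDWIDTH_MB_S = 415.0  # effective read bandwidth under page-fault conditions
MEASURED_TPS = 0.17     # token throughput of Runs C, D and G

# If every token had to stream all weights from the Optane:
seconds_per_full_pass = WEIGHTS_MB / BANDWIDTH_MB_S
tps_if_all_from_disk = 1 / seconds_per_full_pass

# Conversely: how much data the measured rate allows per token at that bandwidth.
mb_read_per_token = BANDWIDTH_MB_S / MEASURED_TPS

print(f"full pass: {seconds_per_full_pass:.1f} s -> {tps_if_all_from_disk:.3f} tps")
print(f"measured {MEASURED_TPS} tps allows about {mb_read_per_token:.0f} MB per token")
```

A full pass from disk would take a bit over 10 seconds, or just under 0.10 tps. The measured 0.17 tps corresponds to roughly 2.4 GB read per token, which would be consistent with part of the weights staying resident in the RAM that is not locked. Either way, the throughput scales with the I/O bandwidth and not with the CPU clock.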

However, dropping the kernel latency to 70 microseconds via memory mapping significantly improves system responsiveness under full load. By shifting the architecture to use_mmap: True, the kernel fetches only the strictly required blocks with minimal overhead directly via the file cache, allowing the Intel Optane to operate efficiently within its ultra-low-latency design parameters.
