Squeezing performance out of a Raspberry Pi 5 with an Intel Optane for LLM usage

Playing further with the Intel Optane module on my Raspberry Pi 5, I was able to squeeze out even more performance. The recipe? Changing the NVMe HAT, forcing PCIe Gen 3, overclocking the CPU, and …

1. The Baseline

To establish a starting point, the first test was executed with the CPU at 2.4 GHz and the PCIe bus at its default PCIe Gen 2 speed.

Under hdparm, the sequential read limit saturated at 416 MB/s. Forcing the Qwen2.5 (7B) model through this setup resulted in severe page thrashing. The model had to be reread from the swap partition roughly 19.5 times, transferring a total of 87.04 GB.

1.1 Pushing the controller: PCIe Gen 2 to 3

By upgrading to a compatible PCIe Gen 3 Top-HAT (Geekworm X1001) and adding the required lines to /boot/firmware/config.txt (dtparam=pciex1_gen=3), the raw hdparm throughput jumped to 779.55 MB/s — the physical saturation point of a single Gen 3 lane on the Pi 5.
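Before benchmarking, it helps to confirm that the setting is actually present in the config file. The following is a minimal sketch, assuming the file uses the plain `dtparam=pciex1_gen=N` form quoted above; the helper name `pcie_gen_from_config` is made up for illustration, and the sketch ignores conditional sections such as `[pi5]`.

```python
def pcie_gen_from_config(text):
    """Return the PCIe generation requested via dtparam=pciex1_gen=N, or None.

    Hypothetical helper: it only scans for the exact dtparam quoted in the
    article and ignores config.txt section filters.
    """
    gen = None
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if line.startswith("dtparam=pciex1_gen="):
            value = line.rsplit("=", 1)[1]
            if value.isdigit():
                gen = int(value)  # the last matching line wins
    return gen


if __name__ == "__main__":
    try:
        with open("/boot/firmware/config.txt") as fh:
            print("Requested PCIe generation:", pcie_gen_from_config(fh.read()))
    except FileNotFoundError:
        print("config.txt not found (not running on a Pi?)")
```

The config file only records what was requested. The negotiated link speed can be read from the `LnkSta` line of `lspci -vv`, where a Gen 3 link reports 8 GT/s.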

Metric	Run A	Run B
CPU Clock	2.4 GHz	2.4 GHz
PCIe Generation	Gen 2	Gen 3
Total Duration	320.74 s	224.71 s
Token Throughput	0.10 tps	0.16 tps
Time to First Token (TTFT)	65.81 s	62.79 s
Effective I/O Bandwidth	277.87 MB/s	397.04 MB/s
NVMe Read Activity	9,708 IOPS	12,103 IOPS
Hardware Reaction Time	0.93 ms	0.55 ms

Result: Unlocking the PCIe Gen 3 bus reduced the overall execution time by roughly 30%.
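The post does not show how the I/O rows in the table were collected. One plausible way to derive comparable numbers is to sample `/proc/diskstats` before and after a run. This is a sketch under that assumption: the device name `nvme0n1` is a guess, and "average read latency" here means time spent reading divided by completed reads, which may not match the author's "Hardware Reaction Time" exactly.

```python
def parse_diskstats(text, device):
    """Extract the read counters for one block device from /proc/diskstats."""
    for line in text.splitlines():
        f = line.split()
        # Fields: major, minor, name, reads completed, reads merged,
        # sectors read, milliseconds spent reading, ...
        if len(f) >= 7 and f[2] == device:
            return {"reads": int(f[3]), "sectors": int(f[5]), "read_ms": int(f[6])}
    raise KeyError(device)


def read_metrics(before, after, seconds):
    """Return (IOPS, MB/s, average read latency in ms) between two samples."""
    reads = after["reads"] - before["reads"]
    sectors = after["sectors"] - before["sectors"]
    read_ms = after["read_ms"] - before["read_ms"]
    iops = reads / seconds
    mb_per_s = sectors * 512 / 1e6 / seconds  # diskstats sectors are 512 bytes
    avg_ms = read_ms / reads if reads else 0.0
    return iops, mb_per_s, avg_ms
```

In practice, one would read `/proc/diskstats` once before starting the model and once after it finishes, then pass both samples and the elapsed wall time to `read_metrics`.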

1.2 Pushing Compute: CPU Overclocking

With adequate active cooling in place, the Broadcom SoC was overclocked by 16.67% to 2.8 GHz without increasing the voltage. This run defines the baseline for all subsequent software optimization steps.

Metric	Run B	Run C
CPU Clock	2.4 GHz	2.8 GHz
PCIe Generation	Gen 3	Gen 3
Total Duration	224.71 s	215.25 s
Token Throughput	0.16 tps	0.17 tps
Time to First Token (TTFT)	62.79 s	54.99 s
Effective I/O Bandwidth	397.04 MB/s	414.73 MB/s
NVMe Read Activity	12,103 IOPS	14,546 IOPS
Hardware Reaction Time	0.55 ms	0.54 ms

Result: The higher CPU clock lowered the execution time by another 4% and reduced the TTFT to 54.99 seconds. The effective I/O bandwidth stabilized at 414.73 MB/s. Note that this ~415 MB/s is not a limitation of the PCIe Gen 3 bus itself, but rather the ceiling of the specific random-read pattern generated by the page faults under this workload.

2. Further Tuning: The dead end

With the hardware running at maximum, the next step involved trying to tune the kernel’s virtual memory subsystem.

2.1 Challenges

The most challenging aspect of this project was engineering a reliable method to test the swap subsystem and extract stable, reproducible metrics. Benchmarking tools like fio will not help much in this scenario. When you use fio with --direct=1, it utilizes O_DIRECT to bypass the Linux page cache completely. While this is great for measuring raw hardware performance, it completely fails to simulate how an application triggers page faults under heavy memory pressure. To capture the true behavior of an LLM thrashing a tight memory space, you have to force actual kernel page evictions under a strict physical memory lock.
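The "physical memory lock" can be approximated from Python by allocating anonymous memory and pinning it with mlock(2) through ctypes. This is a minimal sketch, not the script used for these benchmarks: the function name `lock_ram` is hypothetical, and locking gigabytes usually needs root or a raised RLIMIT_MEMLOCK.

```python
import ctypes
import mmap
import os


def lock_ram(n_bytes):
    """Allocate n_bytes of anonymous memory and pin it in RAM with mlock(2).

    Returns the mmap object; the memory stays locked until it is closed or
    the process exits. Raises OSError if the kernel refuses the lock.
    """
    buf = mmap.mmap(-1, n_bytes)  # anonymous mapping, not backed by a file
    libc = ctypes.CDLL(None, use_errno=True)
    libc.mlock.argtypes = [ctypes.c_void_p, ctypes.c_size_t]
    libc.mlock.restype = ctypes.c_int
    # Take the start address of the mapping; the temporary ctypes view is
    # released again as soon as this expression has been evaluated.
    addr = ctypes.addressof(ctypes.c_char.from_buffer(buf))
    if libc.mlock(addr, n_bytes) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return buf


if __name__ == "__main__":
    try:
        pinned = lock_ram(int(2.5 * 1024**3))  # 2.5 GB, as in the test setup
        input("Memory locked; press Enter to release...")
    except OSError as exc:
        print("mlock failed:", exc)
```

The process has to stay alive for the duration of the benchmark, since the lock disappears when it exits.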

So I used a mix of Bash and Python scripts which:

clear the page cache,
re-create and re-mount swap on the Optane,
restart Ollama,
lock 2.5 GB of RAM,
run an LLM (Qwen2.5:7b).
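The preparation steps in the list above could be sketched roughly as follows. This is an illustration, not the original scripts: the swap device path is a placeholder, the `ollama` systemd unit name is an assumption, and the RAM lock and model run are left out. Everything needs root, and `mkswap` destroys whatever is on the target partition.

```python
import shlex
import subprocess


def build_plan(swap_device):
    """Return the preparation commands as argv lists (nothing is executed)."""
    return [
        ["sync"],                                # flush dirty pages first
        ["sysctl", "-w", "vm.drop_caches=3"],    # clear page cache, dentries, inodes
        ["swapoff", "-a"],                       # detach all existing swap
        ["mkswap", swap_device],                 # re-create swap (DESTROYS data)
        ["swapon", swap_device],                 # re-enable swap on the Optane
        ["systemctl", "restart", "ollama"],      # assumed service name
    ]


def run_plan(plan, dry_run=True):
    """Print the commands, and execute them only if dry_run is False."""
    rendered = [shlex.join(cmd) for cmd in plan]
    for cmd, text in zip(plan, rendered):
        print("+", text)
        if not dry_run:
            subprocess.run(cmd, check=True)  # stop at the first failure
    return rendered


if __name__ == "__main__":
    # "/dev/EXAMPLE_SWAP_PARTITION" is a placeholder, not a real device.
    run_plan(build_plan("/dev/EXAMPLE_SWAP_PARTITION"), dry_run=True)
```

Keeping the plan as data makes it easy to review the exact commands in a dry run before letting anything touch the disk.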

For the initial tuning steps, I forced Ollama to use `use_mmap: False` to ensure the model weights were explicitly hitting the anonymous memory space and forcing swap activity on the Optane. However, kernel-level swap tuning (zswap, zram, or tweaking vm.page-cluster) yielded no significant performance gains. In fact, increasing `vm.page-cluster` to 5 inflated the hardware reaction time to 0.7 ms due to unnecessary overhead, while values from 0 to 3 performed virtually identically.
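One way to see why a value of 5 adds overhead: `vm.page-cluster` is a base-2 logarithm, so the kernel reads up to 2^N consecutive pages per swap-in attempt. The sketch below just does that arithmetic. The page size is taken from the running system by default, so the results differ between kernels built with 4 KiB and 16 KiB pages.

```python
import os


def swap_readahead_bytes(page_cluster, page_size=None):
    """Upper bound on bytes read per swap-in for a given vm.page-cluster."""
    if page_size is None:
        page_size = os.sysconf("SC_PAGE_SIZE")  # page size of this system
    return (2 ** page_cluster) * page_size


if __name__ == "__main__":
    for n in (0, 3, 5):
        kib = swap_readahead_bytes(n) // 1024
        print(f"vm.page-cluster={n}: up to {kib} KiB per swap-in")
```

With 4 KiB pages, a setting of 5 means up to 128 KiB per fault instead of 4 KiB at a setting of 0. For a random access pattern, most of that extra data is unlikely to be needed.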

Metric	Run C	Run D	Run G
CPU Clock	2.8 GHz	2.8 GHz	2.8 GHz
PCIe Generation	Gen 3	Gen 3	Gen 3
Ollama mmap-Flag	False	False	True
Tweaks	Standard	vm.page-cluster = 3	Standard
Total Duration	215.25 s	213.66 s	211.05 s
Token Throughput	0.17 tps	0.17 tps	0.17 tps
Time to First Token (TTFT)	54.99 s	57.36 s	54.91 s
Effective I/O Bandwidth	414.73 MB/s	410.42 MB/s	414.93 MB/s
NVMe Read Activity	14,546 IOPS	12,436 IOPS	26,556 IOPS
Hardware Reaction Time	0.54 ms	0.42 ms	0.07 ms

Result: The real breakthrough occurred in Run G by switching back to `use_mmap: True`. The hardware reaction time collapsed to an incredible 0.07 ms. Architecturally, this makes perfect sense: memory-mapping the model file makes its pages file-backed rather than anonymous. Under memory pressure, the kernel simply discards clean mmap pages and streams them back via native filesystem readahead, bypassing the entire swap subsystem (and rendering parameters like vm.page-cluster irrelevant for the model weights). At the same time, the unhindered asynchronous queue pushed NVMe read activity to 26,556 IOPS.
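For reference, this is roughly how the flag can be toggled per request through Ollama's HTTP API. It is a sketch, assuming a local Ollama instance on the default port and that the installed version honors the `use_mmap` option. The helper names are made up, and the tokens-per-second formula (eval_count over eval_duration) may differ from how the tables above were computed.

```python
import json
import urllib.request


def build_request(model, prompt, use_mmap, num_predict=128):
    """Build the JSON body for POST /api/generate with an explicit mmap flag."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,  # return a single JSON object with timing fields
        "options": {"use_mmap": use_mmap, "num_predict": num_predict},
    }


def tokens_per_second(response):
    """Generation speed from Ollama's timing fields (durations are in ns)."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def generate(payload, url="http://localhost:11434/api/generate"):
    """Send the request to a locally running Ollama and return parsed JSON."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


if __name__ == "__main__":
    body = build_request("qwen2.5:7b", "Explain swap in one sentence.", use_mmap=True)
    print(json.dumps(body, indent=2))  # call generate(body) to actually run it
```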

I am somewhat disappointed that even though I tested several sysctl and kernel parameters, I did not find much remaining potential to improve this on a Raspberry Pi 5. I also evaluated NVMe block-layer polling with promising initial results, but it introduces a sharp trade-off: you have to isolate an entire CPU core for polling, forcing Ollama down to 3 cores, which ultimately yields the exact same performance as running Ollama on 4 cores without polling. At this point, I am clearly fighting the hardware limit (the single PCIe lane).

3. Benchmark with alternative Models

To see how this setup performs, the final phase moved away from the baseline script (which used a strict 128-token cap to isolate hardware limits) to a real-world workload.

For this benchmark, the models were fed a complex text-analysis and structured reasoning prompt. The task required the models to synthesize unstructured news data and output a structured analysis including a confidence score and a telegraphic reasoning block. Because these runs involve dynamic token generation lengths and multi-step reasoning, the processing times and tokens-per-second naturally shift compared to the hardware-bound baseline.

Here is how different LLM architectures handle this heavy reasoning pipeline on the optimized Pi 5 + Optane stack:

Model	Total Time	Tokens (In/Out)	Speed (t/s)
deepseek-r1:1.5b	70.01s (1:10 min)	626 / 377	9.57
gemma2:2b	62.34s (1:02 min)	642 / 55	5.16
qwen2.5:3b	78.19s (1:18 min)	652 / 53	4.49
qwen2.5:7b-instruct-q3_K_M	208.01s (3:28 min)	652 / 54	1.76
phi4-mini-reasoning:3.8b	103.01s (1:43 min)	611 / 30	1.04
llama3.1:8b-instruct-q3_K_M	362.90s (6:03 min)	608 / 63	0.37
qwen2.5:7b (Base)	367.77s (6:08 min)	652 / 49	0.26
llama3.1:8b (Base)	551.92s (9:12 min)	608 / 57	0.15
gemma2:9b	1884.27s (31:24 min)	642 / 61	0.04
qwen2.5:14b	6060.96s (101:01 min)	652 / 69	0.01
phi4	8150.56s (135:51 min)	608 / 89	0.01
ministral-3:14b	24416.55s (406:57 min)	1199 / 313	0.01

4. Conclusion

Optimizing the hardware and kernel parameters reduced the total benchmark duration from 320.74 seconds to 211.05 seconds — a 34% reduction in execution time.
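The percentages quoted in this post follow directly from the "Total Duration" rows of the tables. A short sketch to reproduce them:

```python
def reduction_pct(before, after):
    """Relative reduction in percent when going from `before` to `after`."""
    return (before - after) / before * 100


# Total durations in seconds, copied from the tables above.
durations = {"A": 320.74, "B": 224.71, "C": 215.25, "G": 211.05}

if __name__ == "__main__":
    for src, dst in (("A", "B"), ("B", "C"), ("A", "G")):
        pct = reduction_pct(durations[src], durations[dst])
        print(f"Run {src} -> Run {dst}: {pct:.1f}% less time")
```

This gives about 29.9% for the PCIe Gen 3 step, 4.2% for the overclock, and 34.2% overall.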

The token throughput of the 7B model flatlines at 0.17 tps because of a clear physical bottleneck: processing a single token requires the CPU to cycle through the entire ~4.3 GB of weights. At a real-world effective throughput limit of ~415 MB/s under page-fault conditions, the pipeline cannot be fed any faster.
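The bandwidth argument can be checked with a back-of-the-envelope calculation. The sketch below uses only the figures from this post (~415 MB/s, ~4.3 GB of weights, 0.17 tps) and treats 1 GB as 1000 MB. The interpretation in the comments is an assumption, not something that was measured.

```python
def tps_ceiling(bandwidth_mb_s, mb_per_token):
    """Tokens per second if every token must stream mb_per_token from disk."""
    return bandwidth_mb_s / mb_per_token


def implied_mb_per_token(bandwidth_mb_s, tps):
    """Data volume per token that a measured tps implies at a given bandwidth."""
    return bandwidth_mb_s / tps


if __name__ == "__main__":
    bw = 415.0        # MB/s, effective bandwidth under page faults
    weights = 4300.0  # MB, approximate size of the 7B model weights
    print(f"Ceiling if all weights are streamed: {tps_ceiling(bw, weights):.3f} tps")
    # Assumption: the gap to the measured 0.17 tps would be consistent with
    # part of the weights staying resident in RAM between tokens.
    print(f"Implied data per token at 0.17 tps: {implied_mb_per_token(bw, 0.17):.0f} MB")
```

Streaming the full weights for every token would cap throughput at roughly 0.1 tps. The measured 0.17 tps is in the same range and corresponds to about 2.4 GB read per token, so the disk-bandwidth explanation holds to within a factor of two.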

However, dropping the hardware reaction time to 70 microseconds (0.07 ms) via memory mapping significantly improves system responsiveness under full load. By switching to `use_mmap: True`, the kernel fetches only the strictly required blocks with minimal overhead directly via the file cache, allowing the Intel Optane to operate efficiently within its ultra-low-latency design parameters.