Squeezing FLOPS out of a PlayStation Portable
Exploring the Allegrex CPU, hand-tuning scalar code, writing VFPU assembly, porting to Rust, and benchmarking IO — all in pursuit of running a neural network on the console of my youth.
This project started with a simple question: can a 2004 handheld run modern ML inference?
Not as a toy demo, but something actually useful - specifically BirdNET, for recognizing bird songs from the PSP’s mic.
This post is the full path from first hello all the way to “can this thing hear birds?”: loading code on real hardware, reading MIPS output instruction-by-instruction, using the VFPU, porting to Rust, benchmarking IO, recording audio, and sanity-checking whether end-to-end inference is realistic.
Phase 1: Hello World — Understanding the Hardware
First step was getting anything to run on real hardware. The pspdev toolchain gives us a GCC cross-compiler targeting mipsel-psp-elf. On macOS:
curl -L https://github.com/pspdev/pspdev/releases/latest/download/pspdev-macos-latest-arm64.tar.gz | tar xz
xattr -rd com.apple.quarantine ./pspdev
export PSPDEV="$PWD/pspdev"
export PATH="$PSPDEV/bin:$PATH"
Hello world on the PSP looks like this:
#include <pspkernel.h>
#include <pspdebug.h>
#include <pspdisplay.h>
PSP_MODULE_INFO("Hello", 0, 1, 0);
PSP_MAIN_THREAD_ATTR(THREAD_ATTR_USER);
#define printf pspDebugScreenPrintf
int main(void) {
pspDebugScreenInit();
printf("Hello from PSP!\n");
while (1) { sceDisplayWaitVblankStart(); }
return 0;
}
Simple enough. But what actually happens between main() and pixels on the LCD?
The EBOOT.PBP Container
EBOOT.PBP is a container format that wraps an ELF together with metadata (PARAM.SFO, icons) for the XMB.
The ELF and DATA.PSP inside the EBOOT are slightly different. PSPSDK has a PSP_EBOOT target that strips debug sections and symbol tables from the ELF.
Linking and Relocation
A little compiler 101:
- GCC generates an object file with ELF type REL (relocatable).
- GCC puts startup code in .text.startup (not .text, surprisingly).
- On this MIPS core that helps because the instruction cache is only 16 KiB; we keep startup noise away from hot code.
Before linking, function calls look like this:
8: 0c000000 jal 0 <main> ; jump to address 0 (!)
jal 0 means jump to address 0. We haven’t linked yet, so the compiler writes relocation info:
Relocation section '.rel.text.startup':
Offset Info Type Sym. Name
00000008 00001004 R_MIPS_26 pspDebugScreenInit
8 bytes into .text.startup, patch the lower 26 bits with the target symbol address (pspDebugScreenInit), preserving the instruction bits (JAL opcode) via R_MIPS_26.
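The fixup itself is simple bit surgery. Here is a minimal sketch (a hypothetical helper, not SDK or linker code) of what applying an R_MIPS_26 relocation looks like, assuming the final target address is already known:

```c
#include <stdint.h>

/* Hypothetical sketch of an R_MIPS_26 fixup: keep the top 6 opcode bits,
 * write the target's word address into the low 26 bits. A real linker
 * also folds in the addend already stored in those bits. */
static uint32_t patch_r_mips_26(uint32_t insn, uint32_t target_addr) {
    uint32_t opcode = insn & 0xFC000000u;               /* JAL = 0x0C000000 */
    uint32_t index  = (target_addr >> 2) & 0x03FFFFFFu; /* 26-bit word index */
    return opcode | index;
}
```

The shift by 2 works because MIPS instructions are 4-byte aligned, so the low two address bits are always zero and don't need encoding; the upper 4 bits come from the PC at runtime.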
Execution: No MMU, Just Load and Go
The PSP has no MMU, so loading is pretty literal. sceKernelCreateThread allocates/initializes a TCB; sceKernelStartThread sets it READY and passes arguments.
Context switching is the fun part: we know what we need to load into registers, but we also need registers to run the context-switch code.
The Allegrex Instruction Cache
Cache geometry really matters here because the CPU budget is tiny. Allegrex has a 16 KiB, 2-way set-associative I-cache with 64-byte lines:
16 KiB / 64 B per line = 256 lines total
256 lines / 2 ways = 128 sets
Address decomposition (32 bits):
┌─────────────────────┬───────────┬──────────┐
│ Tag │ Set Index │ Offset │
│ (19 bits) │ (7 bits) │ (6 bits) │
└─────────────────────┴───────────┴──────────┘
Each set has 2 lines scanned in parallel for tag match.
Each line holds 16 instructions.
Only 256 cache lines total, so every instruction matters.
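To make the decomposition concrete, here's the index math as code (illustrative helpers I wrote for this post, not SDK functions). Note that two addresses exactly 8 KiB apart land in the same set, so with only 2 ways a third such alias evicts one of them:

```c
#include <stdint.h>

/* Illustrative I-cache index math for 64 B lines and 128 sets. */
enum { LINE_BITS = 6, SET_BITS = 7 };

static uint32_t icache_offset(uint32_t addr) { return addr & ((1u << LINE_BITS) - 1); }
static uint32_t icache_set(uint32_t addr)    { return (addr >> LINE_BITS) & ((1u << SET_BITS) - 1); }
static uint32_t icache_tag(uint32_t addr)    { return addr >> (LINE_BITS + SET_BITS); }
```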
What Else Is Inside the Tachyon SoC
Tachyon has more than just Allegrex:
- VFPU (COP2) — 128 32-bit registers, ~22 cycles for a 4×4 matmul.
- Media Engine — another MIPS core for media encoding/decoding (not directly accessible to us).
- VME — kind of a mystery/reconfigurable DSP.
- Graphics Engine — MMIO at 0xBD400000, command-driven.
Phase 2: Optimizing Scalar Matrix Multiply
Now for the real question: how fast can we multiply matrices on this thing?
Start with the obvious triple loop in C:
void mat_mul(const float *A, const float *B, float *C,
int M, int N, int K) {
for (int i = 0; i < M; i++)
for (int j = 0; j < N; j++) {
float sum = 0.0f;
for (int k = 0; k < K; k++)
sum += A[i*K + k] * B[k*N + j];
C[i*N + j] = sum;
}
}
Compiled with -O2, this comes out to 172 bytes of MIPS for mat_mul.
Here’s the tight scalar loop and the generated hot path:
for (int k = 0; k < K; k++)
    sum += A[i*K + k] * B[k*N + j];
34: lwc1 $f0, 0(v0) ; load A[i*K + k]
38: lwc1 $f2, 0(v1) ; load B[k*N + j]
3c: addiu v0, v0, 4 ; advance A pointer (contiguous)
40: addu v1, v1, t1 ; advance B pointer (strided)
44: mul.s $f0, $f0, $f2 ; single-precision multiply
48: bne a0, v0, 0x34 ; loop if more k values
4c: add.s $f1, $f1, $f0 ; (delay slot) sum += product
Reading the Assembly
Register allocation follows MIPS EABI conventions:
Arguments:
a0 = A (pointer to first matrix)
a1 = B (pointer to second matrix)
a2 = C (pointer to result matrix)
a3 = M (rows of A / rows of C)
t0 = N (cols of B / cols of C) — 5th arg in register (EABI)
t1 = K (cols of A / rows of B) — 6th arg in register (EABI)
That last add.s sits in the branch delay slot. On MIPS, the instruction after a branch always executes, so GCC gets accumulation + loop control overlap for free.
Eliminating Dead Code with __builtin_unreachable
Even at -O2, there are obvious inefficiencies:
- early exits for M <= 0 and N <= 0
- a zero-fill fallback path when K <= 0
For this benchmark, all dims are positive, so those paths are dead weight.
The zero-fill fallback is especially wasteful. If K <= 0, the inner loop never runs, but sum was initialized to 0.0f, so GCC emits a whole loop to store zeros:
38: addiu t4, t4, 1 ; j++
3c: sw zero, 0(t3) ; C[i*N + j] = 0
40: addiu t5, t5, 4 ; advance B pointer (unused!)
44: bne t0, t4, 0x38 ; loop until done
48: addiu t3, t3, 4 ; (delay slot) advance C pointer
We can tell GCC these branches are impossible via __builtin_unreachable:
if (M <= 0) __builtin_unreachable();
if (N <= 0) __builtin_unreachable();
if (K <= 0) __builtin_unreachable();
The Cascade Effect
Expected savings:
- 3 early-exit instructions
- 6 instructions from guard + zero-fill loop
Actual savings were bigger. We also lose:
- 2 instructions from duplicated return blocks
- 1 trampoline branch for final store
Once dead paths are removed, the last store just falls through naturally.
Result: 3 + 6 + 2 + 1 = 12 instructions removed. 172 bytes → 124 bytes.
mat_mul now fits in 2 cache lines instead of 3 - wahoo.
The original code layout had this structure:
0x00 [entry + guards]
0x10 [outer loop setup]
0x24 [middle loop start]
0x30 bgtz t1, 0x78 ───────┐ "if K > 0, goto inner loop"
0x38 [zero-fill fallback] │
0x4c [outer loop advance] │
0x60 [return] │
0x68 [store + advance j] │ ← trampoline target
0x78 [inner loop setup] ◄───┘
0x84 [inner loop body]
0x98 bne → 0x84 "loop back if more k"
0xa0 b → 0x68 ← extra branch to store
0xa8 [duplicate return]
After the optimization, it collapses to a clean linear flow:
0x00 [setup]
0x18 [outer loop]
0x28 [inner loop setup]
0x34 [inner loop body]
0x48 bne → 0x34 "loop back if more k"
0x50 [store + advance j] ← falls through, no trampoline
0x5c bne → 0x28 "next j"
0x64 [outer loop advance]
0x70 bne → 0x18 "next i"
0x78 [return]
Why Not Unroll?
Normally we’d consider unrolling a 7-instruction loop. Here it doesn’t buy much:
- branch overhead is minimal (delay slot already used productively)
- MIPS32 has no madd.s (that shows up in later ISAs)
- unrolling increases code size and register pressure
Not worth it on this hardware.
Phase 3: Unleashing the VFPU
We squeezed out what we could from scalar code, so now we go VFPU.
Register Architecture
VFPU has 128 32-bit registers, grouped as 8 primitive 4×4 matrices.
Important detail: sub-vectors/sub-matrices never cross primitive matrix boundaries. Addressing modes can alias the same physical registers, so some instructions require non-overlapping input/output regs.
Instruction suffixes are .s/.p/.t/.q, register prefixes are S/R/C/M.
Best reference is still the pspdev VFPU docs.
The Column-Major Surprise
There are a few SDK options. openTri has triMat4Mul, and disassembling it is… weird:
move v0, a0 ; save output pointer
lv.q C100, 0(a1) ; load rows of B as column vectors
lv.q C110, 16(a1)
lv.q C120, 32(a1)
lv.q C130, 48(a1)
lv.q C200, 0(a2) ; load rows of C as column vectors
lv.q C210, 16(a2)
lv.q C220, 32(a2)
lv.q C230, 48(a2)
vmmul.q E000, E200, E100 ; matrix multiply
sv.q C000, 0(a0) ; store result as column vectors
sv.q C010, 16(a0)
sv.q C020, 32(a0)
sv.q C030, 48(a0)
jr ra
It looks like row-major data is loaded as columns, order is reversed, then output is flipped back. Very convoluted.
I initially thought it was wrong and had Claude generate a PPSSPP test harness. The VFPU output looked transposed:
Scalar:      VFPU (tri):
19 22        23 34
43 50        31 46
Then it clicked: VFPU is naturally column-major (Sx01 is one row below Sx00), while C arrays are row-major.
So the “gymnastics” are transpose-in, multiply, transpose-out. Correct, but expensive if you include transpose cost.
A Direct Approach
Instead, we can just write inline asm that accepts normal C arrays:
void mat_mul_4x4_vfpu(float *C, const float *A, const float *B) {
__asm__ volatile(
"lv.q R000, 0(%1)\n" // row 0 of A → row 0 of M0
"lv.q R001, 16(%1)\n" // row 1 of A → row 1 of M0
"lv.q R002, 32(%1)\n" // row 2 of A → row 2 of M0
"lv.q R003, 48(%1)\n" // row 3 of A → row 3 of M0
"lv.q R100, 0(%2)\n" // row 0 of B → row 0 of M1
"lv.q R101, 16(%2)\n" // row 1 of B → row 1 of M1
"lv.q R102, 32(%2)\n" // row 2 of B → row 2 of M1
"lv.q R103, 48(%2)\n" // row 3 of B → row 3 of M1
"vmmul.q M200, M000, M100\n" // M2 = M0 × M1
"sv.q R200, 0(%0)\n" // store result rows
"sv.q R201, 16(%0)\n"
"sv.q R202, 32(%0)\n"
"sv.q R203, 48(%0)\n"
: : "r"(C), "r"(A), "r"(B) : "memory"
);
}
Profiling
PSP has profiler registers at 0xBC400000, but PPSSPP doesn’t back them. Trying pspDebugProfilerEnable segfaults because that mapped address isn’t actually allocated.
So in the emulator I used sceKernelGetSystemTimeLow over 10,000 iterations:
| Implementation | Emulator (PPSSPP) | Real PSP |
|---|---|---|
| Scalar 4×4 | 28,440 μs | 48,024 μs |
| VFPU (openTri, includes transpose work) | 16,877 μs | — |
| VFPU (inline asm) | 1,307 μs | 2,853 μs |
Our inline asm is almost 13× faster than openTri when openTri includes transpose work (16,877 μs vs 1,307 μs). If we move transpose work outside the benchmark loop, both converge around ~1,300 μs for 10,000 calls in emulation.
Sanity-Checking the Numbers
Quick sanity check:
- vmmul.q throughput ~16 cycles, latency ~22 cycles
- lower bound is roughly 8 loads + vmmul latency + 4 stores + loop overhead ≈ 37 cycles/op
- Allegrex is 333 MHz
1,300 μs × 333 cycles/μs = 432k cycles
432k / 10k iterations = 43.2 cycles/iteration
That lands near ~14% overhead, which is plausible for emulation.
Real hardware is different: the scalar→VFPU speedup is 21.7× in the emulator vs 16.8× on the PSP. That makes sense — the emulator can implement VFPU ops on a modern CPU without the real hardware's pipeline and memory constraints.
Enabling the VFPU for PRX Modules
One gotcha with PSPLINK: VFPU must be explicitly enabled via thread attributes. Without THREAD_ATTR_VFPU, first VFPU instruction throws “Coprocessor unusable”:
Exception - Coprocessor unusable
Thread ID - 0x042DB813
EPC - 0x08806B48
Cause - 0x2000002C
EPC points to the faulting instruction; disassembly confirms it’s the first lv.q. Fix is one flag:
PSP_MAIN_THREAD_ATTR(THREAD_ATTR_USER | THREAD_ATTR_VFPU);
Phase 4: Porting to Rust and Measuring IO
With C side validated, I moved to Rust via cargo-psp.
How cargo-psp Works
At a high level, cargo-psp does:
cargo build \
-Z build-std=core,compiler_builtins,alloc,panic_unwind,panic_abort \
--target mipsel-sony-psp \
--message-format=json-render-diagnostics
Then fixups/import handling, prxgen (ELF→PRX), MKSFO, then pbp-pack to make EBOOT.PBP.
The PSP’s Dynamic Linking Model
PSP syscall loading is beautifully weird. You call sceDisplayWaitVblankStart, but linked PRX still has unresolved stubs; firmware patches syscall targets at load time.
The kernel uses four sections for this:
// 1. Library names
.rodata.sceResident → "sceDisplay", "sceNet", ...
// 2. Function IDs (32-bit hashes)
.rodata.sceNid → sceDisplayWaitVblankStart = 0x984C27E7
// 3. Stub code (placeholder, patched by firmware)
.sceStub.text → jr $ra; syscall N
// 4. Index tying it all together
.lib.stub → { name, nid_table, stub_table, count }
Kernel walks .lib.stub, resolves library names/NIDs, and patches stubs. So you only ship stubs for functions you actually call.
.text .sceStub.text
───── ──────────────
jal 0xf1f8 ──────────────────► [placeholder data]
│
│ PSP loader patches
▼
jr $ra
syscall N
VFPU in Rust
Rust version uses vfpu_asm! (nightly + #![feature(asm_experimental_arch)]) and looks basically the same as C inline asm.
unsafe {
vfpu_asm!(
"lv.q R000, 0({0})",
"lv.q R001, 16({0})",
"lv.q R002, 32({0})",
"lv.q R003, 48({0})",
"lv.q R100, 0({1})",
"lv.q R101, 16({1})",
"lv.q R102, 32({1})",
"lv.q R103, 48({1})",
"vmmul.q M200, M000, M100",
"sv.q R200, 0({2})",
"sv.q R201, 16({2})",
"sv.q R202, 32({2})",
"sv.q R203, 48({2})",
in(reg) (a.as_ptr()),
in(reg) (b.as_ptr()),
in(reg) (c.as_mut_ptr()),
options(nostack),
);
}
IO Benchmarks: How Fast Can We Feed the Beast?
IO matters a lot here; we’re moving a lot of data relative to PSP memory/bandwidth.
During development, primary path is PSPLINK over USB (host0: via usbhostfs).
IoFileMgrForUser parses path prefixes (ms0:, host0:, etc.) and forwards to the right driver. usbhostfs read block max is 64 KiB:
#define HOSTFS_MAX_BLOCK (64*1024) // max read block
#define HOSTFS_BULK_MAXWRITE (1024*1024) // max write block
Before DMA, driver calls sceKernelDcacheWritebackRange so USB doesn’t read stale dirty-cache data.
Benchmarking 4 MiB transfers (64 KiB blocks) on real hardware:
| Direction | Bandwidth | Time (4 MiB) |
|---|---|---|
| Read (USB → PSP) | 22.81 MiB/s | 175 ms |
| Write (PSP → USB) | 14.90 MiB/s | 268 ms |
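The times in that table follow directly from size over bandwidth. A trivial helper (my own, purely for cross-checking the numbers):

```c
/* Transfer time in milliseconds for a payload of `mib` MiB
 * at `mib_per_s` MiB/s. */
static double xfer_ms(double mib, double mib_per_s) {
    return mib / mib_per_s * 1000.0;
}
```

4 MiB at 22.81 MiB/s gives ~175 ms, and at 14.90 MiB/s gives ~268 ms, matching the measurements.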
Can We Actually Run a Neural Network?
With an eye to BirdNET v2.4, let’s do the math:
| Parameter | Value |
|---|---|
| Model size (FP32) | 50.5 MB |
| Load time over USB | ~2.21 s |
| Total FLOPs | 826 MFLOP |
| Measured VFPU throughput | 348 MFLOP/s |
| Inference time (estimate) | ~2.37 s |
| RAM remaining after model | ~16 MB |
Throughput estimate comes from what I measured on PSP with 10,000 4×4 matmuls: each is 64 multiplies + 48 adds = 112 FLOPs, giving ~348 MFLOP/s.
Verdict: ~5s from cold start to first inference. Tight, but workable.
Also, 0.348 GFLOP/s is far below the VFPU’s 3.2 GFLOP/s theoretical peak. There’s headroom if we’re clever.
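For reference, the 112-FLOP figure falls out of the dot-product structure: each of the 16 outputs of a 4×4 matmul is a 4-term dot product. A quick sketch of the accounting:

```c
/* FLOPs in an n×n matmul: n*n outputs, each a dot product
 * of n multiplies and n-1 adds. */
static int matmul_flops(int n) {
    int muls = n * n * n;        /* 64 for n = 4 */
    int adds = n * n * (n - 1);  /* 48 for n = 4 */
    return muls + adds;          /* 112 for n = 4 */
}
```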
Phase 5: Recording Audio on the PSP
To record bird sounds, we need mic access.
I wanted to prototype whether the PSP's audio quality is good enough for BirdNET via https://birdnet.cornell.edu/demo/, but the only recorder homebrew I found was PSP Audio Recorder (2006), and it runs as a kernel module, which is cumbersome (it can't run over PSPLINK, for example).
So I rewrote a similar recorder in Rust.
The Kernel Module Problem
The original C audio recorder uses a two-stage loader pattern typical of PSP kernel modules. A user-mode stub loads the actual kernel PRX:
PSP_MODULE_INFO("BOOT_PRX", 0x1000, 1, 1); // 0x1000 = kernel mode
int main_thread(SceSize args, void *argp) {
pspKernelSetKernelPC();
pspSdkInstallNoDeviceCheckPatch();
pspSdkInstallNoPlainModuleCheckPatch();
// ... load and start record.prx
mod = sceKernelLoadModule(path, 0, NULL);
mod = sceKernelStartModule(mod, ...);
}
That flow needs kernel privileges, CFW patch calls, and a separate PRX. Bad dev loop.
Good news: sceAudioInputBlocking works fine in user mode on modern CFW, so we can skip the kernel-module path.
A User-Mode Recorder in Rust
Rust implementation is straightforward: sceAudioInputInit, then sceAudioInputBlocking in a loop with 1024-sample chunks at 44.1 kHz, 16-bit mono:
fn record_chunk(&mut self) {
let mut buf = [0i16; 1024];
unsafe {
sceAudioInputBlocking(
buf.len() as i32,
self.get_input_freq(),
buf.as_mut_ptr() as *mut c_void,
);
}
self.samples.extend_from_slice(&buf);
}
Samples go into Vec<i16> (extern crate alloc), then we write WAV with a manual 44-byte header (RIFF/WAVE + PCM payload).
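The 44-byte header is the canonical PCM WAV preamble. Here's a C sketch of the same logic the Rust code implements (field offsets are from the standard RIFF/WAVE layout; this assumes a little-endian host, which matches the PSP's mipsel):

```c
#include <stdint.h>
#include <string.h>

/* Build the canonical 44-byte WAV header for 16-bit mono PCM.
 * data_len is the PCM payload size in bytes. Little-endian host assumed. */
static void wav_header(uint8_t h[44], uint32_t sample_rate, uint32_t data_len) {
    uint32_t riff_len = 36 + data_len;        /* file size minus 8 */
    uint32_t fmt_len  = 16;                   /* PCM fmt chunk size */
    uint16_t pcm = 1, channels = 1, bits = 16;
    uint16_t block_align = channels * bits / 8;
    uint32_t byte_rate   = sample_rate * block_align;
    memcpy(h +  0, "RIFF", 4);      memcpy(h +  4, &riff_len, 4);
    memcpy(h +  8, "WAVEfmt ", 8);  memcpy(h + 16, &fmt_len, 4);
    memcpy(h + 20, &pcm, 2);        memcpy(h + 22, &channels, 2);
    memcpy(h + 24, &sample_rate, 4); memcpy(h + 28, &byte_rate, 4);
    memcpy(h + 32, &block_align, 2); memcpy(h + 34, &bits, 2);
    memcpy(h + 36, "data", 4);      memcpy(h + 40, &data_len, 4);
}
```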
Offline-First Design
Practical design constraint: PSP won’t always be tethered over USB.
So recorder is offline-first:
- if host0: is available, write WAV directly to the host
- otherwise cache to ms0:/PSP/MUSIC/AUDREC/
- next time USB is connected, upload cached recordings
We detect connectivity by attempting to open host0:/ as a directory — if sceIoDopen returns a non-negative file descriptor, we’re connected:
fn is_host0_available() -> bool {
let fd = unsafe { sceIoDopen(b"host0:/\0".as_ptr()) };
if fd.0 >= 0 {
unsafe { sceIoDclose(fd) };
true
} else { false }
}
Phase 6: Can the PSP’s Mic Hear Birds?
Before writing the whole inference stack, first question was simple: is PSP mic quality even good enough for BirdNET?
I recorded a few bird songs from my living room (they’re pretty far away) and uploaded them to birdnet.cornell.edu/demo.
It worked. BirdNET picked up a Dark-eyed Junco (Junco hyemalis) from a PSP recording.
Validation: PSP mic quality is good enough for bird-song recognition. End-to-end is feasible; the rest is engineering.
Phase 7: Error Handling in no_std Rust
As the codebase grew, debugging got painful. PSP syscalls return negative i32 values like -2147418110, which is not super readable.
PSP Error Code Anatomy
PSP error codes follow 0x8XYYNNNN:
| Prefix | Category |
|---|---|
| 0x8000XXXX | Common/generic errors |
| 0x8001XXXX | POSIX errno (maps directly to C errno values) |
| 0x8002XXXX | Kernel errors |
| 0x8021XXXX | UMD errors |
0x8001XXXX maps directly to POSIX errno values (0x80010002 = ENOENT, 0x8001000D = EACCES, etc.).
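That mapping means the low 16 bits of a 0x8001XXXX code are a plain C errno value, which a decoder can exploit directly (illustrative helper, not SDK code):

```c
#include <stdint.h>

/* For 0x8001XXXX codes, the low 16 bits are the POSIX errno value.
 * Returns 0 for codes outside the POSIX class. */
static int psp_posix_errno(uint32_t code) {
    if ((code & 0xFFFF0000u) != 0x80010000u) return 0;
    return (int)(code & 0xFFFFu);
}
```

So 0x80010002 decodes to errno 2 (ENOENT) and 0x8001000D to 13 (EACCES).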
Why Not Just Use strerror()?
First instinct was to call newlib strerror() via FFI:
#[link(name = "c")]
extern "C" {
fn strerror(errnum: c_int) -> *const c_char;
}
Doesn’t work. rust-psp is pure no_std and doesn’t link newlib/libc.
When force-linking libc.a:
- linker couldn’t find it without explicit search path
- once found, ABI/binding errors showed up (symbol has invalid binding: 0)
So practically: not worth invasive rust-psp build changes just for error strings.
A Manual Lookup Table
Practical solution is a manual lookup table:
fn psp_strerror(code: i32) -> &'static str {
let code = code as u32;
match code {
0x80010002 => "ENOENT: No such file or directory",
0x8001000D => "EACCES: Permission denied",
0x80000022 => "SCE_ERROR_OUT_OF_MEMORY",
// ... ~30 more entries
_ => match code & 0xFFFF0000 {
0x80010000 => "Unknown POSIX errno",
0x80020000 => "Unknown kernel error",
_ => "Unknown error",
}
}
}
Fallback masks upper 16 bits so unknown codes are still categorized by subsystem. Works in no_std with zero dependencies.
// Before:
dprintln!("Error: {}", fd.0);
// Error: -2147418110
// After:
dprintln!("Error: {:#010x} {}", fd.0 as u32, psp_strerror(fd.0));
// Error: 0x80010002 ENOENT: No such file or directory
Appendix: im2col Visualization
I had Gemini sketch an im2col visualizer, so I cleaned it up and embedded it here as an appendix.
Use the mode switch to compare direct convolution vs im2col/GEMM form, then step through the 16 positions.
[Interactive visualizer: a 2×2 kernel slides over a 5×5 image; each step computes one full receptive-field dot product and stores one value in the 4×4 output feature map.]
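For a non-interactive version of the same idea, here's a minimal im2col sketch for the 2×2-kernel-on-5×5-image case (my own toy code, using a different input image than the visualizer, so the numbers differ): each of the 16 positions becomes one 4-element row, and the whole convolution collapses to a 16×4 matrix times a 4-element kernel vector.

```c
enum { IN = 5, K = 2, OUT = IN - K + 1 };  /* 5x5 image, 2x2 kernel, 4x4 output */

/* Unroll every receptive field into one row of `cols`. */
static void im2col(const float *img, float cols[OUT * OUT][K * K]) {
    for (int y = 0; y < OUT; y++)
        for (int x = 0; x < OUT; x++)
            for (int ky = 0; ky < K; ky++)
                for (int kx = 0; kx < K; kx++)
                    cols[y * OUT + x][ky * K + kx] = img[(y + ky) * IN + (x + kx)];
}

/* One output value = one patch-kernel dot product. */
static float dot4(const float *a, const float *b) {
    return a[0]*b[0] + a[1]*b[1] + a[2]*b[2] + a[3]*b[3];
}
```

With a ramp image (pixel value = index), patch #0 is {0, 1, 5, 6}, so an all-ones kernel gives output[0][0] = 12.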
What I Learned
Main takeaways:
- constraints breed creativity (64 MB RAM, 333 MHz, 4×4 VFPU model)
- old hardware + modern compiler thinking actually works
- incremental de-risking was huge (audio validation before full inference work)
- rabbit holes were half the fun (NIDs, CRT/linking, VFPU quirks, PSP IO stack)
The most satisfying moment was uploading a WAV recorded on 2005 hardware and seeing BirdNET identify the bird.
The PSP mic works, the VFPU has headroom, and the IO budget is tight but feasible. Next step is obvious: build the full on-device inference runtime.
Built with a cross-compiler, a lot of objdump, and a PSP that still has battery life after 20 years.