Squeezing FLOPS out of a PlayStation Portable
Exploring the Allegrex CPU, hand-tuning scalar code, writing VFPU assembly, porting to Rust, and benchmarking IO — all in pursuit of running a neural network on the console of my youth.
This project started with a simple question: can a 2004 handheld run modern ML inference?
Not as a toy demo, but something actually useful - specifically BirdNET, for recognizing bird songs from the PSP’s mic.
This post is the full path from first hello all the way to “can this thing hear birds?”: loading code on real hardware, reading MIPS output instruction-by-instruction, using the VFPU, porting to Rust, benchmarking IO, recording audio, and sanity-checking whether end-to-end inference is realistic.
Phase 1: Hello World — Understanding the Hardware
First step was getting anything to run on real hardware. The pspdev toolchain gives us a GCC cross-compiler targeting mipsel-psp-elf. On macOS:
curl -L https://github.com/pspdev/pspdev/releases/latest/download/pspdev-macos-latest-arm64.tar.gz | tar xz
xattr -rd com.apple.quarantine ./pspdev
export PSPDEV="$PWD/pspdev"
export PATH="$PSPDEV/bin:$PATH"
Hello world on the PSP looks like this:
#include <pspkernel.h>
#include <pspdebug.h>
#include <pspdisplay.h>
PSP_MODULE_INFO("Hello", 0, 1, 0);
PSP_MAIN_THREAD_ATTR(THREAD_ATTR_USER);
#define printf pspDebugScreenPrintf
int main(void) {
pspDebugScreenInit();
printf("Hello from PSP!\n");
while (1) { sceDisplayWaitVblankStart(); }
return 0;
}
Simple enough. But what actually happens between main() and pixels on the LCD?
The EBOOT.PBP Container
EBOOT.PBP is a container format that wraps an ELF together with metadata (PARAM.SFO, icons) for the XMB.
The ELF and DATA.PSP inside the EBOOT are slightly different. PSPSDK has a PSP_EBOOT target that strips debug sections and symbol tables from the ELF.
Linking and Relocation
A little compiler 101:
- GCC generates an object file with ELF type REL (relocatable).
- GCC puts startup code in .text.startup (not .text, surprisingly).
- On this MIPS core that helps because the instruction cache is only 16 KiB; we keep startup noise away from hot code.
Before linking, function calls look like this:
8: 0c000000 jal 0 <main> ; jump to address 0 (!)
jal 0 means jump to address 0. We haven’t linked yet, so the compiler writes relocation info:
Relocation section '.rel.text.startup':
Offset Info Type Sym. Name
00000008 00001004 R_MIPS_26 pspDebugScreenInit
8 bytes into .text.startup, patch the lower 26 bits with the target symbol address (pspDebugScreenInit), preserving the instruction bits (JAL opcode) via R_MIPS_26.
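The fixup itself is simple bit surgery. Here is a minimal sketch (a hypothetical helper, not SDK or linker code) of what applying an R_MIPS_26 relocation looks like, assuming the final target address is already known:

```c
#include <stdint.h>

/* Hypothetical sketch of an R_MIPS_26 fixup: keep the top 6 opcode bits,
 * write the target's word address into the low 26 bits. A real linker
 * also folds in the addend already stored in those bits. */
static uint32_t patch_r_mips_26(uint32_t insn, uint32_t target_addr) {
    uint32_t opcode = insn & 0xFC000000u;               /* JAL = 0x0C000000 */
    uint32_t index  = (target_addr >> 2) & 0x03FFFFFFu; /* 26-bit word index */
    return opcode | index;
}
```

The shift by 2 works because MIPS instructions are 4-byte aligned, so the low two address bits are always zero and don't need encoding; the upper 4 bits come from the PC at runtime.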
Execution: No MMU, Just Load and Go
The PSP has no MMU, so loading is pretty literal. sceKernelCreateThread allocates/initializes a TCB; sceKernelStartThread sets it READY and passes arguments.
Context switching is the fun part: we know what we need to load into registers, but we also need registers to run the context-switch code.
The Allegrex Instruction Cache
Cache geometry really matters here because the CPU budget is tiny. Allegrex has a 16 KiB, 2-way set-associative I-cache with 64-byte lines:
16 KiB / 64 B per line = 256 lines total
256 lines / 2 ways = 128 sets
Address decomposition (32 bits):
┌─────────────────────┬───────────┬──────────┐
│ Tag │ Set Index │ Offset │
│ (19 bits) │ (7 bits) │ (6 bits) │
└─────────────────────┴───────────┴──────────┘
Each set has 2 lines scanned in parallel for tag match.
Each line holds 16 instructions.
Only 256 cache lines total, so every instruction matters.
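To make the decomposition concrete, here's the index math as code (illustrative helpers I wrote for this post, not SDK functions). Note that two addresses exactly 8 KiB apart land in the same set, so with only 2 ways a third such alias evicts one of them:

```c
#include <stdint.h>

/* Illustrative I-cache index math for 64 B lines and 128 sets. */
enum { LINE_BITS = 6, SET_BITS = 7 };

static uint32_t icache_offset(uint32_t addr) { return addr & ((1u << LINE_BITS) - 1); }
static uint32_t icache_set(uint32_t addr)    { return (addr >> LINE_BITS) & ((1u << SET_BITS) - 1); }
static uint32_t icache_tag(uint32_t addr)    { return addr >> (LINE_BITS + SET_BITS); }
```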
What Else Is Inside the Tachyon SoC
Tachyon has more than just Allegrex:
- VFPU (COP2) — 128 32-bit registers, ~22 cycles for a 4×4 matmul.
- Media Engine — another MIPS core for media encoding/decoding (not directly accessible to us).
- VME — kind of a mystery/reconfigurable DSP.
- Graphics Engine — MMIO at 0xBD400000, command-driven.
Phase 2: Optimizing Scalar Matrix Multiply
Now for the real question: how fast can we multiply matrices on this thing?
Start with the obvious triple loop in C:
void mat_mul(const float *A, const float *B, float *C,
int M, int N, int K) {
for (int i = 0; i < M; i++)
for (int j = 0; j < N; j++) {
float sum = 0.0f;
for (int k = 0; k < K; k++)
sum += A[i*K + k] * B[k*N + j];
C[i*N + j] = sum;
}
}
Compiled with -O2, this comes out to 172 bytes of MIPS for mat_mul.
Here’s the tight scalar loop and the generated hot path:
for (int k = 0; k < K; k++)
    sum += A[i*K + k] * B[k*N + j];
34: lwc1 $f0, 0(v0) ; load A[i*K + k]
38: lwc1 $f2, 0(v1) ; load B[k*N + j]
3c: addiu v0, v0, 4 ; advance A pointer (contiguous)
40: addu v1, v1, t1 ; advance B pointer (strided)
44: mul.s $f0, $f0, $f2 ; single-precision multiply
48: bne a0, v0, 0x34 ; loop if more k values
4c: add.s $f1, $f1, $f0 ; (delay slot) sum += product
Reading the Assembly
Register allocation follows MIPS EABI conventions:
Arguments:
a0 = A (pointer to first matrix)
a1 = B (pointer to second matrix)
a2 = C (pointer to result matrix)
a3 = M (rows of A / rows of C)
t0 = N (cols of B / cols of C) — 5th arg in register (EABI)
t1 = K (cols of A / rows of B) — 6th arg in register (EABI)
That last add.s sits in the branch delay slot. On MIPS, the instruction after a branch always executes, so GCC gets accumulation + loop control overlap for free.
Eliminating Dead Code with __builtin_unreachable
Even at -O2, there are obvious inefficiencies:
- early exits for M <= 0 and N <= 0
- a zero-fill fallback path when K <= 0
For this benchmark, all dims are positive, so those paths are dead weight.
The zero-fill fallback is especially wasteful. If K <= 0, the inner loop never runs, but sum was initialized to 0.0f, so GCC emits a whole loop to store zeros:
38: addiu t4, t4, 1 ; j++
3c: sw zero, 0(t3) ; C[i*N + j] = 0
40: addiu t5, t5, 4 ; advance B pointer (unused!)
44: bne t0, t4, 0x38 ; loop until done
48: addiu t3, t3, 4 ; (delay slot) advance C pointer
We can tell GCC these branches are impossible via __builtin_unreachable:
if (M <= 0) __builtin_unreachable();
if (N <= 0) __builtin_unreachable();
if (K <= 0) __builtin_unreachable();
The Cascade Effect
Expected savings:
- 3 early-exit instructions
- 6 instructions from guard + zero-fill loop
Actual savings were bigger. We also lose:
- 2 instructions from duplicated return blocks
- 1 trampoline branch for final store
Once dead paths are removed, the last store just falls through naturally.
Result: 3 + 6 + 2 + 1 = 12 instructions removed. 172 bytes → 124 bytes.
mat_mul now fits in 2 cache lines instead of 3 - wahoo.
The original code layout had this structure:
0x00 [entry + guards]
0x10 [outer loop setup]
0x24 [middle loop start]
0x30 bgtz t1, 0x78 ───────┐ "if K > 0, goto inner loop"
0x38 [zero-fill fallback] │
0x4c [outer loop advance] │
0x60 [return] │
0x68 [store + advance j] │ ← trampoline target
0x78 [inner loop setup] ◄───┘
0x84 [inner loop body]
0x98 bne → 0x84 "loop back if more k"
0xa0 b → 0x68 ← extra branch to store
0xa8 [duplicate return]
After the optimization, it collapses to a clean linear flow:
0x00 [setup]
0x18 [outer loop]
0x28 [inner loop setup]
0x34 [inner loop body]
0x48 bne → 0x34 "loop back if more k"
0x50 [store + advance j] ← falls through, no trampoline
0x5c bne → 0x28 "next j"
0x64 [outer loop advance]
0x70 bne → 0x18 "next i"
0x78 [return]
Why Not Unroll?
Normally we’d consider unrolling a 7-instruction loop. Here it doesn’t buy much:
- branch overhead is minimal (delay slot already used productively)
- MIPS32 has no madd.s (that shows up in later ISAs)
- unrolling increases code size and register pressure
Not worth it on this hardware.
Phase 3: Unleashing the VFPU
We squeezed out what we could from scalar code, so now we go VFPU.
Register Architecture
VFPU has 128 32-bit registers, grouped as 8 primitive 4×4 matrices.
Important detail: sub-vectors/sub-matrices never cross primitive matrix boundaries. Addressing modes can alias the same physical registers, so some instructions require non-overlapping input/output regs.
Instruction suffixes are .s/.p/.t/.q, register prefixes are S/R/C/M.
Best reference is still the pspdev VFPU docs.
The Column-Major Surprise
There are a few SDK options. openTri has triMat4Mul, and disassembling it is… weird:
move v0, a0 ; save output pointer
lv.q C100, 0(a1) ; load rows of B as column vectors
lv.q C110, 16(a1)
lv.q C120, 32(a1)
lv.q C130, 48(a1)
lv.q C200, 0(a2) ; load rows of C as column vectors
lv.q C210, 16(a2)
lv.q C220, 32(a2)
lv.q C230, 48(a2)
vmmul.q E000, E200, E100 ; matrix multiply
sv.q C000, 0(a0) ; store result as column vectors
sv.q C010, 16(a0)
sv.q C020, 32(a0)
sv.q C030, 48(a0)
jr ra
It looks like row-major data is loaded as columns, order is reversed, then output is flipped back. Very convoluted.
I initially thought it was wrong and had Claude generate a PPSSPP test harness. The VFPU output looked transposed:
Scalar:      VFPU (tri):
19 22        23 34
43 50        31 46
Then it clicked: VFPU is naturally column-major (Sx01 is one row below Sx00), while C arrays are row-major.
So the “gymnastics” are transpose-in, multiply, transpose-out. Correct, but expensive if you include transpose cost.
A Direct Approach
Instead, we can just write inline asm that accepts normal C arrays:
void mat_mul_4x4_vfpu(float *C, const float *A, const float *B) {
__asm__ volatile(
"lv.q R000, 0(%1)\n" // row 0 of A → row 0 of M0
"lv.q R001, 16(%1)\n" // row 1 of A → row 1 of M0
"lv.q R002, 32(%1)\n" // row 2 of A → row 2 of M0
"lv.q R003, 48(%1)\n" // row 3 of A → row 3 of M0
"lv.q R100, 0(%2)\n" // row 0 of B → row 0 of M1
"lv.q R101, 16(%2)\n" // row 1 of B → row 1 of M1
"lv.q R102, 32(%2)\n" // row 2 of B → row 2 of M1
"lv.q R103, 48(%2)\n" // row 3 of B → row 3 of M1
"vmmul.q M200, M000, M100\n" // M2 = M0 × M1
"sv.q R200, 0(%0)\n" // store result rows
"sv.q R201, 16(%0)\n"
"sv.q R202, 32(%0)\n"
"sv.q R203, 48(%0)\n"
: : "r"(C), "r"(A), "r"(B) : "memory"
);
}
Profiling
PSP has profiler registers at 0xBC400000, but PPSSPP doesn’t back them. Trying pspDebugProfilerEnable segfaults because that mapped address isn’t actually allocated.
So in the emulator I used sceKernelGetSystemTimeLow over 10,000 iterations:
| Implementation | Emulator (PPSSPP) | Real PSP |
|---|---|---|
| Scalar 4×4 | 28,440 μs | 48,024 μs |
| VFPU (openTri, includes transpose work) | 16,877 μs | — |
| VFPU (inline asm) | 1,307 μs | 2,853 μs |
Our inline asm is almost 13× faster than openTri when openTri includes transpose work (16,877 μs vs 1,307 μs). If we move transpose work outside the benchmark loop, both converge around ~1,300 μs for 10,000 calls in emulation.
Sanity-Checking the Numbers
Quick sanity check:
- vmmul.q throughput ~16 cycles, latency ~22 cycles
- lower bound is roughly 8 loads + vmmul latency + 4 stores + loop overhead ≈ 37 cycles/op
- Allegrex is 333 MHz
1,300 μs × 333 cycles/μs = 432k cycles
432k / 10k iterations = 43.2 cycles/iteration
That lands near ~14% overhead, which is plausible for emulation.
Real hardware is different: the scalar→VFPU speedup is 21.7× in the emulator vs 16.8× on the PSP. That makes sense — the emulator can implement VFPU ops on a modern CPU without the real hardware's pipeline and memory constraints.
Enabling the VFPU for PRX Modules
One gotcha with PSPLINK: VFPU must be explicitly enabled via thread attributes. Without THREAD_ATTR_VFPU, first VFPU instruction throws “Coprocessor unusable”:
Exception - Coprocessor unusable
Thread ID - 0x042DB813
EPC - 0x08806B48
Cause - 0x2000002C
EPC points to the faulting instruction; disassembly confirms it’s the first lv.q. Fix is one flag:
PSP_MAIN_THREAD_ATTR(THREAD_ATTR_USER | THREAD_ATTR_VFPU);
Phase 4: Porting to Rust and Measuring IO
With C side validated, I moved to Rust via cargo-psp.
How cargo-psp Works
At a high level, cargo-psp does:
cargo build \
-Z build-std=core,compiler_builtins,alloc,panic_unwind,panic_abort \
--target mipsel-sony-psp \
--message-format=json-render-diagnostics
Then fixups/import handling, prxgen (ELF→PRX), MKSFO, then pbp-pack to make EBOOT.PBP.
The PSP’s Dynamic Linking Model
PSP syscall loading is beautifully weird. You call sceDisplayWaitVblankStart, but linked PRX still has unresolved stubs; firmware patches syscall targets at load time.
The kernel uses four sections for this:
// 1. Library names
.rodata.sceResident → "sceDisplay", "sceNet", ...
// 2. Function IDs (32-bit hashes)
.rodata.sceNid → sceDisplayWaitVblankStart = 0x984C27E7
// 3. Stub code (placeholder, patched by firmware)
.sceStub.text → jr $ra; syscall N
// 4. Index tying it all together
.lib.stub → { name, nid_table, stub_table, count }
Kernel walks .lib.stub, resolves library names/NIDs, and patches stubs. So you only ship stubs for functions you actually call.
.text .sceStub.text
───── ──────────────
jal 0xf1f8 ──────────────────► [placeholder data]
│
│ PSP loader patches
▼
jr $ra
syscall N
VFPU in Rust
Rust version uses vfpu_asm! (nightly + #![feature(asm_experimental_arch)]) and looks basically the same as C inline asm.
unsafe {
vfpu_asm!(
"lv.q R000, 0({0})",
"lv.q R001, 16({0})",
"lv.q R002, 32({0})",
"lv.q R003, 48({0})",
"lv.q R100, 0({1})",
"lv.q R101, 16({1})",
"lv.q R102, 32({1})",
"lv.q R103, 48({1})",
"vmmul.q M200, M000, M100",
"sv.q R200, 0({2})",
"sv.q R201, 16({2})",
"sv.q R202, 32({2})",
"sv.q R203, 48({2})",
in(reg) (a.as_ptr()),
in(reg) (b.as_ptr()),
in(reg) (c.as_mut_ptr()),
options(nostack),
);
}
IO Benchmarks: How Fast Can We Feed the Beast?
IO matters a lot here; we’re moving a lot of data relative to PSP memory/bandwidth.
During development, primary path is PSPLINK over USB (host0: via usbhostfs).
IoFileMgrForUser parses path prefixes (ms0:, host0:, etc.) and forwards to the right driver. usbhostfs read block max is 64 KiB:
#define HOSTFS_MAX_BLOCK (64*1024) // max read block
#define HOSTFS_BULK_MAXWRITE (1024*1024) // max write block
Before DMA, driver calls sceKernelDcacheWritebackRange so USB doesn’t read stale dirty-cache data.
Benchmarking 4 MiB transfers (64 KiB blocks) on real hardware:
| Direction | Bandwidth | Time (4 MiB) |
|---|---|---|
| Read (USB → PSP) | 22.81 MiB/s | 175 ms |
| Write (PSP → USB) | 14.90 MiB/s | 268 ms |
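The times in that table follow directly from size over bandwidth. A trivial helper (my own, purely for cross-checking the numbers):

```c
/* Transfer time in milliseconds for a payload of `mib` MiB
 * at `mib_per_s` MiB/s. */
static double xfer_ms(double mib, double mib_per_s) {
    return mib / mib_per_s * 1000.0;
}
```

4 MiB at 22.81 MiB/s gives ~175 ms, and at 14.90 MiB/s gives ~268 ms, matching the measurements.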
Can We Actually Run a Neural Network?
With an eye to BirdNET v2.4, let’s do the math:
| Parameter | Value |
|---|---|
| Model size (FP32) | 50.5 MB |
| Load time over USB | ~2.21 s |
| Total FLOPs | 826 MFLOP |
| Measured VFPU throughput | 348 MFLOP/s |
| Inference time (estimate) | ~2.37 s |
| RAM remaining after model | ~16 MB |
Throughput estimate comes from what I measured on PSP with 10,000 4×4 matmuls: each is 64 multiplies + 48 adds = 112 FLOPs, giving ~348 MFLOP/s.
Verdict: ~5s from cold start to first inference. Tight, but workable.
Also, 0.348 GFLOP/s is far below the VFPU’s 3.2 GFLOP/s theoretical peak. There’s headroom if we’re clever.
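For reference, the 112-FLOP figure falls out of the dot-product structure: each of the 16 outputs of a 4×4 matmul is a 4-term dot product. A quick sketch of the accounting:

```c
/* FLOPs in an n×n matmul: n*n outputs, each a dot product
 * of n multiplies and n-1 adds. */
static int matmul_flops(int n) {
    int muls = n * n * n;        /* 64 for n = 4 */
    int adds = n * n * (n - 1);  /* 48 for n = 4 */
    return muls + adds;          /* 112 for n = 4 */
}
```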
Phase 5: Recording Audio on the PSP
To record bird sounds, we need mic access.
I wanted to prototype whether the PSP's audio quality is good enough for BirdNET via https://birdnet.cornell.edu/demo/, but the only recorder homebrew I found was PSP Audio Recorder (2006), and it runs as a kernel module, which is cumbersome (it can't run over PSPLINK, for example).
So I rewrote a similar recorder in Rust.
The Kernel Module Problem
The original C audio recorder uses a two-stage loader pattern typical of PSP kernel modules. A user-mode stub loads the actual kernel PRX:
PSP_MODULE_INFO("BOOT_PRX", 0x1000, 1, 1); // 0x1000 = kernel mode
int main_thread(SceSize args, void *argp) {
pspKernelSetKernelPC();
pspSdkInstallNoDeviceCheckPatch();
pspSdkInstallNoPlainModuleCheckPatch();
// ... load and start record.prx
mod = sceKernelLoadModule(path, 0, NULL);
mod = sceKernelStartModule(mod, ...);
}
That flow needs kernel privileges, CFW patch calls, and a separate PRX. Bad dev loop.
Good news: sceAudioInputBlocking works fine in user mode on modern CFW, so we can skip the kernel-module path.
A User-Mode Recorder in Rust
Rust implementation is straightforward: sceAudioInputInit, then sceAudioInputBlocking in a loop with 1024-sample chunks at 44.1 kHz, 16-bit mono:
fn record_chunk(&mut self) {
let mut buf = [0i16; 1024];
unsafe {
sceAudioInputBlocking(
buf.len() as i32,
self.get_input_freq(),
buf.as_mut_ptr() as *mut c_void,
);
}
self.samples.extend_from_slice(&buf);
}
Samples go into Vec<i16> (extern crate alloc), then we write WAV with a manual 44-byte header (RIFF/WAVE + PCM payload).
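The 44-byte header is the canonical PCM WAV preamble. Here's a C sketch of the same logic the Rust code implements (field offsets are from the standard RIFF/WAVE layout; this assumes a little-endian host, which matches the PSP's mipsel):

```c
#include <stdint.h>
#include <string.h>

/* Build the canonical 44-byte WAV header for 16-bit mono PCM.
 * data_len is the PCM payload size in bytes. Little-endian host assumed. */
static void wav_header(uint8_t h[44], uint32_t sample_rate, uint32_t data_len) {
    uint32_t riff_len = 36 + data_len;        /* file size minus 8 */
    uint32_t fmt_len  = 16;                   /* PCM fmt chunk size */
    uint16_t pcm = 1, channels = 1, bits = 16;
    uint16_t block_align = channels * bits / 8;
    uint32_t byte_rate   = sample_rate * block_align;
    memcpy(h +  0, "RIFF", 4);      memcpy(h +  4, &riff_len, 4);
    memcpy(h +  8, "WAVEfmt ", 8);  memcpy(h + 16, &fmt_len, 4);
    memcpy(h + 20, &pcm, 2);        memcpy(h + 22, &channels, 2);
    memcpy(h + 24, &sample_rate, 4); memcpy(h + 28, &byte_rate, 4);
    memcpy(h + 32, &block_align, 2); memcpy(h + 34, &bits, 2);
    memcpy(h + 36, "data", 4);      memcpy(h + 40, &data_len, 4);
}
```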
Offline-First Design
Practical design constraint: PSP won’t always be tethered over USB.
So recorder is offline-first:
- if host0: is available, write WAV directly to the host
- otherwise cache to ms0:/PSP/MUSIC/AUDREC/
- next time USB is connected, upload cached recordings
We detect connectivity by attempting to open host0:/ as a directory — if sceIoDopen returns a non-negative file descriptor, we’re connected:
fn is_host0_available() -> bool {
let fd = unsafe { sceIoDopen(b"host0:/\0".as_ptr()) };
if fd.0 >= 0 {
unsafe { sceIoDclose(fd) };
true
} else { false }
}
Phase 6: Can the PSP’s Mic Hear Birds?
Before writing the whole inference stack, first question was simple: is PSP mic quality even good enough for BirdNET?
I recorded a few bird songs from my living room (they’re pretty far away) and uploaded them to birdnet.cornell.edu/demo.
It worked. BirdNET picked up a Dark-eyed Junco (Junco hyemalis) from a PSP recording.
Validation: PSP mic quality is good enough for bird-song recognition. End-to-end is feasible; the rest is engineering.
Phase 7: Error Handling in no_std Rust
As the codebase grew, debugging got painful. PSP syscalls return negative i32 values like -2147418110, which is not super readable.
PSP Error Code Anatomy
PSP error codes follow 0x8XYYNNNN:
| Prefix | Category |
|---|---|
| 0x8000XXXX | Common/generic errors |
| 0x8001XXXX | POSIX errno (maps directly to C errno values) |
| 0x8002XXXX | Kernel errors |
| 0x8021XXXX | UMD errors |
0x8001XXXX maps directly to POSIX errno values (0x80010002 = ENOENT, 0x8001000D = EACCES, etc.).
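That mapping means the low 16 bits of a 0x8001XXXX code are a plain C errno value, which a decoder can exploit directly (illustrative helper, not SDK code):

```c
#include <stdint.h>

/* For 0x8001XXXX codes, the low 16 bits are the POSIX errno value.
 * Returns 0 for codes outside the POSIX class. */
static int psp_posix_errno(uint32_t code) {
    if ((code & 0xFFFF0000u) != 0x80010000u) return 0;
    return (int)(code & 0xFFFFu);
}
```

So 0x80010002 decodes to errno 2 (ENOENT) and 0x8001000D to 13 (EACCES).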
Why Not Just Use strerror()?
First instinct was to call newlib strerror() via FFI:
#[link(name = "c")]
extern "C" {
fn strerror(errnum: c_int) -> *const c_char;
}
Doesn’t work. rust-psp is pure no_std and doesn’t link newlib/libc.
When force-linking libc.a:
- linker couldn’t find it without explicit search path
- once found, ABI/binding errors showed up (symbol has invalid binding: 0)
So practically: not worth invasive rust-psp build changes just for error strings.
A Manual Lookup Table
Practical solution is a manual lookup table:
fn psp_strerror(code: i32) -> &'static str {
let code = code as u32;
match code {
0x80010002 => "ENOENT: No such file or directory",
0x8001000D => "EACCES: Permission denied",
0x80000022 => "SCE_ERROR_OUT_OF_MEMORY",
// ... ~30 more entries
_ => match code & 0xFFFF0000 {
0x80010000 => "Unknown POSIX errno",
0x80020000 => "Unknown kernel error",
_ => "Unknown error",
}
}
}
Fallback masks upper 16 bits so unknown codes are still categorized by subsystem. Works in no_std with zero dependencies.
// Before:
dprintln!("Error: {}", fd.0);
// Error: -2147418110
// After:
dprintln!("Error: {:#010x} {}", fd.0 as u32, psp_strerror(fd.0));
// Error: 0x80010002 ENOENT: No such file or directory
Appendix: im2col Visualization
I had Gemini sketch an im2col visualizer, so I cleaned it up and embedded it here as an appendix.
Use the mode switch to compare direct convolution vs im2col/GEMM form, then step through the 16 positions.
[Interactive visualizer: a 2×2 kernel slides over a 5×5 image; each step computes one full receptive-field dot product and stores one value in the 4×4 output feature map.]
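For a non-interactive version of the same idea, here's a minimal im2col sketch for the 2×2-kernel-on-5×5-image case (my own toy code, using a different input image than the visualizer, so the numbers differ): each of the 16 positions becomes one 4-element row, and the whole convolution collapses to a 16×4 matrix times a 4-element kernel vector.

```c
enum { IN = 5, K = 2, OUT = IN - K + 1 };  /* 5x5 image, 2x2 kernel, 4x4 output */

/* Unroll every receptive field into one row of `cols`. */
static void im2col(const float *img, float cols[OUT * OUT][K * K]) {
    for (int y = 0; y < OUT; y++)
        for (int x = 0; x < OUT; x++)
            for (int ky = 0; ky < K; ky++)
                for (int kx = 0; kx < K; kx++)
                    cols[y * OUT + x][ky * K + kx] = img[(y + ky) * IN + (x + kx)];
}

/* One output value = one patch-kernel dot product. */
static float dot4(const float *a, const float *b) {
    return a[0]*b[0] + a[1]*b[1] + a[2]*b[2] + a[3]*b[3];
}
```

With a ramp image (pixel value = index), patch #0 is {0, 1, 5, 6}, so an all-ones kernel gives output[0][0] = 12.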
What I Learned
Main takeaways:
- constraints breed creativity (64 MB RAM, 333 MHz, 4×4 VFPU model)
- old hardware + modern compiler thinking actually works
- incremental de-risking was huge (audio validation before full inference work)
- rabbit holes were half the fun (NIDs, CRT/linking, VFPU quirks, PSP IO stack)
The most satisfying moment was uploading a WAV recorded on 2005 hardware and seeing BirdNET identify the bird.
The PSP mic works, the VFPU has headroom, and the IO budget is tight but feasible. Next step is obvious: build the full on-device inference runtime.
Built with a cross-compiler, a lot of objdump, and a PSP that still has battery life after 20 years.