Most projects online are finished.
Clean README. Green CI checks. Perfect screenshots. You rarely see the weeks where nothing works.
This project started because I wanted something difficult enough that I couldn’t fake understanding.
Web backends let me hide mistakes behind abstractions. Databases work. APIs respond. Frameworks absorb bad decisions.
A Game Boy emulator doesn’t care.
The CPU either jumps to the correct address or it doesn’t. The interrupt fires or it doesn’t. Cycle timing is right or a ROM breaks.
There is nowhere to hide.
The Core Loop
The entire emulator boils down to this function:
pub fn step(self: *GameBoy) void {
    const cycles = self.cpu.step(&self.mem);
    self.timer.step(cycles, &self.mem.io[0x0F]);
    self.audio.step(cycles);
    self.ppu.step(&self.mem, &self.framebuffer);
    self.input.step(&self.mem.io[0x0F]);
}
Run one CPU instruction. Figure out how many cycles it took. Then give every other subsystem the same number of cycles to stay synchronized. The timer advances. The audio channels produce samples. The PPU renders scanlines. Input checks for button presses.
Every part of the hardware runs in lockstep. If the CPU executes an instruction in 8 cycles, the timer gets exactly 8 cycles too, not 7, not 9.
That linear step() call is the only reason anything works. The Game Boy has no operating system, no scheduler, no interrupts that preempt the CPU in the way we understand them. Everything is cooperative, driven by this single tick.
[Figure: Emulator Core Loop, One Tick]
Break that timing contract and games crash.
The CPU: 512 Opcodes, Zero Abstraction
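The LR35902 CPU in the Game Boy is a hybrid of the Intel 8080 and the Zilog Z80: no IX/IY or shadow registers, but it keeps the Z80’s CB-prefixed bit operations and adds a few instructions of its own. It has 256 main opcode slots (a handful are unused) plus 256 CB-prefixed opcodes.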
Implementing them is mostly boring data movement. But the boring parts are where bugs hide.
// 8-bit loads: LD r, r'
0x7F => { self.a = self.a; }, // LD A, A (NOP in disguise)
0x78 => { self.a = self.b; }, // LD A, B
0x79 => { self.a = self.c; }, // LD A, C
0x7A => { self.a = self.d; }, // LD A, D
0x7B => { self.a = self.e; }, // LD A, E
0x7C => { self.a = self.h; }, // LD A, H
0x7D => { self.a = self.l; }, // LD A, L
0x7E => { self.a = mem.read(self.getHL()); }, // LD A, (HL)
This pattern repeats for every destination register — B, C, D, E, H, L, and (HL). It is one of the most mind-numbing blocks of code I have ever written. And one wrong mapping would make every game crash silently.
The interesting part isn’t the loads. It’s the flags.
The Flag Register That Broke My Week
The F register stores four flags: Zero (bit 7), Subtract (bit 6), Half-Carry (bit 5), Carry (bit 4). The lower nibble is always zero.
Getting Half-Carry wrong is a rite of passage.
fn add(self: *CPU, val: u8) void {
    const result = @as(u16, self.a) + @as(u16, val);
    // Parenthesize each mask: in Zig, + binds tighter than &
    self.setH(((self.a & 0x0F) + (val & 0x0F)) > 0x0F);
    self.setC(result > 0xFF);
    self.a = @truncate(result);
    self.setZ(self.a == 0);
    self.setN(false);
}
fn sub(self: *CPU, val: u8) void {
    const result: i16 = @as(i16, self.a) - @as(i16, val);
    // Borrow from bit 4: the low nibble of A is smaller than the operand's
    self.setH((@as(i16, self.a & 0x0F) - @as(i16, val & 0x0F)) < 0);
    self.setC(result < 0);
    self.a = @as(u8, @bitCast(@as(i8, @truncate(result))));
    self.setZ(self.a == 0);
    self.setN(true);
}
Half-Carry checks if the addition overflowed into bit 4 from bit 3. For add, you check if the sum of the lower nibbles exceeds 0x0F. For sub, you check if the lower nibble of A is less than the lower nibble of the operand — a borrow.
These three lines look trivial in hindsight. But the first time I wrote them, I used (self.a ^ val ^ result) & 0x10 — a different calculation I copied from a reference CPU implementation. It was correct for ADD but subtly wrong for ADC (add with carry). The carry bit shifts the overflow boundary.
A game wouldn’t crash immediately. It would run for 10,000 instructions, then a flag would be wrong, then a conditional jump would go the wrong way, then the wrong code would execute, and eventually you’d get a blank screen or a freeze.
Debugging that feels like being a historian reconstructing a war from a single coin. You know something happened. You have almost no evidence.
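To make the ADC case concrete, here is a minimal sketch of both half-carry formulas with the carry-in made explicit. The function names are hypothetical and not from the emulator. It shows that the XOR form is only safe when the result it sees already includes the carry-in, which is one plausible way the bug above creeps in (an assumption on my part, not something the commit history confirms).

```zig
const std = @import("std");

/// Half-carry for A + val + carry_in, computed on the low nibbles only.
/// `carry_in` is 0 for ADD and the current C flag for ADC.
pub fn halfCarryAdd(a: u8, val: u8, carry_in: u1) bool {
    const low_sum = @as(u16, a & 0x0F) + @as(u16, val & 0x0F) + @as(u16, carry_in);
    return low_sum > 0x0F;
}

/// XOR form: bit 4 of (a ^ val ^ result) is the carry into bit 4.
/// Only valid because `result` here includes the carry-in.
pub fn halfCarryXor(a: u8, val: u8, carry_in: u1) bool {
    const result = @as(u16, a) + @as(u16, val) + @as(u16, carry_in);
    return ((@as(u16, a) ^ @as(u16, val) ^ result) & 0x10) != 0;
}

test "carry-in alone can push the low nibble over" {
    try std.testing.expect(!halfCarryAdd(0x0F, 0x00, 0)); // ADD: 0x0F + 0x00
    try std.testing.expect(halfCarryAdd(0x0F, 0x00, 1)); // ADC: 0x0F + 0x00 + 1
}
```

If the XOR form is fed a result computed without the carry, the 0x0F + 0x00 + 1 case reports no half-carry, and nothing fails until a later conditional jump depends on it.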
The HALT Trap
Consider this tiny piece of the CPU step:
if (self.halted) {
    if (pending != 0) {
        self.halted = false;
    } else {
        return 4; // HALT: 4 cycles per step
    }
}
When the CPU executes a HALT instruction (0x76), it stops fetching instructions and burns cycles until an interrupt becomes pending. But there’s a hardware quirk, usually called the HALT bug: if interrupts are disabled (IME = 0) and an interrupt is already pending, the CPU doesn’t halt at all. It carries on, but fails to increment the program counter once, so the byte after HALT is read twice.
I had a bug where instr_timing.gb failed because my HALT implementation returned immediately instead of burning cycles. The test ROM expected exactly N * 4 cycles spent in HALT. I was returning 0. The serial output was correct — the right characters printed — but the timing test failed because the cycle count was off by hundreds.
That’s the Game Boy. Correct output isn’t enough. It has to be correct at the right time.
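The three HALT outcomes are easy to mix up, so here is a small sketch that classifies them. The enum and function names are made up for illustration; `pending` is assumed to be IE & IF & 0x1F, as in the dispatch code later on.

```zig
const std = @import("std");

/// What a HALT instruction leads to, given IME and the pending mask.
pub const HaltOutcome = enum {
    sleep, // nothing pending: stop fetching until an interrupt shows up
    wake_immediately, // IME = 1 and something pending: service it right away
    halt_bug, // IME = 0 and something pending: PC fails to advance once
};

pub fn haltOutcome(ime: bool, pending: u8) HaltOutcome {
    if (pending == 0) return .sleep;
    return if (ime) .wake_immediately else .halt_bug;
}

test "IME off with an interrupt pending triggers the bug" {
    try std.testing.expect(haltOutcome(false, 0x01) == .halt_bug);
}
```

Only the `sleep` case burns 4 cycles per step. The other two fall straight through to normal execution.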
Interrupts: The Best Kind of Spaghetti
Interrupt handling is where the CPU, memory, timer, and PPU all intersect:
const ie = mem.read(0xFFFF); // Interrupt Enable register
const if_reg = mem.read(0xFF0F); // Interrupt Flag register
const pending = ie & if_reg & 0x1F;
if (self.ime and pending != 0) {
    var ib: u3 = 0;
    while (ib < 5) : (ib += 1) {
        if ((pending >> ib) & 1 == 1) {
            self.ime = false;
            self.halted = false;
            mem.write(0xFF0F, if_reg & ~(@as(u8, 1) << ib));
            self.push(mem, self.pc);
            self.pc = switch (ib) {
                0 => 0x0040, // VBlank
                1 => 0x0048, // LCD Stat
                2 => 0x0050, // Timer
                3 => 0x0058, // Serial
                4 => 0x0060, // Joypad
                else => unreachable,
            };
            return 20; // Interrupt takes 20 cycles
        }
    }
}
Five interrupts, five vector addresses, one priority scheme. The lowest bit wins.
The IE register controls which interrupts are enabled at all. The IF register records which ones fired. The CPU checks every cycle: “Is anything set in both?”
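Since the vectors are spaced 8 bytes apart starting at 0x0040, the priority scan can also be written as a count-trailing-zeros. This is an alternative sketch, not the code the emulator uses, and `interruptVector` is a hypothetical name.

```zig
const std = @import("std");

/// Returns the vector of the highest-priority pending interrupt,
/// or null when nothing is both enabled and requested.
pub fn interruptVector(ie: u8, if_reg: u8) ?u16 {
    const pending = ie & if_reg & 0x1F;
    if (pending == 0) return null;
    const bit: u16 = @ctz(pending); // index of the lowest set bit
    return 0x0040 + bit * 8;
}

test "lowest bit wins" {
    // VBlank (bit 0) and Timer (bit 2) both requested and enabled
    try std.testing.expect(interruptVector(0xFF, 0x05).? == 0x0040);
}
```

The explicit switch in the dispatch code is arguably easier to read. The arithmetic version mostly serves as a check that the five vector addresses were typed correctly.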
I spent two days debugging why 01-special.gb was stuck at address 0x0038. That address isn’t even an interrupt vector — it’s a restart address used by RST $38. Turns out my interrupt dispatch was running during a DI (disable interrupts) sequence, and the enable/disable had a one-instruction delay that the Game Boy hardware respects.
The LR35902 has a quirk where EI (enable interrupts) takes effect after the next instruction. Not immediately. Not on the next cycle. After the next instruction.
0xFB => {
    self.ime = true; // Wrong! Should be delayed
},
The fix was to buffer the enable:
0xFB => {
    self.ime_scheduled = true; // Enable after next instruction
},
A single bit of state. Cost me two days.
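The other half of the fix is deciding when the scheduled bit gets promoted to the real IME. Here is a minimal sketch of one ordering that works, using a stripped-down stand-in for the CPU. The `Cpu` struct and its `step` signature are assumptions for illustration; the real step function obviously does far more.

```zig
const std = @import("std");

/// Stand-in CPU holding only the interrupt-enable state.
const Cpu = struct {
    ime: bool = false,
    ime_scheduled: bool = false,

    /// Runs one step for an already-fetched opcode. Returns true when an
    /// interrupt was dispatched instead of executing the instruction.
    pub fn step(self: *Cpu, opcode: u8, pending: u8) bool {
        if (self.ime and pending != 0) {
            self.ime = false;
            return true;
        }
        // Promote after the check above, so the instruction that follows
        // EI always gets to run before any interrupt is taken
        if (self.ime_scheduled) {
            self.ime = true;
            self.ime_scheduled = false;
        }
        switch (opcode) {
            0xFB => { // EI: takes effect one instruction later
                self.ime_scheduled = true;
            },
            0xF3 => { // DI: immediate, and cancels a pending EI
                self.ime = false;
                self.ime_scheduled = false;
            },
            else => {}, // every other opcode is ignored in this sketch
        }
        return false;
    }
};

test "interrupt waits for the instruction after EI" {
    var cpu = Cpu{};
    try std.testing.expect(!cpu.step(0xFB, 0x01)); // EI
    try std.testing.expect(!cpu.step(0x00, 0x01)); // the next instruction still runs
    try std.testing.expect(cpu.step(0x00, 0x01)); // now the interrupt is taken
}
```

Promoting before the opcode executes also means an EI followed directly by DI leaves interrupts off, which matches the hardware.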
The Memory Map: A Switch Statement That Shouldn’t Work
The Game Boy addresses 64KB of memory. Different regions map to different hardware:
| Address Range | Maps To |
|---|---|
| 0x0000 - 0x3FFF | ROM Bank 0 (16KB) |
| 0x4000 - 0x7FFF | Switchable ROM Bank (16KB) |
| 0x8000 - 0x9FFF | VRAM (tile data and tile maps) |
| 0xA000 - 0xBFFF | Cartridge RAM (battery-backed) |
| 0xC000 - 0xDFFF | Work RAM |
| 0xE000 - 0xFDFF | Echo RAM (mirror — don’t use) |
| 0xFE00 - 0xFE9F | OAM (sprite attributes) |
| 0xFEA0 - 0xFEFF | Not usable |
| 0xFF00 - 0xFF7F | I/O Registers |
| 0xFF80 - 0xFFFE | High RAM |
| 0xFFFF | Interrupt Enable |
My implementation is just a series of range checks:
pub fn read(self: *const Memory, addr: u16) u8 {
    if (addr < 0x8000) {
        if (self.cartridge) |cart| return cart.readRom(addr);
        return 0xFF;
    }
    if (addr >= 0x8000 and addr < 0xA000) return self.vram[addr - 0x8000];
    if (addr >= 0xA000 and addr < 0xC000) {
        if (self.cartridge) |cart| return cart.readRam(addr);
        return 0xFF;
    }
    if (addr >= 0xC000 and addr < 0xE000) return self.wram[addr - 0xC000];
    if (addr >= 0xE000 and addr < 0xFE00) return self.wram[addr - 0xE000];
    if (addr >= 0xFE00 and addr < 0xFEA0) return self.oam[addr - 0xFE00];
    if (addr >= 0xFF00 and addr < 0xFF80) {
        // I/O delegation to subsystems
        if (addr == 0xFF00) {
            if (self.input) |input| return input.read(addr);
        }
        if (addr >= 0xFF04 and addr <= 0xFF07) {
            if (self.timer) |timer| return timer.read(addr);
        }
        if (addr >= 0xFF10 and addr <= 0xFF3F) {
            if (self.audio) |audio| return audio.read(addr);
        }
        return self.io[addr - 0xFF00];
    }
    if (addr >= 0xFF80) return self.hram[addr - 0xFF80];
    return 0xFF;
}
The I/O region delegation is the part I’m most proud of. Instead of having the CPU know about every hardware register, the memory bus hands off reads/writes to the subsystem that owns that address range. The CPU doesn’t know what a timer is. It just reads and writes addresses.
But there’s a trap. The OAM DMA transfer at 0xFF46:
if (addr == 0xFF46) {
    const src_base = @as(u16, val) << 8;
    for (0..0xA0) |i| {
        self.oam[i] = self.read(src_base + @as(u16, @intCast(i)));
    }
    self.io[addr - 0xFF00] = val;
    return;
}
This copies 160 bytes from any page in ROM/RAM to the sprite attribute table. Games use it every frame. If the copy reads from ROM while the PPU is rendering, you get visual corruption. I spent a day wondering why sprites flickered in Pokémon, and it was because my DMA ran unconditionally instead of respecting the PPU mode.
The Timer: Hardware Is Always Running
The timer is deceptively simple — four registers, one internal counter:
pub const Timer = struct {
    div: u16 = 0,
    tima: u8 = 0,
    tma: u8 = 0,
    tac: u8 = 0,
    overflow_delay: u8 = 0,
};
But the behavior on overflow is where it gets weird:
pub fn step(self: *Timer, cycles: u8, if_reg: *u8) void {
    // Handle overflow delay, counted in cycles rather than in step() calls
    if (self.overflow_delay > 0) {
        if (cycles >= self.overflow_delay) {
            self.overflow_delay = 0;
            self.tima = self.tma; // Reload with modulo value
            if_reg.* |= 0x04; // Fire timer interrupt
        } else {
            self.overflow_delay -= cycles;
        }
    }
    // Advance internal counter
    const old_div = self.div;
    self.div +%= @as(u16, cycles);
    const timer_enabled = (self.tac & 0x04) != 0;
    if (timer_enabled) {
        // Truncate to u2 so the switch below is exhaustive
        const clock_select: u2 = @truncate(self.tac);
        const bit_pos: u4 = switch (clock_select) {
            0b00 => 9, // 4096 Hz
            0b01 => 3, // 262144 Hz
            0b10 => 5, // 65536 Hz
            0b11 => 7, // 16384 Hz
        };
        // Falling edge of the selected DIV bit. Only the endpoints of this
        // step are compared, so a step longer than the bit's period can miss an edge.
        const old_bit = (old_div >> bit_pos) & 1;
        const new_bit = (self.div >> bit_pos) & 1;
        if (old_bit == 1 and new_bit == 0) {
            if (self.tima == 0xFF) {
                self.tima = 0;
                self.overflow_delay = 4; // 4-cycle delay
            } else {
                self.tima +%= 1;
            }
        }
    }
}
When TIMA overflows (0xFF → 0x00), it doesn’t reload TMA immediately. There’s a 4-cycle delay where TIMA stays at 0 before the hardware copies TMA into it and fires the interrupt.
That delay exists in the real Game Boy. If you skip it, some games — specifically those that use the timer to synchronize music — will play at the wrong tempo. The overflow happens, the interrupt fires too early, the CPU jumps to the ISR too soon, and the audio buffer depletes.
I found this bug because instr_timing.gb had a timer test that failed by exactly 4 cycles.
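The frequencies in the switch comments aren’t arbitrary. They fall out of which bit of the internal counter is being watched. A quick sketch of the arithmetic, assuming the standard 4,194,304 Hz clock (the function name is mine):

```zig
const std = @import("std");

pub const CPU_CLOCK_HZ: u32 = 4_194_304;

/// Bit `bit_pos` of a counter that increments every cycle has one falling
/// edge every 2^(bit_pos + 1) cycles, so TIMA ticks at clock / 2^(bit_pos + 1).
pub fn timaFrequencyHz(bit_pos: u4) u32 {
    return CPU_CLOCK_HZ >> (@as(u5, bit_pos) + 1);
}

test "bit 9 gives the slowest TAC setting" {
    try std.testing.expect(timaFrequencyHz(9) == 4096);
}
```

Watching a bit instead of keeping a separate countdown is also why writes to DIV or TAC can cause a stray TIMA increment on real hardware: they can create a falling edge on their own.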
The PPU: Pixels Through a Firehose
The PPU is the most complex subsystem in the Game Boy. It has four modes: OAM scan (80 dots), pixel transfer (172 dots at minimum; real hardware stretches it depending on sprites and scrolling), HBlank (the rest of the line), and VBlank (10 scanlines of idle).
pub const PPU_MODE_HBLANK: u8 = 0;
pub const PPU_MODE_VBLANK: u8 = 1;
pub const PPU_MODE_OAM: u8 = 2;
pub const PPU_MODE_TRANSFER: u8 = 3;
pub const PPU_DOTS_MODE2: u16 = 80;
pub const PPU_DOTS_MODE3: u16 = 172;
pub const PPU_DOTS_LINE: u16 = 456;
pub const PPU_SCANLINES: u8 = 154;
pub const PPU_HEIGHT: u8 = 144;
Each scanline takes exactly 456 dots (cycles). Mode 2 uses 80, Mode 3 uses a fixed 172 in this implementation, the rest is HBlank. After 144 visible scanlines, VBlank starts and lasts for 10 scanlines.
Every dot, the PPU checks if it should trigger an LCD STAT interrupt. Every scanline, it checks if LY equals LYC for the coincidence flag.
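With those constants, the current mode is a pure function of the scanline and the dot within it. A minimal sketch, assuming the fixed 172-dot Mode 3 from the constants above (`modeAt` is a hypothetical helper, not the emulator’s own state machine):

```zig
const std = @import("std");

pub const PPU_MODE_HBLANK: u8 = 0;
pub const PPU_MODE_VBLANK: u8 = 1;
pub const PPU_MODE_OAM: u8 = 2;
pub const PPU_MODE_TRANSFER: u8 = 3;
pub const PPU_DOTS_MODE2: u16 = 80;
pub const PPU_DOTS_MODE3: u16 = 172;
pub const PPU_DOTS_LINE: u16 = 456;
pub const PPU_HEIGHT: u8 = 144;

/// Mode for scanline `ly` at `dot` (0..455) within that line.
pub fn modeAt(ly: u8, dot: u16) u8 {
    if (ly >= PPU_HEIGHT) return PPU_MODE_VBLANK;
    if (dot < PPU_DOTS_MODE2) return PPU_MODE_OAM;
    if (dot < PPU_DOTS_MODE2 + PPU_DOTS_MODE3) return PPU_MODE_TRANSFER;
    return PPU_MODE_HBLANK; // remaining 204 dots of the line
}

test "a visible line walks through modes 2, 3, 0" {
    try std.testing.expect(modeAt(0, 0) == PPU_MODE_OAM);
    try std.testing.expect(modeAt(0, 80) == PPU_MODE_TRANSFER);
    try std.testing.expect(modeAt(0, 252) == PPU_MODE_HBLANK);
}
```

A STAT interrupt check then reduces to comparing `modeAt` before and after advancing the dot counter.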
Rendering a single scanline involves reading the background tile map, looking up tile data from VRAM, applying palette colors, and blending with sprites:
fn renderScanline(self: *PPU, mem: *const Memory, fb: []u8) void {
    const ly = self.ly;
    if (ly >= PPU_HEIGHT) return;
    const ly_u16: u16 = ly;
    var bg_idx: [PPU_WIDTH]u2 = undefined;
    if (self.lcdc & LCDC_BG_ENABLE != 0) {
        const bg_map: u16 = if (self.lcdc & LCDC_BG_MAP != 0) 0x1C00 else 0x1800;
        for (0..PPU_WIDTH) |x_usize| {
            const x: u16 = @intCast(x_usize);
            const col: u16 = (x +% @as(u16, self.scx)) % 256;
            const row: u16 = (ly_u16 +% @as(u16, self.scy)) % 256;
            const tile_x = col / 8;
            const tile_y = row / 8;
            const map_addr = bg_map + tile_y * 32 + tile_x;
            const tile_num = mem.read(0x8000 + map_addr);
            // Tile data lookup — signed or unsigned addressing
            const tile_offset: u16 = if (self.lcdc & LCDC_ADDR_MODE != 0)
                @as(u16, tile_num) * 16
            else blk: {
                // Signed mode: offsets are relative to 0x9000 (0x8000 + 0x1000)
                const signed_num: i16 = @as(i16, @as(i8, @bitCast(tile_num)));
                break :blk @as(u16, @bitCast(@as(i16, 0x1000) + signed_num * 16));
            };
            // ... read tile pixel, index into palette, write to framebuffer
        }
    }
}
The tile addressing mode is the kind of hardware quirk that looks like a typo in the datasheet but is actually real. If LCDC & 0x10 is set, tiles are numbered 0-255 in the $8000 region. If it’s clear, tiles are signed (-128 to 127) in the $8800 region.
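A tile number of 0x80 means tile 128 in one mode and tile “-128” in the other. Signed tile numbers are offsets from $9000, so tile -128 sits at $9000 - $800 = $8800, which happens to be the same address unsigned tile 128 resolves to. Where the two modes really disagree is tile numbers 0x00-0x7F: $8000 in one, $9000 in the other. If you get the math wrong, tiles come from the wrong block of VRAM, and the background looks like an explosion at a puzzle factory.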
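Pulling the address math out into its own function makes it easy to test in isolation. This is a sketch with a hypothetical name and signature. It returns the absolute VRAM address of a tile’s first byte, with `unsigned_mode` standing in for the LCDC & 0x10 check.

```zig
const std = @import("std");

/// Absolute address of the first byte of a tile's 16 bytes of pixel data.
pub fn tileDataAddress(tile_num: u8, unsigned_mode: bool) u16 {
    if (unsigned_mode) return 0x8000 + @as(u16, tile_num) * 16;
    // Signed mode: reinterpret the byte as i8 and offset from 0x9000
    const signed_num: i32 = @as(i8, @bitCast(tile_num));
    return @intCast(0x9000 + signed_num * 16);
}

test "tile 0 is where the two modes disagree" {
    try std.testing.expect(tileDataAddress(0x00, true) == 0x8000);
    try std.testing.expect(tileDataAddress(0x00, false) == 0x9000);
}
```

Tiles 0x80 through 0xFF land in the shared $8800-$8FFF block in both modes, which is why a half-broken implementation can look right on some screens and scrambled on others.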
Cartridge MBC: Banking on Pointers
Game Boy cartridges can be larger than the CPU’s 64KB address space. The solution: memory bank controllers that switch ROM banks in and out.
MBC1 and MBC5 work differently. MBC1 has a “banking mode” bit that switches whether writes to 0x4000-0x5FFF control the upper ROM bank bits or the RAM bank. MBC5 uses a 9-bit ROM bank register.
pub const Cartridge = struct {
    rom_data: ?[]u8 = null,
    ram_data: ?[]u8 = null,
    cart_type: CartridgeType = .ROM_ONLY,
    rom_bank: u16 = 1,
    ram_bank: u8 = 0,
    banking_mode: u8 = 0,
    ram_enabled: bool = false,
};
MBC1 has a bug in its own design: in banking mode 1 (RAM banking mode), writing to the “upper ROM bits” register actually controls RAM bank selection. But the ROM bank upper bits get latched to 0, so ROM bank numbers above 0x1F become inaccessible. Games that use MBC1 with large ROMs have to switch banking modes to access the upper ROM banks.
I discovered this by loading porklike.gb (an MBC1 homebrew roguelike) and getting a black screen. It uses 2MB ROM with MBC1. I was getting the bank switching wrong because I treated the banking mode as affecting only RAM, when it actually changes whether certain writes control ROM or RAM.
The fix:
pub fn writeRom(self: *Cartridge, addr: u16, val: u8) void {
    switch (addr >> 12) {
        0x0...0x1 => { self.ram_enabled = (val & 0x0F) == 0x0A; },
        0x2...0x3 => {
            // MBC1 treats a write of 0 to the low 5 bits as bank 1
            var low: u16 = val & 0x1F;
            if (low == 0) low = 1;
            self.rom_bank = (self.rom_bank & 0x60) | low;
        },
        0x4...0x5 => {
            if (self.banking_mode == 0) {
                self.rom_bank = (self.rom_bank & 0x1F) | ((val & 0x03) << 5);
            } else {
                self.ram_bank = val & 0x03;
            }
        },
        0x6...0x7 => { self.banking_mode = val & 0x01; },
        else => {}, // 0x8000 and above are not MBC registers
    }
}
The 0x4-0x5 case is where the behavior diverges based on banking mode. If you treat it as always controlling ROM bank bits, large MBC1 carts break. If you treat it as always controlling RAM bank, small carts with RAM break.
That two-line if/else took me an evening of staring at hex dumps to get right.
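Once the bank number is right, turning a CPU address into an offset inside the ROM file is plain arithmetic. A minimal sketch follows. `romOffset` and its parameters are hypothetical, and wrapping the bank number by the bank count is an assumption about how out-of-range banks should behave, not something taken from the source.

```zig
const std = @import("std");

pub const ROM_BANK_SIZE: usize = 0x4000;

/// Maps a CPU address below 0x8000 to a byte offset in the ROM image.
/// `bank_count` must be non-zero.
pub fn romOffset(rom_bank: u16, bank_count: u16, addr: u16) usize {
    if (addr < 0x4000) return addr; // fixed bank 0 window
    const bank = rom_bank % bank_count; // assumed wrap for out-of-range banks
    return @as(usize, bank) * ROM_BANK_SIZE + (addr - 0x4000);
}

test "bank 2 starts 32KB into the file" {
    try std.testing.expect(romOffset(2, 128, 0x4000) == 0x8000);
}
```

For a 2MB MBC1 cart that is 128 banks, which is exactly the case where the upper two bank bits from the 0x4000-0x5FFF register start to matter.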
Audio: 4 Channels of Pain
The Game Boy has four audio channels: two square wave generators, a custom wave channel, and a noise generator. Each has its own set of I/O registers, frequency dividers, volume envelopes, and length timers.
pub const Audio = struct {
    ch1_sweep: u8 = 0,
    ch1_duty_len: u8 = 0,
    ch1_env: u8 = 0,
    ch1_freq_low: u8 = 0,
    ch1_freq_high: u8 = 0,
    ch1_enabled: bool = false,
    ch1_freq_timer: u16 = 0,
    ch1_phase: u8 = 0,
    // ... three more channels ...
};
Channel stepping is where the magic happens:
pub fn step(self: *Audio, cycles: u8) void {
    if (self.ch1_enabled) {
        self.ch1_freq_timer +%= @as(u16, cycles);
        const freq = ((@as(u16, self.ch1_freq_high) & 0x07) << 8) | self.ch1_freq_low;
        const period = (2048 - freq) * 4;
        if (self.ch1_freq_timer >= period) {
            self.ch1_freq_timer = 0;
            self.ch1_phase = (self.ch1_phase + 1) % 8;
        }
    }
    // Channel 2, 3, 4...
}
The frequency timer is just a counter that resets when it reaches (2048 - freq) * 4 cycles. Each time it resets, the phase advances, producing a square wave at the appropriate frequency.
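To sanity-check a register value against what you expect to hear, the period can be converted to an audible frequency: eight phase steps make one waveform, so the tone is clock / (period * 8). A small sketch, assuming the standard 4,194,304 Hz clock (the helper name is mine):

```zig
const std = @import("std");

/// Audible frequency in Hz of a square channel for an 11-bit `freq` value.
pub fn squareWaveHz(freq: u16) u32 {
    const period: u32 = (2048 - @as(u32, freq & 0x07FF)) * 4; // cycles per phase step
    return 4_194_304 / (period * 8); // 8 phase steps per full waveform
}

test "mid-range register value" {
    try std.testing.expect(squareWaveHz(1024) == 128);
}
```

This simplifies to 131072 / (2048 - freq), so the lowest note a square channel can produce is 64 Hz.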
The noise channel uses a Linear Feedback Shift Register (LFSR) to produce pseudo-random noise:
// LFSR step for noise channel
const xor_bit = (self.ch4_lfsr & 1) ^ ((self.ch4_lfsr >> 1) & 1);
self.ch4_lfsr = (self.ch4_lfsr >> 1) | (xor_bit << 14);
if (self.ch4_poly & 0x08 != 0) {
    // Also feedback to bit 6 (narrow mode)
    self.ch4_lfsr = (self.ch4_lfsr & ~@as(u16, 0x40)) | (xor_bit << 6);
}
This is a 15-bit LFSR with optional feedback narrowing. The percussion sounds in Game Boy games come from this circuit. If you get the feedback polynomial wrong, drums sound like static.
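The same step written as a pure function is easier to poke at from a test. This sketch mirrors the snippet above, with `narrow` standing in for the `ch4_poly & 0x08` check; the function name is hypothetical.

```zig
const std = @import("std");

/// One LFSR step: XOR bits 0 and 1, shift right, feed the result into
/// bit 14 (and also bit 6 in narrow mode).
pub fn stepLfsr(lfsr: u16, narrow: bool) u16 {
    const xor_bit = (lfsr & 1) ^ ((lfsr >> 1) & 1);
    var next = (lfsr >> 1) | (xor_bit << 14);
    if (narrow) {
        next = (next & ~@as(u16, 0x40)) | (xor_bit << 6);
    }
    return next;
}

test "a lone low bit feeds back into bit 14" {
    try std.testing.expect(stepLfsr(0x0001, false) == 0x4000);
    try std.testing.expect(stepLfsr(0x0001, true) == 0x4040);
}
```

One thing the function makes obvious: an all-zero register stays zero forever, so the channel needs a non-zero seed when it is triggered.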
I currently have all four channels implemented, but the audio still isn’t piped to SDL2 properly. The internal state machines are correct — verified against test ROMs — but the output mixing and buffer scheduling aren’t done. It plays in silence. That’s next.
The C to Zig Rewrite: What Changed
The project started in C. I got about as far as the initial CPU loop and memory map before I hit a wall of “I don’t know what this pointer points to anymore.” Zig’s compile-time checks caught issues that C just let slide:
// Zig: the pointer types in this signature are checked at compile time
pub fn step(self: *CPU, mem: *Memory) u8 { ... }
// C equivalent: void* everywhere, pray it works
uint8_t cpu_step(CPU* cpu, void* mem) { ... }
The ?* optional pointer type (nullable equivalent) made memory delegation safe:
timer: ?*Timer = null, // NULL until a ROM is loaded
audio: ?*Audio = null,
input: ?*Input = null,
Every access requires an explicit unwrap:
if (self.timer) |timer| {
    return timer.read(addr);
}
No more dereferencing a null pointer because you forgot to initialize a subsystem.
Comptime (compile-time evaluation) also let me do things I couldn’t in C:
pub const FRAMEBUFFER_SIZE = @as(usize, PPU_WIDTH) * PPU_HEIGHT * 4;
// Evaluated at compile time. No magic numbers.
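Comptime also lets the build itself check that arithmetic. A minimal sketch, assuming the Game Boy’s 160x144 screen for the two dimension constants and 4 bytes per pixel as in the expression above; the `framebuffer` variable is only there to show the constant sizing an array.

```zig
const std = @import("std");

pub const PPU_WIDTH: u8 = 160;
pub const PPU_HEIGHT: u8 = 144;
pub const FRAMEBUFFER_SIZE = @as(usize, PPU_WIDTH) * PPU_HEIGHT * 4;

comptime {
    // Fails the build, not the run, if the constants ever drift apart
    if (FRAMEBUFFER_SIZE != 92160) @compileError("unexpected framebuffer size");
}

pub var framebuffer: [FRAMEBUFFER_SIZE]u8 = [_]u8{0} ** FRAMEBUFFER_SIZE;

test "framebuffer length comes from the constants" {
    try std.testing.expect(framebuffer.len == 160 * 144 * 4);
}
```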
And the function-based register pair access eliminated entire categories of bugs:
pub fn getBC(self: CPU) u16 {
    return (@as(u16, self.b) << 8) | @as(u16, self.c);
}
pub fn setBC(self: *CPU, v: u16) void {
    self.b = @truncate(v >> 8);
    self.c = @truncate(v & 0xFF);
}
These are function calls. In C they’d be macros. Macros don’t check types. These do.
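The AF pair is the one place where the accessor has to do more than glue bytes together, because the lower nibble of F is always zero. A sketch of what that could look like, using a cut-down stand-in struct rather than the real CPU:

```zig
const std = @import("std");

/// Stand-in holding only the two registers this sketch needs.
const Cpu = struct {
    a: u8 = 0,
    f: u8 = 0,

    pub fn getAF(self: Cpu) u16 {
        return (@as(u16, self.a) << 8) | @as(u16, self.f);
    }

    pub fn setAF(self: *Cpu, v: u16) void {
        self.a = @truncate(v >> 8);
        // The low nibble of F does not exist in hardware, so mask it off
        self.f = @as(u8, @truncate(v)) & 0xF0;
    }
};

test "POP AF style write cannot set the low nibble of F" {
    var cpu = Cpu{};
    cpu.setAF(0x12FF);
    try std.testing.expect(cpu.getAF() == 0x12F0);
}
```

Putting the mask inside the setter means POP AF gets it for free, instead of relying on every caller to remember.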
But the rewrite didn’t make the emulator work. It made the code safer, which meant I could spend debugging time on actual emulation bugs instead of memory corruption. The half-carry flag was still wrong in Zig. The interrupt timing was still off.
The language changed the bugs, not the complexity.
What the Commit Graph Looks Like
d74b0df Day 5: SDL2 platform layer - window, event loop, green screen render
825c58d Rewrite entire project in Zig
74b8229 feat: add timer subsystem with DIV, TIMA, TMA, TAC registers
7f2c533 feat: add audio subsystem with 4 Game Boy sound channels
9eeca0a feat: add input subsystem with JOY register mapping
d038fa9 feat(cartridge): parse ROM/RAM size from cartridge header
efadf0d feat(cartridge): implement MBC1/MBC5 ROM banking and external RAM
dbf8bc5 feat(cpu): implement full GB instruction set with cycle counting
6ed3419 refactor(memory): delegate IO regions to subsystems
7e1490e refactor(gb): integrate timer, audio, input, cartridge into step loop
4183774 feat(main): add SDL2 keyboard input mapping for Game Boy controls
9dae4de refactor(ppu): clean up debug output and remove duplicate step logic
7dd7980 feat(memory): implement OAM DMA transfer (0xFF46)
9539895 feat(ppu): add sprite/window rendering and register sync
Look at that progression. CPU, memory map, bus delegation, timer, audio, input, cartridge banking, PPU rendering, DMA. Each commit adds one subsystem, and each subsystem breaks the ones before it. The timer needs interrupts. Interrupts need the CPU to support them. The CPU needs to respect IE/IF registers. The memory map needs to route I/O reads to the right subsystem.
You can’t build it all at once. You also can’t build it in isolation.
The Uncomfortable Truth
I learn slower when I build emulators than when I build web apps. But I learn better.
Web development gives you rapid feedback. Change a line, see the result. Emulator development gives you delayed feedback across a 50,000-instruction gap between cause and effect.
Most debugging in low-level systems isn’t fixing code. It’s proving your assumptions are false.
I assumed the half-carry flag was correct because a test passed. It was. But only for ADD. Not for ADC. Not for SBC. The test suite didn’t cover those cases. The test was wrong. No, the test was incomplete. There’s a difference.
I assumed the timer overflow delay was a quirk I could ignore. instr_timing.gb told me I couldn’t.
I assumed the C to Zig rewrite would solve problems. It changed them.
The emulator still isn’t finished.
Some ROMs fail. Timing is still imperfect. Audio isn’t wired to the speaker.
That’s fine.
I stopped seeing unfinished software as failure.
Unfinished projects are often where the learning lives.
The emulator works enough to prove something: I know more than when I started. And I know enough now to understand how much I still don’t know.
The full source is at github.com/dev-dami/gbemu if you want to see the carnage. Current state: 256 main opcodes + 256 CB opcodes implemented, MBC1/MBC5 banking, full memory map with I/O delegation, scanline-based PPU with background/window/sprite rendering, timer with overflow delay, 4-channel audio (internal state only), OAM DMA, and SDL2 input mapping.