How SQLRite stores rows on disk: pages, B-trees, and a diff-based pager

A walkthrough of SQLRite's on-disk format — 4 KiB pages, cell-encoded B-trees per table and index, and a write-ahead log that only persists the pages that actually changed.

The first time you write a database from scratch, the part you don't expect to spend the most time on is the file. Not the parser, not the optimizer, not joins — the file. The shape of the bytes on disk constrains every other decision you make for the rest of the project's life. Get it wrong early and every later feature is a tax.

This post is a walkthrough of how SQLRite currently stores rows on disk. It covers the four things that, together, are the whole storage subsystem:

  1. The page — a fixed-size unit on disk.
  2. The cell — how a single row is encoded into a page.
  3. The B-tree — how cells across many pages are organized.
  4. The pager and the WAL — how changes get from RAM to disk without corrupting anything when the lights go out.

If you are coming from "I read about B-trees once" and you want to see what the implementation actually looks like in a working embedded database, this is the post for you. The user-level surface — the SQL this engine accepts, the SDKs that wrap it — lives over in the docs; we'll only refer to it where it matters.

Pages

A SQLRite database is a single file. The file is divided into contiguous 4 KiB pages, numbered starting at 1. Page 0 is a header and Page 1 is the schema page; everything else is allocated on demand to tables and indexes.

Why 4 KiB?

A page in SQLRite has a header (page type, free byte count, slot directory size, and parent/right-sibling pointers if it's a B-tree node) and a body. The body grows from the bottom of the page upward as cells; a slot directory at the top of the body grows downward. This is the classic slotted page layout that most disk-resident databases settle on, for the same reason everyone settles on it: it makes variable-length rows easy to insert, delete, and reorder without rewriting the whole page.

Cells

A "cell" is what SQLRite calls one row encoded into bytes. The encoding is very simple by design:

What we deliberately do not do (yet) is the SQLite varint trick where each integer is encoded with the smallest number of bytes that will hold it. SQLite gets a real space win out of that — most integers in a real database are small. SQLRite uses fixed-width encoding for now because the win is modest at our scale and the debugging cost of a malformed varint is not modest. We will likely add it once the page format hits v5.

A row that's larger than will fit in a page — for example, a long TEXT blob or a 1024-dimensional VECTOR — spills into an overflow chain. The first part stays on the page, with a pointer to a continuation page. Continuations chain until the value ends. This is also the SQLite playbook; it is roughly the only known good way to keep a fixed page size while supporting arbitrarily large values.

B-trees

Every SQLRite table is a B-tree keyed on its INTEGER PRIMARY KEY (or an implicit row ID if the schema doesn't declare one). Every SQLRite index is a B-tree keyed on the indexed columns. The B-trees share the same on-disk node layout — leaf vs. internal — so the pager doesn't need to know whether it's looking at a table page or an index page. That's a small implementation win that pays off every time you go to write a new feature.

Two implementation decisions are worth calling out, because they are not the obvious ones:

Bottom-up rebuild on commit

Most database tutorials describe B-tree mutations as in-place edits: insert a key, find the right leaf, split the leaf if it overflows, propagate the split up the parent chain. SQLRite does this in memory during a transaction, against a snapshot of the in-memory state — but at commit time, instead of writing back the edited tree page-by-page, the entire B-tree is rebuilt bottom-up.

That sounds expensive. It mostly isn't. The leaf pages already exist in memory; rebuilding is a serialization pass and a couple of slot directory tweaks. The reason to do it this way is that bottom-up rebuild is correct by construction. Every page that gets persisted is a fresh, valid page, with no "did we update the parent before the crash?" failure mode. It also gives us a much simpler diffing pass for the WAL (more on that next).

The cost we are paying for this simplicity is O(N) rather than O(log N) per commit on a hot table. That ceiling will eventually need to come down — probably with an in-place split path — but it is a ceiling I'd rather raise after I have the rest of the system, not before.

Slot directory, not a sorted array

A leaf page's cells are not stored in key order in the page body. They're stored in arrival order. A separate slot directory lists cell offsets in key order. This means an insert never has to memmove the page body — only update the slot directory. It also means lookups are still O(log n_cells) on the slot directory.

Index pages use the same trick. Same wins.

The pager

The pager is the thing in the middle: it sits between the in-memory B-trees and the file. Its job is to:

That last clause is what the WAL is for.

The diff is the trick

A naïve pager writes every page that changed during the transaction, in order, and then fsyncs. That works. It is also a lot of writes. SQLRite's pager only writes pages whose bytes actually changed. During a write transaction, the pager hashes each candidate dirty page against its on-disk copy. If the hash matches, the page isn't dirty after all — somebody touched the in-memory copy but didn't change anything that affects the persisted bytes.

This is the diff-based pager. In practice, on workloads with read-modify-write cycles or UPDATE … WHERE queries that match nothing, it cuts disk writes dramatically. On a clean CREATE TABLE + INSERT path, it's a no-op — every page is genuinely new.

The WAL

The write-ahead log is a sidecar file: <db>.sqlrite-wal. It exists only while writes are in flight; it is truncated on a clean checkpoint.

Each commit appends one frame per dirty page. A frame is the page's full 4 KiB body, plus a small header carrying the page number and a per-frame checksum. The very last frame in a commit carries the new page-0 contents — the database header — and is marked as a commit frame.

On crash recovery, the rules are:

The contract is: a commit either landed in the WAL completely, or it didn't. There is no half-committed transaction.

Auto-checkpointing

A WAL that grows forever is a problem. SQLRite auto-checkpoints once the WAL crosses 100 frames. A checkpoint is straightforward:

  1. Acquire the writer lock (shared with the rest of the engine via fs2 advisory locks).
  2. For each dirty page in the WAL, in WAL order, write it to its home offset in the main file.
  3. fsync the main file.
  4. Truncate the WAL to zero bytes.
  5. Drop the lock.

Readers using shared locks can keep reading from the WAL while the checkpoint runs; they'll see the WAL's view of any pages that haven't been pushed to the main file yet.

Crash safety, in three sentences

Every page write is preceded by a WAL frame. Every commit terminates with a special frame that validates the entire batch. On reopen, the WAL is the source of truth until checkpointed.

That's it. That's the whole crash-safety story.

Putting it together

Here is what the path looks like when you do a single INSERT:

  1. Parse the SQL into an AST. Bind it to a typed InsertQuery.
  2. The executor takes the in-memory snapshot of the table (a BTree), inserts the row into a leaf, splits if needed, walks parents up.
  3. On COMMIT (or auto-commit, for a bare INSERT):
    • The pager finds the dirty pages of the rebuilt B-tree.
    • It diffs each one against its on-disk copy and discards no-op writes.
    • It encodes the survivors as WAL frames, computing checksums.
    • It writes all frames, then a final commit frame carrying the new page-0.
    • It fsyncs the WAL.
  4. The auto-checkpointer fires later if the WAL is long enough.

The contract that sits over all of this: a COMMIT returns success only after the commit frame is durably on disk. That's the hard part.

What I'd do differently

A few decisions I'd revisit:

But I'd keep the file format. The shape of bytes on disk is the hardest thing to change later, and the shape SQLRite settled on is boring in all the right ways.

If you want to read the actual code, the page and B-tree types live in src/sql/pager/ and the executor's commit path lives in src/sql/db/database.rs. The full file format spec is at docs/file-format.md.

The next post in the series is the most fun one to write so far: adding vector search to a SQLite-style database with HNSW. That's where embedded SQL stops looking like 1992 and starts looking like the next decade.

More on the philosophy behind these choices in the origin-story post; a head-to-head against rusqlite-bundled SQLite on storage-heavy workloads is in the benchmarks post. The transactions and persistence section of the docs covers the same ground from the user side.

If SQLRite has been useful to you, ⭐ the repo — visibility is the bottleneck.