SQLite serializes every writer through a single exclusive lock. The
file-level PENDING/EXCLUSIVE mode is the design choice users hit
first when they scale — two unrelated writers touching disjoint rows
still wait on each other because the lock is page- or file-granularity,
not row-granularity. For workloads where most writes don't actually
conflict, that's throughput left on the table.
SQLRite v0.10.0 lifts that ceiling. The headline shape is straight out of the Hekaton paper and Turso's experimental MVCC, narrowed for SQLRite's single-process scope:
PRAGMA journal_mode = mvcc; -- once per database
BEGIN CONCURRENT;
UPDATE accounts SET balance = balance - 50 WHERE id = 1;
UPDATE accounts SET balance = balance + 50 WHERE id = 2;
COMMIT; -- may return Busy → caller retriesTwo writers on disjoint rows now commit in parallel. Two writers on
the same row see the second commit fail fast with
SQLRiteError::Busy, which the caller retries with a fresh
BEGIN CONCURRENT. The data structure backing this is a per-row
in-memory version chain (MvStore) sitting in front of the existing
pager; the on-disk format is unchanged.
This post walks through the engineering — how the version chain works, what "snapshot isolation" actually means here, why durability needed a new WAL frame kind, how the SDKs got involved, and which parts we deliberately punted on. If you've been reading SQLRite's architecture docs this is the part Phase 11 added; if you haven't, this is a tour of what optimistic MVCC looks like when you build it from scratch in about 4 weeks of focused work.
What "concurrent writes" actually means
The phrase is overloaded. There are at least four things people mean by "concurrent writes in a database":
- Multiple writers, same process. Two threads inside one app each running a write transaction.
- Multiple writers, same machine, different processes. Two instances of a daemon writing to the same file.
- Snapshot-isolated reads. A read transaction sees a consistent point-in-time view of the database, even while writes happen.
- Row-level conflict detection. If two writers touch unrelated rows, neither blocks; if they touch the same row, exactly one wins.
v0.10.0 ships 1, 3, and 4 end-to-end. 2 stays
out-of-scope by design — multi-process MVCC would need a shared-
memory coordination file the way SQLite's WAL does for read marks,
and the marginal payoff is small enough that "use sibling connection
handles inside one process" is the documented escape hatch. Today's
file-level flock(LOCK_EX) still serializes between processes; the
new story is within a process.
That distinction matters for which Rust patterns make sense. If
you're inside one process you can share Arc<Mutex<Database>>
between threads. If you're across processes you need a coordination
medium more complicated than a Mutex, and the engineering bill
goes up by an order of magnitude. SQLRite picks intra-process and
moves on.
The version chain in 30 seconds
For every row that's been touched under BEGIN CONCURRENT, the
engine holds an ordered chain of RowVersions in memory:
begin=ts1 begin=ts3 begin=ts7
end=Some(ts3) end=Some(ts7) end=None
┌────────────┐ ┌────────────┐ ┌────────────┐
rowid 42 ─→ │ balance=100│ ──next──→ │ balance=150│ ──next──→ │ Tombstone │
│ │ │ │ │ (DELETE) │
└────────────┘ └────────────┘ └────────────┘A version is visible to a transaction with begin-timestamp T
when begin <= T < end. That's the textbook snapshot-isolation
visibility rule. New writes push a new head onto the chain at commit
time, capping the previous latest version's end to the new
commit_ts.
Timestamps come from a process-wide logical clock (MvccClock), an
AtomicU64 that hands out begin_ts at BEGIN CONCURRENT and
commit_ts at the start of validation. The clock's high-water mark
is persisted in the WAL header, so a process restart doesn't reuse
timestamps — important because the visibility rule (begin <= T < end)
would mis-classify versions otherwise.
Commit-time validation, not lock-time pessimism
The interesting choice happens at COMMIT, not at BEGIN. Two
writers issue BEGIN CONCURRENT concurrently and neither blocks —
they both proceed against their own private snapshot of the
database. The conflict, if any, is decided when one of them tries
to commit:
- Allocate a
commit_tsfrom the clock. - Walk the write-set. For each
(table, rowid), check whether any committed version'sbegin > tx.begin_ts. If yes, someone else superseded us → returnSQLRiteError::Busy. The transaction is dropped server-side; the caller retries with a freshBEGIN CONCURRENT. - Otherwise, push a new
RowVersiononto each row's chain atcommit_ts, capping the previous latest'send. - Append an
MvccCommitBatchframe to the WAL. - Mirror the writes into the legacy
Database::tablesso the non-concurrent read path stays correct after commit. - Drop the transaction's handle and sweep the write-set's chains for GC.
This is optimistic concurrency control — we don't pay for locks upfront; we pay only when conflicts actually happen. For workloads where most writes are disjoint, that's a strict throughput win over SQLite's "every writer waits for the writer lock" model. For workloads where everyone fights over the same handful of rows, the retry loop is doing the work — and a careful caller might prefer to pre-serialize at the app layer to skip the retry tax.
The plan was always to ship both shapes and let workloads pick.
The retry loop is the whole API
The shape is the same in every language. Here's Rust:
use sqlrite::{Connection, SQLRiteError};
let mut conn = Connection::open("orders.sqlrite")?;
conn.execute("PRAGMA journal_mode = mvcc")?;
loop {
conn.execute("BEGIN CONCURRENT")?;
conn.execute("INSERT INTO orders (id, customer, total) VALUES (1, 'alice', 100)")?;
conn.execute("UPDATE inventory SET stock = stock - 1 WHERE sku = 'WIDGET-A'")?;
match conn.execute("COMMIT") {
Ok(_) => break,
Err(e) if e.is_retryable() => {
conn.execute("ROLLBACK").ok();
continue;
}
Err(e) => return Err(e.into()),
}
}
# Ok::<(), sqlrite::SQLRiteError>(())SQLRiteError::is_retryable() covers both Busy (write-write
conflict at commit) and BusySnapshot (the snapshot the read path
expected has been GC'd). Every SDK surfaces an equivalent
classifier — errors.Is(err, sqlrite.ErrBusy) in Go,
sqlrite.BusyError (subclass of SQLRiteError) in Python,
errorKind(err) === 'Busy' in Node, sqlrite_status_is_retryable
in the C FFI. None of them ship an automatic backoff: the right
policy (immediate retry vs. exponential vs. capped attempts vs.
jittered) depends on the workload, and forcing one would just mean
every caller has to fight the default.
A full runnable example lives at
examples/rust/concurrent_writers.rs.
Two sibling Connections, interleaved BEGIN CONCURRENTs, the
disjoint-row happy path plus the same-row retry. Mostly under 80
lines.
Sibling connections — the SDK plumbing
A single Connection::open is the only call that touches the
file. Additional handles come from Connection::connect(), which
mints a sibling sharing the same Arc<Mutex<Database>>. Every
sibling can hold its own independent BEGIN CONCURRENT — that's
the whole point of multi-handle MVCC.
Every SDK now exposes this:
| SDK | Mint a sibling |
|---|---|
| Rust | let b = primary.connect(); |
| C FFI | sqlrite_connect_sibling(existing, out) |
| Python | conn.connect() |
| Node.js | db.connect() |
| Go | All sql.Open("sqlrite", path) calls for the same canonical path automatically share state through a process-level path registry |
The Go case is the one that ate the most time. Go's
database/sql pool calls driver.Open whenever it wants another
connection slot, and a second sqlrite_open for the same path
would deadlock against the first one's flock(LOCK_EX). The fix is
a tiny in-process registry keyed by filepath.Abs(name): the first
opener pays for a real engine connection, subsequent openers (within
the same pool or across separate *sql.DB instances) mint
siblings off a hidden primary. The registry refcounts; the last
sibling out closes the primary. It's about 80 lines of Go and it
makes the existing errors.Is(err, sqlrite.ErrBusy) machinery
actually exercisable from real Go code.
Durability needed a new WAL frame kind
Phase 4's WAL was per-page: every commit appended frames for
modified pages plus a final commit-barrier frame with the new page
count. That's perfect for the legacy single-writer path — COMMIT
fsyncs the barrier frame and the transaction is durable.
The MVCC commit path mirrors writes into Database::tables so the
legacy save still happens, so technically the visible row state is
durable through the existing machinery. But the MvStore itself —
the version chain that powers conflict detection — lives only in
memory. Without persistence the conflict-detection window doesn't
survive a process restart: a second process could legitimately hand
out a begin_ts below an already-committed version's end, and
the visibility rule would mis-classify one side.
Phase 11.9 closes that gap with a typed MvccCommitBatch frame,
distinguished from page frames by the sentinel page_num = u32::MAX
(real page numbers are bounded by file size; no collision risk).
The frame body encodes the commit timestamp plus a record stream of
the resolved write-set:
┌────────┬────────┬─────────────────────────────────────────────────┐
│ offset │ length │ content │
├────────┼────────┼─────────────────────────────────────────────────┤
│ 0 │ 8 │ magic "MVCC0001" │
│ 8 │ 8 │ commit_ts (u64 LE) │
│ 16 │ 2 │ record count (u16 LE) │
│ 18 │ var. │ per-record: op tag, table name, rowid, payload │
│ ... │ ... │ zero-padded to PAGE_SIZE │
└────────┴────────┴─────────────────────────────────────────────────┘The frame is appended without its own fsync — the very next legacy
commit frame from the same save_database is fsync'd, and that
flushes everything in between. So a single fsync covers both the
MVCC frame and the page-level updates. A crash between the two
appends drops both — torn-write atomicity for the whole transaction,
the same property the per-page WAL already had.
On reopen, the WAL replay walks every MVCC frame and re-pushes the
versions into MvStore via the same push_committed the live
commit path uses. The MvccClock is seeded past
max(WAL header clock_high_water, max(commit_ts in replayed frames)) —
the max is what keeps things correct between checkpoints, since the
header is only fsync'd at checkpoint time and the frame timestamps
are durable on every commit.
WAL format goes v1 → v3 (v2 added the clock high-water; v3 added the MVCC frame marker). Decoders accept all three, so v0.10.0 reads v0.9.1's files unchanged.
The REPL is the demo
sqlrite, the REPL binary, used to hold a single &mut Database.
v0.10.0 lifts it to Vec<Connection> so users can mint sibling
handles in-session. The prompt always shows the active handle:
sqlrite[A]> PRAGMA journal_mode = mvcc;
sqlrite[A]> CREATE TABLE t (id INTEGER PRIMARY KEY, v INTEGER);
sqlrite[A]> INSERT INTO t (id, v) VALUES (1, 0);
sqlrite[A]> .spawn
Spawned sibling handle 'B' and switched to it. 2 handles open.
sqlrite[B]> .use A
sqlrite[A]> BEGIN CONCURRENT;
sqlrite[A]> UPDATE t SET v = 100 WHERE id = 1;
sqlrite[A]> .conns
2 handle(s):
* A (BEGIN CONCURRENT)
B
sqlrite[A]> .use B
sqlrite[B]> BEGIN CONCURRENT;
sqlrite[B]> UPDATE t SET v = 200 WHERE id = 1;
sqlrite[B]> COMMIT;
sqlrite[B]> .use A
sqlrite[A]> COMMIT;
An error occured: Busy: write-write conflict on t/1: another transaction
committed this row at ts=3 (after our begin_ts=1); transaction rolled
back, retry with a fresh BEGIN CONCURRENT
sqlrite[A]> .use B
sqlrite[B]> SELECT * FROM t;
+----+-----+
| id | v |
+----+-----+
| 1 | 200 |
+----+-----+.spawn mints a sibling. .use NAME switches the active handle.
.conns lists every handle, marks the active one, and flags any
holding an open BEGIN CONCURRENT. The whole multi-handle MVCC
story is reachable from a single binary, no external orchestration,
no Docker compose.
What we left out, on purpose
Three things stayed deliberately out of v0.10.0:
Indexes under MVCC — Turso explicitly punted on this in their
own MVCC work, and we did too. Each secondary-index entry under
MVCC would need its own RowVersion, keyed by (index_id, key, rowid) —
one version chain per indexed (column, row) pair. The memory and
GC costs are non-trivial. The engine currently rejects CREATE INDEX
while journal_mode = mvcc; with a typed error. We'll tackle
indexes-under-MVCC as its own follow-up phase once the v0 is stable.
Checkpoint drain — The checkpointer doesn't yet fold MvStore
versions back into pager-level updates. As a result,
set_journal_mode(Mvcc → Wal) is rejected if the store carries any
committed versions (would silently strand them). The MVCC frames in
the WAL still provide durability, and the per-commit GC bounds
memory growth for normal workloads; but a clean Mvcc → Wal downgrade
is parked.
Cross-process MVCC — Mentioned earlier. The in-memory MvStore
has no cross-process visibility; multi-process writers still
serialize through flock(LOCK_EX). SQLite's WAL coordination uses
a shared-memory file for read marks; we could go there, but the
intra-process story covers the workloads we actually care about.
All three are tracked in the repo's roadmap and as separate work items.
What it took
Phase 11 was ten merged sub-phases plus a docs sweep:
- 11.1 —
Connectionbecomes a thin handle overArc<Mutex<Database>> - 11.2 — Logical clock + active-tx registry; WAL header v1 → v2
- 11.3 —
MvStoreskeleton +PRAGMA journal_modeopt-in - 11.4 —
BEGIN CONCURRENTwrites + commit-time validation - 11.5 — Snapshot-isolated reads via
Statement::query - 11.6 — Per-commit GC +
Connection::vacuum_mvcc - 11.7 — SDK propagation of Busy/BusySnapshot across C, Python, Node, Go
- 11.8 — Sibling connection handles in the FFI and Python/Node bindings
- 11.9 — WAL log-record durability + crash recovery; WAL format v3
- 11.11a — REPL
.spawn/.use/.conns - 11.11b — New
W13bench workload (4 workers × 50 BEGIN/UPDATE/COMMIT) - 11.11c — Go SDK cross-pool sibling path registry
- 11.12 — Canonical
docs/concurrent-writes.md- worked example + roadmap cleanup
The full design rationale lives in
docs/concurrent-writes-plan.md;
the user-facing reference is docs/concurrent-writes.md.
Each sub-phase was one PR, one review, one merge. Phase numbering is
real and the roadmap is the single source of truth — it's how you
keep "MVCC" from sprawling from an estimate into an engineering
sinkhole.
Try it
# Rust
cargo add sqlrite-engine # v0.10.0
# Python
pip install sqlrite # v0.10.0
# Node
npm install @joaoh82/sqlrite # v0.10.0
# Go
go get github.com/joaoh82/rust_sqlite/sdk/go@latest
# REPL
cargo install sqlrite-engine
sqlrite some/path/to/db.sqlriteThen PRAGMA journal_mode = mvcc;, BEGIN CONCURRENT;, and you're
in. The canonical reference is at
docs/concurrent-writes.md;
the worked retry-loop example is at
examples/rust/concurrent_writers.rs;
the design rationale is at
docs/concurrent-writes-plan.md.
If you build something on top of it, I want to hear about it — open an issue, join the Discord, or just publish a post. SQLRite's whole premise is "implement the parts of SQLite that matter, in the open, so the codebase is the textbook." Phase 11 was the chapter on MVCC. Whatever you build with it teaches the rest.