Manifests

KalamDB uses a manifest as the authoritative index for cold Parquet segments. Every manifest is owned by the manifest service, which decides where the current copy lives, when it is dirty, and when the manifest.json file in storage is rewritten.

Manifests exist behind the hot/cold flow described in /docs/server/architecture/storage-tiers. They apply to USER and SHARED cold-storage scopes; see /docs/server/architecture/table-types for the table-type boundary.

Manifest Tiers

Table type	Hot manifest path	Cold manifest path
`SHARED`	process memory -> `system.manifest` in RocksDB	`manifest.json` in table storage
`USER`	`system.manifest` in RocksDB	`manifest.json` in the user storage path

This split is intentional:

SHARED manifests stay in process memory because they are low-cardinality and widely reused.
USER manifests stay out of process memory so millions of user-scoped manifests do not expand server RSS.
Both table types still persist their durable hot copy in RocksDB before storage writes.

The canonical read path is ManifestService::get_or_load():

check the fastest hot tier for the scope
fall back to RocksDB manifest copy
fall back to storage manifest.json
hydrate faster tiers above the answering layer
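
The lookup order above can be sketched as a small tiered cache. This is a minimal illustration, not KalamDB's implementation: `TieredManifests`, `Scope`, `Tier`, and the `HashMap` stand-ins for process memory, RocksDB, and storage are hypothetical names, and the manifest is treated as an opaque string.

```rust
use std::collections::HashMap;

/// Hypothetical scope kinds. Only SHARED manifests are cached in process memory.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Scope {
    Shared,
    User,
}

/// Which layer answered the lookup.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Tier {
    Memory,
    RocksDb,
    Storage,
}

/// Illustrative stand-ins: each tier maps a scope key to manifest contents.
#[derive(Default)]
pub struct TieredManifests {
    pub memory: HashMap<String, String>,
    pub rocksdb: HashMap<String, String>,
    pub storage: HashMap<String, String>,
}

impl TieredManifests {
    /// Returns the manifest and the tier that answered, hydrating faster tiers.
    pub fn get_or_load(&mut self, scope: Scope, key: &str) -> Option<(String, Tier)> {
        // 1. Fastest hot tier: process memory, used for SHARED scopes only.
        if scope == Scope::Shared {
            if let Some(m) = self.memory.get(key) {
                return Some((m.clone(), Tier::Memory));
            }
        }
        // 2. Fall back to the RocksDB copy, then hydrate memory for SHARED.
        if let Some(m) = self.rocksdb.get(key).cloned() {
            if scope == Scope::Shared {
                self.memory.insert(key.to_string(), m.clone());
            }
            return Some((m, Tier::RocksDb));
        }
        // 3. Fall back to the stored manifest.json, then hydrate every faster tier.
        let m = self.storage.get(key).cloned()?;
        self.rocksdb.insert(key.to_string(), m.clone());
        if scope == Scope::Shared {
            self.memory.insert(key.to_string(), m.clone());
        }
        Some((m, Tier::Storage))
    }
}
```

In this sketch a USER lookup never touches the memory map, which mirrors the reason given above for keeping user-scoped manifests out of process memory.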

Control Flow

What Happens On Normal Writes

For normal DML and flush-preparation updates, KalamDB does not rewrite manifest.json on every change.

The write path updates hot data in RocksDB.
The manifest service updates manifest metadata in the hot tier.
The manifest is marked pending_write.
Cold storage is updated later by the flush path.

That means manifest bookkeeping stays cheap on the write path while still keeping the latest authoritative manifest state available for reads, flush planning, and primary-key checks.

The hot manifest entry also tracks a sync state:

in_sync
pending_write
syncing
stale
error

During a flush, the scope becomes syncing while the Parquet temp file is being written.
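
The sync states can be pictured as a small state machine. The sketch below is illustrative only: `HotEntry` and its methods are hypothetical, and the transitions out of `Syncing` (to `InSync` on success, `Error` on failure) are assumptions, since the text above only states when an entry becomes `pending_write` and `syncing`. The `stale` state is listed but not modeled.

```rust
/// The five sync states listed above.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum SyncState {
    InSync,
    PendingWrite,
    Syncing,
    Stale,
    Error,
}

/// Hypothetical hot-tier entry that carries only the sync state.
#[derive(Debug)]
pub struct HotEntry {
    pub state: SyncState,
}

impl HotEntry {
    /// Assumption: a freshly loaded entry matches storage.
    pub fn new() -> Self {
        HotEntry { state: SyncState::InSync }
    }

    /// A normal write updates hot metadata only, so storage now lags behind.
    pub fn record_write(&mut self) {
        self.state = SyncState::PendingWrite;
    }

    /// The flush has started writing the Parquet temp file.
    pub fn begin_flush(&mut self) {
        self.state = SyncState::Syncing;
    }

    /// Assumption: success returns to InSync and failure moves to Error.
    pub fn finish_flush(&mut self, persisted: bool) {
        self.state = if persisted {
            SyncState::InSync
        } else {
            SyncState::Error
        };
    }
}
```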

When `manifest.json` Is Actually Written

manifest.json is persisted when KalamDB commits cold-storage state, not on every mutation.

Automatic or manual flushes write Parquet, update manifest segment metadata, and then persist manifest.json.
Reloads after cache misses can hydrate the hot tier from the stored manifest.
Vector-index DDL metadata is a deliberate exception: it is persisted immediately as a control-plane metadata write so vector state is not lost when there is no row flush to trigger storage persistence.

Flush commits a new segment through one canonical path: append the segment metadata, write manifest.json, then refresh RocksDB and the shared-scope memory cache.

Flush-Time Segment Append

For a normal cold append, KalamDB writes batch-N.parquet.tmp, renames it to batch-N.parquet, then appends a new segment entry that includes:

min_seq and max_seq
row_count
size_bytes
schema_version
status
column_stats for indexed columns

last_sequence_number tracks the latest batch-N slot used for future flush naming.
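
As a rough sketch of the append step, the code below derives the temp and final file names for a batch slot and records a segment entry. The `Manifest` and `Segment` structs and the `append_flushed` method are hypothetical and carry only the fields named above (without `column_stats`). The caller supplies the slot number, because how the next slot is derived from last_sequence_number is not specified here.

```rust
/// Segment metadata using the field names listed above.
/// `column_stats` is left out to keep the sketch short.
#[derive(Debug, Clone, PartialEq)]
pub struct Segment {
    pub path: String,
    pub min_seq: u64,
    pub max_seq: u64,
    pub row_count: u64,
    pub size_bytes: u64,
    pub schema_version: u32,
    pub status: String,
}

/// Hypothetical in-memory manifest with only the fields this sketch needs.
#[derive(Debug, Default)]
pub struct Manifest {
    pub segments: Vec<Segment>,
    pub last_sequence_number: u64,
}

/// Temp and final file names for batch slot `n`.
pub fn batch_names(n: u64) -> (String, String) {
    let final_name = format!("batch-{n}.parquet");
    let temp_name = format!("{final_name}.tmp");
    (temp_name, final_name)
}

impl Manifest {
    /// Record a segment whose file has already been renamed into place.
    pub fn append_flushed(
        &mut self,
        n: u64,
        min_seq: u64,
        max_seq: u64,
        row_count: u64,
        size_bytes: u64,
        schema_version: u32,
    ) {
        let (_, path) = batch_names(n);
        self.segments.push(Segment {
            path,
            min_seq,
            max_seq,
            row_count,
            size_bytes,
            schema_version,
            status: "committed".to_string(),
        });
        // Remember the slot so later flushes can pick a fresh batch-N name.
        self.last_sequence_number = n;
    }
}
```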

How Manifest Compaction Works

Post-flush small-segment compaction does not rebuild the whole manifest. It rewrites only the trailing run of small segment files.

The compactor:

selects the trailing run of small, committed, same-schema segments
rewrites the newest MVCC winners into compact-<uuid>.parquet
reacquires the manifest scope lock
verifies the selected inputs are still the current trailing suffix
truncates that suffix and appends the compacted replacement segment, or appends nothing if the suffix was fully pruned
persists the updated manifest.json

If another flush changed the manifest tail while compaction was running, the swap is skipped and the compacted file is deleted.

Compaction filenames do not use the batch-N numbering scheme, so they do not consume a new batch sequence slot.
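
The select, verify, and swap steps can be sketched as two functions over the segment list. This is an illustration under stated assumptions: the `Segment` struct, the `small_limit` size threshold, and both function names are hypothetical, and locking and Parquet rewriting are left out.

```rust
/// Hypothetical segment entry with only the fields the selection needs.
#[derive(Debug, Clone, PartialEq)]
pub struct Segment {
    pub path: String,
    pub size_bytes: u64,
    pub schema_version: u32,
    pub committed: bool,
}

/// Index where the trailing run of small, committed, same-schema segments starts.
/// Returns `segments.len()` when the last segment does not qualify.
pub fn trailing_small_run(segments: &[Segment], small_limit: u64) -> usize {
    let Some(last) = segments.last() else { return 0 };
    let mut start = segments.len();
    while start > 0 {
        let s = &segments[start - 1];
        let qualifies = s.committed
            && s.size_bytes < small_limit
            && s.schema_version == last.schema_version;
        if !qualifies {
            break;
        }
        start -= 1;
    }
    start
}

/// Replace the selected suffix only if it is still the current manifest tail.
/// Returns false when the tail changed, in which case the caller would
/// delete the compacted file instead of committing it.
pub fn swap_suffix(
    segments: &mut Vec<Segment>,
    selected: &[Segment],
    replacement: Option<Segment>,
) -> bool {
    if selected.is_empty() || !segments.ends_with(selected) {
        return false;
    }
    segments.truncate(segments.len() - selected.len());
    // `None` models a suffix that was fully pruned: nothing is appended.
    segments.extend(replacement);
    true
}
```

Comparing the whole selected run against the current tail is what lets the swap be skipped safely when a concurrent flush has appended a segment in the meantime.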

What A Real `manifest.json` Looks Like

The example below is taken from a real user-table manifest written by the KalamDB backend. The values vary by table and flush, but the field names and nesting match the on-disk JSON shape.

```json
{
  "table_id": "flush_manifest_ns_mpczl3q7_s3k_0.user_flush_test_mpczl3q7_s3k_0",
  "user_id": "admin",
  "version": 2,
  "created_at": 1779216537052,
  "updated_at": 1779216537063,
  "segments": [
    {
      "min_seq": 315199164996489218,
      "max_seq": 315199165030043648,
      "row_count": 20,
      "size_bytes": 2711,
      "created_at": 1779216537063,
      "id": "batch-0.parquet",
      "path": "batch-0.parquet",
      "column_stats": {
        "1": {
          "min": { "Int64": "315199164994551810" },
          "max": { "Int64": "315199165028106240" },
          "null_count": 0
        }
      },
      "schema_version": 1,
      "status": "committed"
    }
  ],
  "last_sequence_number": 0,
  "files": null,
  "vector_indexes": {}
}
```

What these fields mean in practice:

segments is the durable list of cold Parquet batches for that exact table scope.
column_stats is keyed by stable column ID, not by column name, so renames do not invalidate old segment metadata.
last_sequence_number tracks the last batch-N.parquet slot used for segment file naming.
files stays null until the table enables FILE-column subfolder tracking.
vector_indexes stays empty until the table has vector index metadata to persist.

Why Primary-Key Checks Use The Manifest Service

Primary-key existence and cold-segment pruning go through the manifest service instead of reopening manifest.json on every check.

That gives KalamDB one source of truth for:

the freshest shared-table manifest already in memory
the freshest user-table manifest already in RocksDB
fallback reload from storage only when the hot tier does not have the manifest yet

This avoids repeated storage reads and keeps pruning decisions aligned with the same manifest state the flush path updates.

It also means manifest-aware reads see the same suffix-replacement result after compaction that the flush and compaction jobs committed.
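
To make the pruning idea concrete, the sketch below filters segments by the min/max statistics recorded for a column ID. It is a hypothetical illustration, not the manifest service's API: `ColumnStats`, `Segment`, and `candidate_segments` are invented names, keys are assumed to be `i64`, and a segment without statistics for the column is conservatively kept.

```rust
use std::collections::HashMap;

/// Hypothetical min/max statistics for one column of one segment.
#[derive(Debug, Clone, Copy)]
pub struct ColumnStats {
    pub min: i64,
    pub max: i64,
}

/// Hypothetical segment entry; `column_stats` is keyed by stable column ID.
#[derive(Debug)]
pub struct Segment {
    pub path: String,
    pub column_stats: HashMap<u32, ColumnStats>,
}

/// Paths of segments that may contain `key` in the column `column_id`.
pub fn candidate_segments<'a>(
    segments: &'a [Segment],
    column_id: u32,
    key: i64,
) -> Vec<&'a str> {
    segments
        .iter()
        .filter(|s| match s.column_stats.get(&column_id) {
            // The key can only be present if it lies inside [min, max].
            Some(stats) => stats.min <= key && key <= stats.max,
            // No statistics recorded: the segment cannot be ruled out.
            None => true,
        })
        .map(|s| s.path.as_str())
        .collect()
}
```

A primary-key existence check would then only need to open the returned segments, which is why pruning against the same manifest state the flush path updates matters.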

Shared vs User Storage Layout

A SHARED table writes one manifest per table path:

```text
storage/<namespace>/<table>/manifest.json
```

A USER table writes one manifest per user-scoped table path: