Manifests
KalamDB uses a manifest as the authoritative index for cold Parquet segments. Every manifest is
owned by the manifest service, which decides where the current copy lives, when it is dirty, and
when the manifest.json in storage is rewritten.
Manifests exist behind the hot/cold flow described in
/docs/server/architecture/storage-tiers. They apply to
USER and SHARED cold-storage scopes; see
/docs/server/architecture/table-types for the table-type
boundary.
Manifest Tiers
| Table type | Hot manifest path | Cold manifest path |
|---|---|---|
| SHARED | process memory -> system.manifest in RocksDB | manifest.json in table storage |
| USER | system.manifest in RocksDB | manifest.json in the user storage path |
This split is intentional:
- SHARED manifests stay in process memory because they are low-cardinality and widely reused.
- USER manifests stay out of process memory so millions of user-scoped manifests do not expand server RSS.
- Both table types still persist their durable hot copy in RocksDB before storage writes.
The canonical read path is ManifestService::get_or_load():
- check the fastest hot tier for the scope
- fall back to RocksDB manifest copy
- fall back to storage manifest.json
- hydrate faster tiers above the answering layer
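
The lookup order above can be sketched as a small in-memory model. This is an illustration only, not the actual ManifestService code: the TieredManifests struct, the Tier enum, the simplified Manifest type, and the shared flag are all hypothetical, and three HashMaps stand in for process memory, the system.manifest copy in RocksDB, and manifest.json in storage.

```rust
use std::collections::HashMap;

/// Which tier answered the lookup (hypothetical, for illustration).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Tier {
    Memory,
    RocksDb,
    Storage,
}

/// Stand-in for a manifest; the real type carries segments and metadata.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Manifest {
    pub version: u64,
}

/// Toy model of the three tiers, each keyed by a scope string.
#[derive(Default)]
pub struct TieredManifests {
    pub memory: HashMap<String, Manifest>,  // used for SHARED scopes only
    pub rocksdb: HashMap<String, Manifest>, // stands in for system.manifest
    pub storage: HashMap<String, Manifest>, // stands in for manifest.json
}

impl TieredManifests {
    /// Returns the manifest and the tier that answered, hydrating every
    /// faster tier above the answering one.
    pub fn get_or_load(&mut self, scope: &str, shared: bool) -> Option<(Manifest, Tier)> {
        // 1. Fastest hot tier: process memory, consulted for SHARED scopes only.
        if shared {
            if let Some(m) = self.memory.get(scope) {
                return Some((m.clone(), Tier::Memory));
            }
        }
        // 2. Fall back to the RocksDB copy.
        if let Some(m) = self.rocksdb.get(scope).cloned() {
            if shared {
                self.memory.insert(scope.to_string(), m.clone());
            }
            return Some((m, Tier::RocksDb));
        }
        // 3. Fall back to storage, then hydrate RocksDB (and memory if SHARED).
        let m = self.storage.get(scope).cloned()?;
        self.rocksdb.insert(scope.to_string(), m.clone());
        if shared {
            self.memory.insert(scope.to_string(), m.clone());
        }
        Some((m, Tier::Storage))
    }
}
```

In this model a USER scope never touches the memory map, which mirrors the tier split in the table above.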
Control Flow
What Happens On Normal Writes
For normal DML and flush-preparation updates, KalamDB does not rewrite manifest.json on every change.
- The write path updates hot data in RocksDB.
- The manifest service updates manifest metadata in the hot tier.
- The manifest is marked pending_write.
- Cold storage is updated later by the flush path.
That means manifest bookkeeping stays cheap on the write path while still keeping the latest authoritative manifest state available for reads, flush planning, and primary-key checks.
The hot manifest entry also tracks a sync state:
- in_sync
- pending_write
- syncing
- stale
- error
During a flush, the scope becomes syncing while the Parquet temp file is being written.
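
One way to picture the sync states is as a small enum with explicit transitions. The sketch below is hypothetical: the SyncState type and its method names are not taken from the KalamDB source. The transitions to pending_write and syncing follow the text above. The outcome of finish_flush (in_sync on success, error on failure) is an assumption, and the conditions that produce stale are not described here, so the sketch does not model them.

```rust
/// The five sync states listed above (type and method names are hypothetical).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SyncState {
    InSync,
    PendingWrite,
    Syncing,
    Stale,
    Error,
}

impl SyncState {
    /// The snake_case label used in the documentation.
    pub fn as_str(self) -> &'static str {
        match self {
            SyncState::InSync => "in_sync",
            SyncState::PendingWrite => "pending_write",
            SyncState::Syncing => "syncing",
            SyncState::Stale => "stale",
            SyncState::Error => "error",
        }
    }

    /// A hot-tier metadata update leaves storage behind the hot copy.
    pub fn after_hot_update(self) -> SyncState {
        SyncState::PendingWrite
    }

    /// The flush path starts writing the Parquet temp file.
    pub fn begin_flush(self) -> SyncState {
        SyncState::Syncing
    }

    /// ASSUMPTION: success returns to in_sync, failure lands in error.
    pub fn finish_flush(self, persisted: bool) -> SyncState {
        if persisted {
            SyncState::InSync
        } else {
            SyncState::Error
        }
    }
}
```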
When manifest.json Is Actually Written
manifest.json is persisted when KalamDB commits cold-storage state, not on every mutation.
- Automatic or manual flushes write Parquet, update manifest segment metadata, and then persist manifest.json.
- Reloads after cache misses can hydrate the hot tier from the stored manifest.
- Vector-index DDL metadata is a deliberate exception: it is persisted immediately as a control-plane metadata write so vector state is not lost when there is no row flush to trigger storage persistence.
Flush commits a new segment through one canonical path: append the segment metadata, write
manifest.json, then refresh RocksDB and shared-scope memory cache.
Flush-Time Segment Append
For a normal cold append, KalamDB writes batch-N.parquet.tmp, renames it to batch-N.parquet,
then appends a new segment entry that includes:
- min_seq and max_seq
- row_count
- size_bytes
- schema_version
- status
- column_stats for indexed columns
last_sequence_number tracks the latest batch-N slot used for future flush naming.
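
The temp-file-then-rename append can be sketched with the standard library alone. This is a simplified illustration, not the backend's flush code: the Segment and Manifest structs, the append_batch function, and its parameters are hypothetical, and column_stats is omitted. The sketch also models last_sequence_number as an Option to sidestep how the real manifest distinguishes "no flush yet" from "slot 0 used", which is not covered here.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Simplified segment entry (column_stats omitted in this sketch).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Segment {
    pub id: String,
    pub min_seq: i64,
    pub max_seq: i64,
    pub row_count: u64,
    pub size_bytes: u64,
    pub schema_version: u32,
    pub status: String,
}

/// Simplified manifest; `None` means no batch slot has been used yet
/// (an assumption of this sketch, not the on-disk representation).
#[derive(Debug, Default)]
pub struct Manifest {
    pub segments: Vec<Segment>,
    pub last_sequence_number: Option<u64>,
}

/// Writes batch-N.parquet.tmp, renames it to batch-N.parquet, then appends
/// the segment entry. Returns the final file name.
pub fn append_batch(
    dir: &Path,
    manifest: &mut Manifest,
    parquet_bytes: &[u8],
    min_seq: i64,
    max_seq: i64,
    row_count: u64,
    schema_version: u32,
) -> io::Result<String> {
    // Next slot follows the last one used.
    let slot = manifest.last_sequence_number.map_or(0, |n| n + 1);
    let final_name = format!("batch-{}.parquet", slot);
    let tmp_path = dir.join(format!("{}.tmp", final_name));
    let final_path = dir.join(&final_name);

    // Write under the temp name first so a partial file never appears
    // under the final name.
    fs::write(&tmp_path, parquet_bytes)?;
    fs::rename(&tmp_path, &final_path)?;

    // Only after the rename succeeds does the manifest learn about the segment.
    manifest.segments.push(Segment {
        id: final_name.clone(),
        min_seq,
        max_seq,
        row_count,
        size_bytes: parquet_bytes.len() as u64,
        schema_version,
        status: "committed".to_string(),
    });
    manifest.last_sequence_number = Some(slot);
    Ok(final_name)
}
```

Persisting manifest.json and refreshing the hot tiers would follow this step; they are left out to keep the sketch focused on the append itself.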
How Manifest Compaction Works
Post-flush small-segment compaction does not rebuild the whole manifest. It rewrites only the trailing suffix of small files.
The compactor:
- selects the trailing run of small, committed, same-schema segments
- rewrites the newest MVCC winners into compact-<uuid>.parquet
- reacquires the manifest scope lock
- verifies the selected inputs are still the current trailing suffix
- truncates that suffix and appends the compacted replacement segment, or appends nothing if the suffix was fully pruned
- persists the updated manifest.json
If another flush changed the manifest tail while compaction was running, the swap is skipped and the compacted file is deleted.
Compaction filenames do not use the batch-N numbering scheme, so they do not consume a new batch
sequence slot.
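
The suffix selection and the guarded swap can be sketched as two small functions over a segment list. The names below (Segment, trailing_small_run_start, swap_suffix, small_threshold) are hypothetical, and locking, the Parquet rewrite, and file deletion are left out. The sketch only shows the "still the current trailing suffix" check that decides whether the swap happens.

```rust
/// Minimal segment view for the sketch (fields are illustrative).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Segment {
    pub id: String,
    pub size_bytes: u64,
    pub schema_version: u32,
    pub committed: bool,
}

/// Returns the index where the trailing run of small, committed,
/// same-schema segments begins. Equals `segments.len()` if the run is empty.
pub fn trailing_small_run_start(segments: &[Segment], small_threshold: u64) -> usize {
    let schema = match segments.last() {
        Some(last) => last.schema_version,
        None => return 0,
    };
    let mut start = segments.len();
    while start > 0 {
        let s = &segments[start - 1];
        if s.committed && s.size_bytes < small_threshold && s.schema_version == schema {
            start -= 1;
        } else {
            break;
        }
    }
    start
}

/// Replaces the selected suffix with the compacted segment, but only if the
/// selected ids are still exactly the tail of the list. Returns false (and
/// leaves the list untouched) when another flush changed the tail.
pub fn swap_suffix(
    segments: &mut Vec<Segment>,
    selected_ids: &[String],
    replacement: Option<Segment>,
) -> bool {
    if selected_ids.is_empty() || selected_ids.len() > segments.len() {
        return false;
    }
    let start = segments.len() - selected_ids.len();
    let still_tail = segments[start..]
        .iter()
        .map(|s| &s.id)
        .eq(selected_ids.iter());
    if !still_tail {
        return false; // caller would delete the compacted file
    }
    segments.truncate(start);
    // `None` models a suffix that was fully pruned: nothing is appended.
    if let Some(seg) = replacement {
        segments.push(seg);
    }
    true
}
```

In this model, a false return corresponds to the skipped swap described above: the manifest is left as the concurrent flush wrote it.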
What A Real manifest.json Looks Like
The example below is taken from a real user-table manifest written by the KalamDB backend. The values vary by table and flush, but the field names and nesting match the on-disk JSON shape.
{
  "table_id": "flush_manifest_ns_mpczl3q7_s3k_0.user_flush_test_mpczl3q7_s3k_0",
  "user_id": "admin",
  "version": 2,
  "created_at": 1779216537052,
  "updated_at": 1779216537063,
  "segments": [
    {
      "min_seq": 315199164996489218,
      "max_seq": 315199165030043648,
      "row_count": 20,
      "size_bytes": 2711,
      "created_at": 1779216537063,
      "id": "batch-0.parquet",
      "path": "batch-0.parquet",
      "column_stats": {
        "1": {
          "min": { "Int64": "315199164994551810" },
          "max": { "Int64": "315199165028106240" },
          "null_count": 0
        }
      },
      "schema_version": 1,
      "status": "committed"
    }
  ],
  "last_sequence_number": 0,
  "files": null,
  "vector_indexes": {}
}
What these fields mean in practice:
- segments is the durable list of cold Parquet batches for that exact table scope.
- column_stats is keyed by stable column ID, not by column name, so renames do not invalidate old segment metadata.
- last_sequence_number tracks the last batch-N.parquet slot used for segment file naming.
- files stays null until the table enables FILE-column subfolder tracking.
- vector_indexes stays empty until the table has vector index metadata to persist.
Why Primary-Key Checks Use The Manifest Service
Primary-key existence and cold-segment pruning go through the manifest service instead of reopening manifest.json on every check.
That gives KalamDB one source of truth for:
- the freshest shared-table manifest already in memory
- the freshest user-table manifest already in RocksDB
- fallback reload from storage only when the hot tier does not have the manifest yet
This avoids repeated storage reads and keeps pruning decisions aligned with the same manifest state the flush path updates.
It also means manifest-aware reads see the same suffix-replacement result after compaction that the flush and compaction jobs committed.
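
As an illustration of cold-segment pruning, the sketch below uses per-segment min/max statistics to skip segments that cannot contain a given key. This is generic min/max pruning under stated assumptions, not KalamDB's actual check: the types and the candidate_segments function are hypothetical, column IDs are modeled as u32 rather than the string keys seen in manifest.json, and only integer statistics are handled.

```rust
use std::collections::HashMap;

/// Min/max statistics for one column in one segment (integers only here).
#[derive(Debug, Clone)]
pub struct ColumnStats {
    pub min: i64,
    pub max: i64,
}

/// Minimal segment view: an id plus stats keyed by stable column ID.
#[derive(Debug, Clone)]
pub struct Segment {
    pub id: String,
    pub column_stats: HashMap<u32, ColumnStats>,
}

/// Returns the ids of segments that may contain `key` in the given column.
/// Segments whose [min, max] range excludes the key are pruned; segments
/// without stats for the column are kept because they cannot be ruled out.
pub fn candidate_segments<'a>(
    segments: &'a [Segment],
    pk_column_id: u32,
    key: i64,
) -> Vec<&'a str> {
    segments
        .iter()
        .filter(|s| match s.column_stats.get(&pk_column_id) {
            Some(stats) => stats.min <= key && key <= stats.max,
            None => true,
        })
        .map(|s| s.id.as_str())
        .collect()
}
```

Because the statistics come from the manifest held by the manifest service, a check like this does not need to reopen manifest.json or the Parquet files it rules out.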
Shared vs User Storage Layout
A SHARED table writes one manifest per table path:
storage/<namespace>/<table>/manifest.json
A USER table writes one manifest per user-scoped table path:
storage/<namespace>/<table>/<user_id>/manifest.json
Each manifest tracks the Parquet segments for that exact scope only.
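
The two layouts can be expressed as a single path-building function. The Scope enum and manifest_path function are hypothetical names for this sketch. The literal storage/ prefix is copied from the patterns above, and how the real storage root is configured is not covered here.

```rust
/// Which cold-storage scope a manifest belongs to (hypothetical type).
#[derive(Debug, Clone, Copy)]
pub enum Scope<'a> {
    Shared,
    User(&'a str), // carries the user_id
}

/// Builds the manifest.json path for a table scope, following the
/// two layout patterns shown above.
pub fn manifest_path(namespace: &str, table: &str, scope: Scope<'_>) -> String {
    match scope {
        Scope::Shared => format!("storage/{}/{}/manifest.json", namespace, table),
        Scope::User(user_id) => {
            format!("storage/{}/{}/{}/manifest.json", namespace, table, user_id)
        }
    }
}
```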