Manifests

KalamDB uses a manifest as the authoritative index for cold Parquet segments. Every manifest is owned by the manifest service, which decides where the current copy lives, when it is dirty, and when the manifest.json in storage is rewritten.

Manifests sit behind the hot/cold flow described in /docs/server/architecture/storage-tiers. They apply to the USER and SHARED cold-storage scopes; see /docs/server/architecture/table-types for the table-type boundary.

Manifest Tiers

| Table type | Hot manifest path | Cold manifest path |
| --- | --- | --- |
| SHARED | process memory -> system.manifest in RocksDB | manifest.json in table storage |
| USER | system.manifest in RocksDB | manifest.json in the user storage path |

This split is intentional:

  • SHARED manifests stay in process memory because they are low-cardinality and widely reused.
  • USER manifests stay out of process memory so millions of user-scoped manifests do not expand server RSS.
  • Both table types still persist their durable hot copy in RocksDB before storage writes.

The canonical read path is ManifestService::get_or_load():

  1. check the fastest hot tier for the scope
  2. fall back to RocksDB manifest copy
  3. fall back to storage manifest.json
  4. hydrate faster tiers above the answering layer
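
The four steps above can be sketched with in-memory maps standing in for each tier. This is only an illustration of the fallback-and-hydrate order, not the real ManifestService::get_or_load() signature: the Tiers struct, the Tier enum, the string scope key, and the is_shared flag are all hypothetical.

```rust
use std::collections::HashMap;

#[derive(Clone, Debug, PartialEq)]
struct Manifest {
    table_id: String,
    version: u64,
}

/// Which layer answered the lookup (hypothetical, for illustration only).
#[derive(Debug, PartialEq)]
enum Tier {
    Memory,
    RocksDb,
    Storage,
}

/// Hypothetical stand-ins: every tier is a plain map keyed by scope.
struct Tiers {
    memory: HashMap<String, Manifest>,  // used for SHARED scopes only
    rocksdb: HashMap<String, Manifest>, // stands in for system.manifest
    storage: HashMap<String, Manifest>, // stands in for manifest.json
}

impl Tiers {
    fn get_or_load(&mut self, scope: &str, is_shared: bool) -> Option<(Manifest, Tier)> {
        // 1. Fastest hot tier: process memory, which only SHARED scopes use.
        if is_shared {
            if let Some(m) = self.memory.get(scope) {
                return Some((m.clone(), Tier::Memory));
            }
        }
        // 2. RocksDB copy (for USER scopes this is already the fastest tier).
        if let Some(m) = self.rocksdb.get(scope).cloned() {
            if is_shared {
                // 4. Hydrate the faster tier above the answering layer.
                self.memory.insert(scope.to_string(), m.clone());
            }
            return Some((m, Tier::RocksDb));
        }
        // 3. Storage manifest.json.
        let m = self.storage.get(scope).cloned()?;
        // 4. Hydrate every faster tier above storage.
        self.rocksdb.insert(scope.to_string(), m.clone());
        if is_shared {
            self.memory.insert(scope.to_string(), m.clone());
        }
        Some((m, Tier::Storage))
    }
}
```

In this sketch a second lookup for the same SHARED scope is answered from memory, while a USER scope never populates the memory map.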

Control Flow

What Happens On Normal Writes

For normal DML and flush-preparation updates, KalamDB does not rewrite manifest.json on every change.

  1. The write path updates hot data in RocksDB.
  2. The manifest service updates manifest metadata in the hot tier.
  3. The manifest is marked pending_write.
  4. Cold storage is updated later by the flush path.

That means manifest bookkeeping stays cheap on the write path while still keeping the latest authoritative manifest state available for reads, flush planning, and primary-key checks.

The hot manifest entry also tracks a sync state:

  • in_sync
  • pending_write
  • syncing
  • stale
  • error

During a flush, the scope becomes syncing while the Parquet temp file is being written.
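
A minimal sketch of how these states could move during writes and flushes. The text above only states that writes mark the entry pending_write and that a flush moves it to syncing; the transitions to in_sync on success and error on failure are assumptions, and HotEntry, pending_updates, and the method names are hypothetical.

```rust
/// The five sync states listed above.
#[allow(dead_code)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum SyncState {
    InSync,
    PendingWrite,
    Syncing,
    Stale,
    Error,
}

/// Hypothetical hot-tier entry: the state plus a count of unflushed updates.
struct HotEntry {
    state: SyncState,
    pending_updates: u32,
}

impl HotEntry {
    /// Normal write: update hot metadata and mark the entry dirty.
    /// No manifest.json write happens here.
    fn record_write(&mut self) {
        self.pending_updates += 1;
        self.state = SyncState::PendingWrite;
    }

    /// Flush start: the scope is syncing while the Parquet temp file is written.
    fn begin_flush(&mut self) {
        self.state = SyncState::Syncing;
    }

    /// Flush end. Assumption: success returns to in_sync, failure goes to error.
    fn finish_flush(&mut self, persisted: bool) {
        if persisted {
            self.pending_updates = 0;
            self.state = SyncState::InSync;
        } else {
            self.state = SyncState::Error;
        }
    }
}
```

The point of the sketch is that many calls to record_write cost only a hot-tier update, and a single flush later pays for the storage write.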

When manifest.json Is Actually Written

manifest.json is persisted when KalamDB commits cold-storage state, not on every mutation.

  • Automatic or manual flushes write Parquet, update manifest segment metadata, and then persist manifest.json.
  • Reloads after cache misses can hydrate the hot tier from the stored manifest.
  • Vector-index DDL metadata is a deliberate exception: it is persisted immediately as a control-plane metadata write so vector state is not lost when there is no row flush to trigger storage persistence.

Flush commits a new segment through one canonical path: append the segment metadata, write manifest.json, then refresh RocksDB and shared-scope memory cache.

Flush-Time Segment Append

For a normal cold append, KalamDB writes batch-N.parquet.tmp, renames it to batch-N.parquet, then appends a new segment entry that includes:

  • min_seq and max_seq
  • row_count
  • size_bytes
  • schema_version
  • status
  • column_stats for indexed columns

last_sequence_number tracks the latest batch-N slot used for future flush naming.
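
The write-temp, rename, then append sequence can be sketched with the standard library alone. This is an illustration under assumptions: append_batch, BatchInfo, and the simplified Segment and Manifest structs are hypothetical, the bytes are not real Parquet, column_stats is omitted, and setting status to "committed" right after the rename is assumed from the example manifest rather than confirmed.

```rust
use std::fs;
use std::io;
use std::path::Path;

#[derive(Debug, Clone, PartialEq)]
struct Segment {
    path: String,
    min_seq: i64,
    max_seq: i64,
    row_count: u64,
    size_bytes: u64,
    schema_version: u32,
    status: String,
}

#[derive(Debug, Default)]
struct Manifest {
    segments: Vec<Segment>,
    last_sequence_number: u64,
}

/// Hypothetical per-batch metadata supplied by the flush job.
struct BatchInfo {
    min_seq: i64,
    max_seq: i64,
    row_count: u64,
    schema_version: u32,
}

fn append_batch(
    dir: &Path,
    manifest: &mut Manifest,
    n: u64,
    bytes: &[u8],
    info: BatchInfo,
) -> io::Result<()> {
    let final_name = format!("batch-{}.parquet", n);
    let tmp_path = dir.join(format!("{}.tmp", final_name));
    let final_path = dir.join(&final_name);

    // Write the temp file first so readers never see a partial batch-N.parquet.
    fs::write(&tmp_path, bytes)?;
    // Rename the temp file to its final name.
    fs::rename(&tmp_path, &final_path)?;

    // Only after the file is in place does the manifest gain a segment entry.
    manifest.segments.push(Segment {
        path: final_name,
        min_seq: info.min_seq,
        max_seq: info.max_seq,
        row_count: info.row_count,
        size_bytes: bytes.len() as u64,
        schema_version: info.schema_version,
        status: "committed".to_string(), // assumption, see lead-in
    });
    // Record the batch-N slot that was just used.
    manifest.last_sequence_number = n;
    Ok(())
}
```

If the write or the rename fails, the function returns early and the manifest is left untouched, which is the property the ordering is meant to provide.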

How Manifest Compaction Works

Post-flush small-segment compaction does not rebuild the whole manifest. It rewrites only the trailing suffix of small segments.

The compactor:

  1. selects the trailing run of small, committed, same-schema segments
  2. rewrites the newest MVCC winners into compact-<uuid>.parquet
  3. reacquires the manifest scope lock
  4. verifies the selected inputs are still the current trailing suffix
  5. truncates that suffix and appends the compacted replacement segment, or appends nothing if the suffix was fully pruned
  6. persists the updated manifest.json

If another flush changed the manifest tail while compaction was running, the swap is skipped and the compacted file is deleted.

Compaction filenames do not use the batch-N numbering scheme, so they do not consume a new batch sequence slot.
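
The suffix selection and the verify-then-swap step can be sketched as two small functions. This is a sketch under assumptions: the size threshold small_limit, the minimum run length of two segments, the committed flag (standing in for the status field), and both function names are hypothetical; the real selection rules may differ.

```rust
#[derive(Debug, Clone, PartialEq)]
struct Segment {
    path: String,
    size_bytes: u64,
    schema_version: u32,
    committed: bool,
}

/// Returns the index where the trailing run of small, committed,
/// same-schema segments starts, or None if the run is shorter than two.
fn trailing_small_suffix(segments: &[Segment], small_limit: u64) -> Option<usize> {
    let last = segments.last()?;
    let mut start = segments.len();
    while start > 0 {
        let s = &segments[start - 1];
        let eligible = s.committed
            && s.size_bytes < small_limit
            && s.schema_version == last.schema_version;
        if !eligible {
            break;
        }
        start -= 1;
    }
    if segments.len() - start >= 2 {
        Some(start)
    } else {
        None
    }
}

/// Runs with the manifest scope lock held. Swaps only if `selected` is still
/// the current trailing suffix; otherwise leaves the manifest untouched so the
/// caller can delete the compacted file.
fn swap_suffix(
    segments: &mut Vec<Segment>,
    selected: &[Segment],
    replacement: Option<Segment>,
) -> bool {
    if selected.is_empty() || !segments.ends_with(selected) {
        return false;
    }
    segments.truncate(segments.len() - selected.len());
    // `None` models a fully pruned suffix: nothing is appended.
    segments.extend(replacement);
    true
}
```

When a flush appends a new batch while compaction is running, swap_suffix returns false in this sketch because the selected inputs are no longer at the tail.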

What A Real manifest.json Looks Like

The example below is taken from a real user-table manifest written by the KalamDB backend. The values vary by table and flush, but the field names and nesting match the on-disk JSON shape.

```json
{
  "table_id": "flush_manifest_ns_mpczl3q7_s3k_0.user_flush_test_mpczl3q7_s3k_0",
  "user_id": "admin",
  "version": 2,
  "created_at": 1779216537052,
  "updated_at": 1779216537063,
  "segments": [
    {
      "min_seq": 315199164996489218,
      "max_seq": 315199165030043648,
      "row_count": 20,
      "size_bytes": 2711,
      "created_at": 1779216537063,
      "id": "batch-0.parquet",
      "path": "batch-0.parquet",
      "column_stats": {
        "1": {
          "min": { "Int64": "315199164994551810" },
          "max": { "Int64": "315199165028106240" },
          "null_count": 0
        }
      },
      "schema_version": 1,
      "status": "committed"
    }
  ],
  "last_sequence_number": 0,
  "files": null,
  "vector_indexes": {}
}
```

What these fields mean in practice:

  • segments is the durable list of cold Parquet batches for that exact table scope.
  • column_stats is keyed by stable column ID, not by column name, so renames do not invalidate old segment metadata.
  • last_sequence_number tracks the last batch-N.parquet slot used for segment file naming.
  • files stays null until the table enables FILE-column subfolder tracking.
  • vector_indexes stays empty until the table has vector index metadata to persist.

Why Primary-Key Checks Use The Manifest Service

Primary-key existence and cold-segment pruning go through the manifest service instead of reopening manifest.json on every check.

That gives KalamDB one source of truth for:

  • the freshest shared-table manifest already in memory
  • the freshest user-table manifest already in RocksDB
  • fallback reload from storage only when the hot tier does not have the manifest yet

This avoids repeated storage reads and keeps pruning decisions aligned with the same manifest state the flush path updates.

It also means manifest-aware reads see the same suffix-replacement result after compaction that the flush and compaction jobs committed.
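
As an illustration of cold-segment pruning, the sketch below uses per-segment min/max statistics, keyed by column ID, to decide which segments could contain a given primary key. It assumes an integer key and simplifies the typed min/max values from the manifest into plain i64; candidate_segments and the struct shapes are hypothetical, and the real manifest service may prune differently.

```rust
use std::collections::HashMap;

/// Simplified stats: the on-disk JSON wraps min/max in a typed value.
struct ColumnStats {
    min: i64,
    max: i64,
}

struct Segment {
    path: String,
    /// Keyed by stable column ID, not by column name.
    column_stats: HashMap<u32, ColumnStats>,
}

/// Returns the paths of segments that might contain `pk`.
/// A segment without stats for the column cannot be ruled out, so it is kept.
fn candidate_segments<'a>(
    segments: &'a [Segment],
    pk_column_id: u32,
    pk: i64,
) -> Vec<&'a str> {
    segments
        .iter()
        .filter(|s| match s.column_stats.get(&pk_column_id) {
            Some(stats) => stats.min <= pk && pk <= stats.max,
            None => true,
        })
        .map(|s| s.path.as_str())
        .collect()
}
```

Only the returned segments would need to be opened for an existence check; the rest are skipped on the basis of manifest metadata alone.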

Shared vs User Storage Layout

A SHARED table writes one manifest per table path:

```text
storage/<namespace>/<table>/manifest.json
```

A USER table writes one manifest per user-scoped table path:

```text
storage/<namespace>/<table>/<user_id>/manifest.json
```

Each manifest tracks the Parquet segments for that exact scope only.
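
The two layouts can be expressed as a single path builder. This is a minimal sketch: the Scope enum and manifest_path function are hypothetical, and the literal storage root is taken from the layouts shown above rather than from any real configuration.

```rust
/// Hypothetical scope selector for the two table types.
enum Scope<'a> {
    Shared,
    User(&'a str),
}

/// Builds the manifest.json path for a table scope.
fn manifest_path(namespace: &str, table: &str, scope: Scope) -> String {
    match scope {
        // One manifest per table path.
        Scope::Shared => format!("storage/{}/{}/manifest.json", namespace, table),
        // One manifest per user-scoped table path.
        Scope::User(user_id) => {
            format!("storage/{}/{}/{}/manifest.json", namespace, table, user_id)
        }
    }
}
```

For example, manifest_path("app", "notes", Scope::User("admin")) yields storage/app/notes/admin/manifest.json.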
