Storage Tiers

KalamDB uses a dual-tier storage architecture that keeps the write path hot in RocksDB while moving queryable historical state into immutable Parquet segments.

This page describes how USER and SHARED tables move from hot to cold storage. For the table-type rules that decide whether a table can flush at all, see /docs/server/architecture/table-types.

Hot Tier (RocksDB)

The hot tier handles incoming writes in RocksDB-backed table stores.

Characteristics:

⚡ Sub-millisecond write latency
Optimized for point lookups, recent writes, and write-heavy workloads
Holds the newest live rows and the newest delete tombstones
Feeds the flush scheduler through manifest scopes marked pending_write

Cold Tier (Parquet)

Flushed data is written to Apache Parquet files for efficient scans, pruning, and long-term storage.

Characteristics:

📊 Columnar format for efficient analytics
High compression ratios
Each segment is tracked in manifest.json and in a hot manifest copy
Supports multiple storage backends (local, S3, Azure, GCS)

Parquet Compression

Table COMPRESSION is a cold-tier Parquet setting for USER and SHARED tables. KalamDB currently supports three values:

Value	Meaning	Typical use
`none`	Writes uncompressed Parquet pages.	Debugging, CPU-constrained flushes, or data already compressed at the value level
`snappy`	Fast Parquet compression and decompression with low CPU overhead. This is the default.	General workloads and latency-sensitive flushes
`zstd`	Zstandard Parquet compression at level 1 for better storage density with modest extra CPU.	Larger historical datasets or storage-cost-sensitive tables

The selected codec is applied when USER and SHARED table scopes write batch-N.parquet.tmp and when tail compaction writes compact-<uuid>.parquet.tmp. After the atomic rename, readers discover the codec from the Parquet footer, so no manifest-side decompression setting is needed. This setting does not affect RocksDB hot storage, WebSocket gzip, backups, or stream table log files. STREAM tables do not accept table Parquet compression because current stream storage does not flush to Parquet cold segments.
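The write-to-`.tmp`-then-rename pattern mentioned above can be sketched as follows. This is a minimal illustration that assumes a local filesystem backend; `write_segment_atomically` is a hypothetical helper, not a KalamDB API, and the payload is a placeholder for the encoded Parquet bytes.

```python
import os
import tempfile


def write_segment_atomically(directory: str, final_name: str, payload: bytes) -> str:
    """Write payload to '<final_name>.tmp', then rename it into place.

    Hypothetical helper: a reader never sees a half-written segment because
    the final name only appears once the rename has succeeded.
    """
    tmp_path = os.path.join(directory, final_name + ".tmp")
    final_path = os.path.join(directory, final_name)
    with open(tmp_path, "wb") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())  # push the bytes to disk before renaming
    os.replace(tmp_path, final_path)  # atomic when both paths share a filesystem
    return final_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as scope_dir:
        # b"PAR1" stands in for real Parquet content.
        print(write_segment_atomically(scope_dir, "batch-0.parquet", b"PAR1"))
```

If the process crashes before the rename, only the `.tmp` file is left behind and the committed file name never points at partial data.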

End-to-End Tier Flow

Flush Policy

Tables are configured with a flush policy that determines when data moves from hot to cold tier:

```sql
CREATE TABLE app.messages (
  id BIGINT PRIMARY KEY DEFAULT SNOWFLAKE_ID(),
  content TEXT NOT NULL,
  created_at TIMESTAMP DEFAULT NOW()
) WITH (
  TYPE = 'USER',
  FLUSH_POLICY = 'rows:1000,interval:60'
);
```

Policy	Description
`rows:N`	Flush after N rows have accumulated
`interval:N`	Flush every N seconds
`rows:N,interval:N`	Flush on whichever threshold is hit first
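To make the "whichever threshold is hit first" rule concrete, here is a small sketch of how a policy string could be parsed and evaluated. `FlushPolicy`, `parse_flush_policy`, and `should_flush` are hypothetical names for illustration; the sketch does not model how the server treats an empty scope when only the interval has elapsed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FlushPolicy:
    rows: Optional[int] = None      # flush once this many rows are buffered
    interval: Optional[int] = None  # flush once this many seconds have passed


def parse_flush_policy(spec: str) -> FlushPolicy:
    """Parse 'rows:N', 'interval:N', or 'rows:N,interval:N'."""
    policy = FlushPolicy()
    for part in spec.split(","):
        key, _, value = part.strip().partition(":")
        if key not in ("rows", "interval") or not value.isdigit():
            raise ValueError(f"unsupported flush policy component: {part!r}")
        setattr(policy, key, int(value))
    return policy


def should_flush(policy: FlushPolicy, buffered_rows: int, seconds_since_flush: float) -> bool:
    """Return True as soon as either configured threshold is reached."""
    rows_hit = policy.rows is not None and buffered_rows >= policy.rows
    time_hit = policy.interval is not None and seconds_since_flush >= policy.interval
    return rows_hit or time_hit
```

With `rows:1000,interval:60`, a scope holding 1000 buffered rows qualifies immediately, while a scope holding 10 rows qualifies once 60 seconds have passed.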

How Flushing Works (Current Engine Path)

The flush flow is job-driven and crash-safe:

DML writes land in RocksDB first.
The manifest scope is marked pending_write.
STORAGE FLUSH TABLE or STORAGE FLUSH ALL creates background flush jobs in system.jobs.
The flush executor resolves the target scope and scans hot rows in bounded batches.
The latest _seq wins per primary key. Latest tombstones stay hot instead of being written to Parquet.
FlushScopeWriter writes batch-N.parquet.tmp.
The temp file is atomically renamed to batch-N.parquet.
Manifest metadata for the new segment is persisted through the manifest service.
Flushed hot rows are deleted from RocksDB in bounded cleanup batches.
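The version-resolution step in the flow above (latest `_seq` wins, latest tombstones stay hot) can be sketched like this. `HotRow` and `plan_flush` are hypothetical names, and the row shape is simplified to a primary key, a sequence number, a delete flag, and one value.

```python
from typing import Dict, Iterable, List, NamedTuple, Tuple


class HotRow(NamedTuple):
    pk: int        # primary key
    seq: int       # stands in for the _seq column
    deleted: bool  # True for a delete tombstone
    value: str


def plan_flush(rows: Iterable[HotRow]) -> Tuple[List[HotRow], List[HotRow]]:
    """Split the winning version of each key into (write to Parquet, keep hot)."""
    latest: Dict[int, HotRow] = {}
    for row in rows:
        current = latest.get(row.pk)
        if current is None or row.seq > current.seq:
            latest[row.pk] = row  # the higher _seq wins for this primary key
    to_parquet = [r for r in latest.values() if not r.deleted]
    stay_hot = [r for r in latest.values() if r.deleted]  # latest tombstones
    return to_parquet, stay_hot
```

For a key with versions at `_seq` 1 and 3, only the `_seq` 3 row is written. For a key whose newest version is a tombstone, nothing is written and the tombstone remains in the hot tier.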

Notes:

STREAM and SYSTEM tables are not part of this flush-to-Parquet path.
STORAGE FLUSH is asynchronous; monitor progress via system.jobs.
Shared scopes flush once per table scope. User scopes flush once per (table, user) scope.

Flush State Machine

Post-Flush Small-Segment Compaction

KalamDB has an optional post-flush tail-compaction path, configured under [flush.compaction]:

```toml
[flush.compaction]
enabled = false
min_eligible_segments = 5
max_segments_per_run = 8
user_max_segment_rows = 10000
shared_max_segment_rows = 25000
```

This path is separate from the flush critical section. After a successful flush, the leader may enqueue a segment_compact job for the scopes that actually wrote Parquet files.

The compactor:

walks the manifest from newest to oldest
selects only the trailing run of small, committed, same-schema segments
stops when it hits an already-large segment, a non-readable segment, or a schema-version boundary
rewrites only the latest MVCC winners into compact-<uuid>.parquet
swaps the manifest tail only if the selected suffix is still unchanged

If a newer flush changed the manifest tail while compaction was running, the compacted output is discarded and the existing manifest remains authoritative.
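The selection and swap rules above can be sketched as follows. This is an illustration under stated assumptions, not the engine's code: `Segment`, `select_tail`, and `swap_tail` are hypothetical names, and the sketch assumes that `*_max_segment_rows` is the row count at or above which a segment counts as already large.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Segment:
    id: str
    row_count: int
    schema_version: int
    status: str = "committed"
    readable: bool = True


def select_tail(segments: List[Segment], max_segment_rows: int,
                min_eligible_segments: int = 5,
                max_segments_per_run: int = 8) -> List[Segment]:
    """Pick the trailing run of small segments (input and output oldest first)."""
    selected: List[Segment] = []
    for seg in reversed(segments):  # walk newest to oldest
        if len(selected) == max_segments_per_run:
            break
        if seg.status != "committed" or not seg.readable:
            break  # uncommitted or non-readable segment ends the run
        if seg.row_count >= max_segment_rows:
            break  # already-large segment ends the run (assumed threshold)
        if selected and seg.schema_version != selected[0].schema_version:
            break  # schema-version boundary ends the run
        selected.append(seg)
    if len(selected) < min_eligible_segments:
        return []
    selected.reverse()
    return selected


def swap_tail(manifest: List[Segment], selected: List[Segment],
              compacted: Segment) -> List[Segment]:
    """Replace the selected suffix only if the manifest still ends with it."""
    n = len(selected)
    if n == 0 or manifest[-n:] != selected:
        return manifest  # tail changed: discard the compacted output
    return manifest[:-n] + [compacted]
```

The final comparison is what keeps compaction outside the flush critical section: a flush that lands during compaction changes the tail, so the swap becomes a no-op and the existing manifest stays in place.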

Only USER and SHARED tables participate in this mechanism.

Manual Flush & Compaction

```sql
-- Flush a specific table
STORAGE FLUSH TABLE myapp.messages;

-- Flush all tables in a namespace
STORAGE FLUSH ALL IN myapp;

-- Compact cold storage
STORAGE COMPACT TABLE myapp.messages;

-- Check storage health
STORAGE CHECK local EXTENDED;
```

S3 User-Data Example

For MinIO-compatible local S3 setup, see /docs/server/integrations/minio.

Create an S3 storage and bind a USER table to it:

```sql
CREATE STORAGE s3_prod
  TYPE 's3'
  BUCKET 'my-kalamdb-prod'
  REGION 'us-east-1'
  USER_TABLES_TEMPLATE 'users/{namespace}/{tableName}/{userId}'
  SHARED_TABLES_TEMPLATE 'shared/{namespace}/{tableName}';

CREATE TABLE chat.messages (
  id BIGINT PRIMARY KEY DEFAULT SNOWFLAKE_ID(),
  conversation_id BIGINT NOT NULL,
  content TEXT NOT NULL,
  created_at TIMESTAMP DEFAULT NOW()
) WITH (
  TYPE = 'USER',
  STORAGE_ID = 's3_prod',
  FLUSH_POLICY = 'rows:1000,interval:60'
);
```

For user u_42, cold-tier objects are stored under the user template:

```text
s3://my-kalamdb-prod/users/chat/messages/u_42/manifest.json
s3://my-kalamdb-prod/users/chat/messages/u_42/batch-0.parquet
s3://my-kalamdb-prod/users/chat/messages/u_42/batch-1.parquet
```
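The mapping from path template to object key can be illustrated with plain string substitution. This is only a sketch: `resolve_prefix` and `object_uri` are hypothetical helpers, and the server's actual template handling may differ.

```python
from typing import Optional


def resolve_prefix(template: str, namespace: str, table_name: str,
                   user_id: Optional[str] = None) -> str:
    """Fill the {namespace}, {tableName}, and {userId} placeholders."""
    values = {"namespace": namespace, "tableName": table_name}
    if user_id is not None:
        values["userId"] = user_id
    # A template that needs {userId} raises KeyError when no user is given.
    return template.format(**values)


def object_uri(bucket: str, prefix: str, file_name: str) -> str:
    """Join bucket, scope prefix, and file name into an s3:// URI."""
    return f"s3://{bucket}/{prefix}/{file_name}"


user_prefix = resolve_prefix("users/{namespace}/{tableName}/{userId}",
                             "chat", "messages", "u_42")
print(object_uri("my-kalamdb-prod", user_prefix, "batch-0.parquet"))
```

Because `{userId}` is part of the user template, each user's manifest and segments land under a separate prefix.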

Illustrative manifest excerpt:

```json
{
  "segments": [
    {
      "id": "batch-1.parquet",
      "path": "batch-1.parquet",
      "row_count": 1000,
      "size_bytes": 184320,
      "schema_version": 1,
      "status": "committed"
    }
  ],
  "last_sequence_number": 1
}
```
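As a rough illustration of how a tool might read such a manifest, the sketch below totals rows and bytes across committed segments. The field names come from the illustrative excerpt above; a real manifest may carry more fields, and `summarize_manifest` is a hypothetical helper.

```python
import json
from typing import Dict

# Same content as the illustrative excerpt.
MANIFEST_TEXT = """
{
  "segments": [
    {
      "id": "batch-1.parquet",
      "path": "batch-1.parquet",
      "row_count": 1000,
      "size_bytes": 184320,
      "schema_version": 1,
      "status": "committed"
    }
  ],
  "last_sequence_number": 1
}
"""


def summarize_manifest(manifest: dict) -> Dict[str, int]:
    """Count segments, rows, and bytes, considering committed segments only."""
    committed = [s for s in manifest["segments"] if s["status"] == "committed"]
    return {
        "segments": len(committed),
        "rows": sum(s["row_count"] for s in committed),
        "bytes": sum(s["size_bytes"] for s in committed),
    }


summary = summarize_manifest(json.loads(MANIFEST_TEXT))
print(summary)
```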

Per-User Storage Isolation