Storage Tiers
KalamDB uses a dual-tier storage architecture that keeps the write path hot in RocksDB while moving queryable historical state into immutable Parquet segments.
This page describes how USER and SHARED tables move from hot to cold storage. For the table-type
rules that decide whether a table can flush at all, see
/docs/server/architecture/table-types.
Hot Tier (RocksDB)
The hot tier handles incoming writes in RocksDB-backed table stores.
Characteristics:
- ⚡ Sub-millisecond write latency
- Optimized for point lookups, recent writes, and write-heavy workloads
- Holds the newest live rows and the newest delete tombstones
- Feeds the flush scheduler through manifest scopes marked `pending_write`
Cold Tier (Parquet)
Flushed data is written to Apache Parquet files for efficient scans, pruning, and long-term storage.
Characteristics:
- 📊 Columnar format for efficient analytics
- High compression ratios
- Each segment tracked in `manifest.json` and a hot manifest copy
- Supports multiple storage backends (local, S3, Azure, GCS)
Parquet Compression
Table COMPRESSION is a cold-tier Parquet setting for USER and SHARED tables. KalamDB
currently supports three values:
| Value | Meaning | Typical use |
|---|---|---|
| `none` | Writes uncompressed Parquet pages. | Debugging, CPU-constrained flushes, or data already compressed at the value level |
| `snappy` | Fast Parquet compression and decompression with low CPU overhead. This is the default. | General workloads and latency-sensitive flushes |
| `zstd` | Zstandard Parquet compression at level 1 for better storage density with modest extra CPU. | Larger historical datasets or storage-cost-sensitive tables |
The selected codec is applied when USER and SHARED table scopes write batch-N.parquet.tmp and
when tail compaction writes compact-<uuid>.parquet.tmp. After the atomic rename, readers discover
the codec from the Parquet footer, so no manifest-side decompression setting is needed. This setting
does not affect RocksDB hot storage, WebSocket gzip, backups, or stream table log files. STREAM
tables do not accept table Parquet compression because current stream storage does not flush to
Parquet cold segments.
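The write-to-temp-then-rename pattern mentioned above can be sketched with standard-library calls. This is a minimal illustration under stated assumptions, not KalamDB's implementation: `write_segment_atomically` is a hypothetical helper, and the payload is treated as opaque bytes where a real writer would emit Parquet with the table's codec.

```python
import os


def write_segment_atomically(directory: str, final_name: str, payload: bytes) -> str:
    """Write payload to '<final_name>.tmp', then rename it to '<final_name>'.

    Hypothetical helper: readers that only look for the final name never
    observe a partially written file.
    """
    tmp_path = os.path.join(directory, final_name + ".tmp")
    final_path = os.path.join(directory, final_name)
    with open(tmp_path, "wb") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())  # push the bytes to disk before publishing the name
    # os.replace is an atomic rename when both paths are on the same filesystem.
    os.replace(tmp_path, final_path)
    return final_path
```

Because the final name appears only after the file is complete, a crash before the rename leaves at most a stray `.tmp` file rather than a truncated segment.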
End-to-End Tier Flow
Flush Policy
Tables are configured with a flush policy that determines when data moves from hot to cold tier:
| Policy | Description |
|---|---|
| `rows:N` | Flush after N rows accumulated |
| `interval:N` | Flush every N seconds |
| `rows:N,interval:N` | Flush on whichever threshold is hit first |
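To make the three policy forms concrete, here is a small sketch of how such a policy string could be parsed and evaluated. `FlushPolicy`, `parse_flush_policy`, and `flush_due` are hypothetical names, and the "greater than or equal" threshold comparison is an assumption; the actual scheduler may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FlushPolicy:
    rows: Optional[int] = None      # flush once this many rows have accumulated
    interval: Optional[int] = None  # flush once this many seconds have elapsed


def parse_flush_policy(text: str) -> FlushPolicy:
    """Parse 'rows:N', 'interval:N', or 'rows:N,interval:N'."""
    values = {}
    for part in text.split(","):
        key, sep, raw = part.strip().partition(":")
        if not sep or key not in ("rows", "interval") or key in values:
            raise ValueError(f"invalid flush policy: {text!r}")
        number = int(raw)  # raises ValueError for non-numeric thresholds
        if number <= 0:
            raise ValueError(f"threshold must be positive: {text!r}")
        values[key] = number
    return FlushPolicy(**values)


def flush_due(policy: FlushPolicy, buffered_rows: int, seconds_since_flush: float) -> bool:
    """Return True when any configured threshold has been reached."""
    rows_hit = policy.rows is not None and buffered_rows >= policy.rows
    time_hit = policy.interval is not None and seconds_since_flush >= policy.interval
    return rows_hit or time_hit
```

With a combined policy such as `rows:1000,interval:60`, either condition alone is enough to trigger a flush, which matches the "whichever threshold is hit first" row in the table.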
How Flushing Works (Current Engine Path)
The flush flow is job-driven and crash-safe:
- DML writes land in RocksDB first.
- The manifest scope is marked `pending_write`.
- `STORAGE FLUSH TABLE` or `STORAGE FLUSH ALL` creates background flush jobs in `system.jobs`.
- The flush executor resolves the target scope and scans hot rows in bounded batches.
- The latest `_seq` wins per primary key. Latest tombstones stay hot instead of being written to Parquet.
- `FlushScopeWriter` writes `batch-N.parquet.tmp`.
- The temp file is atomically renamed to `batch-N.parquet`.
- Manifest metadata for the new segment is persisted through the manifest service.
- Flushed hot rows are deleted from RocksDB in bounded cleanup batches.
Notes:
- `STREAM` and `SYSTEM` tables are not part of this flush-to-Parquet path.
- `STORAGE FLUSH` is asynchronous; monitor progress via `system.jobs`.
- Shared scopes flush once per table scope. User scopes flush once per `(table, user)` scope.
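The "latest `_seq` wins per primary key" rule from the flush steps can be illustrated with a short sketch. The `HotRow` type and `select_for_flush` function are hypothetical and work on in-memory rows; the real executor scans RocksDB in bounded batches.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class HotRow:
    pk: str
    seq: int       # stands in for the `_seq` version column
    deleted: bool  # True for a delete tombstone
    value: str = ""


def select_for_flush(rows: Iterable[HotRow]) -> Tuple[List[HotRow], List[HotRow]]:
    """Return (rows_to_write, tombstones_kept_hot) for one scope."""
    winners = {}
    for row in rows:
        current = winners.get(row.pk)
        # Keep only the highest sequence number seen for each primary key.
        if current is None or row.seq > current.seq:
            winners[row.pk] = row
    ordered = sorted(winners.values(), key=lambda r: r.pk)
    to_write = [r for r in ordered if not r.deleted]  # live winners go to Parquet
    kept_hot = [r for r in ordered if r.deleted]      # latest tombstones stay hot
    return to_write, kept_hot
```

In this sketch a key whose newest version is a tombstone contributes nothing to the Parquet batch, so older live versions of that key are never written to the new segment.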
Flush State Machine
Post-Flush Small-Segment Compaction
KalamDB has an optional post-flush tail-compaction path, configured under `[flush.compaction]`.
This path is separate from the flush critical section. After a successful flush, the leader may
enqueue a segment_compact job for the scopes that actually wrote Parquet files.
The compactor:
- walks the manifest from newest to oldest
- selects only the trailing run of small, committed, same-schema segments
- stops when it hits an already-large segment, a non-readable segment, or a schema-version boundary
- rewrites only the latest MVCC winners into `compact-<uuid>.parquet`
- swaps the manifest tail only if the selected suffix is still unchanged
If a newer flush changed the manifest tail while compaction was running, the compacted output is discarded and the existing manifest remains authoritative.
Only USER and SHARED tables participate in this mechanism.
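The selection and swap rules above can be sketched as two small functions. Everything here is illustrative: `Segment`, `select_tail`, `swap_tail`, and the `small_limit` threshold are hypothetical names, and the manifest is modeled as a plain list ordered oldest to newest.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Segment:
    path: str
    size_bytes: int
    schema_version: int
    committed: bool = True
    readable: bool = True


def select_tail(manifest: List[Segment], small_limit: int) -> List[Segment]:
    """Pick the trailing run of small, committed, same-schema segments."""
    run: List[Segment] = []
    for seg in reversed(manifest):  # walk newest to oldest
        if seg.size_bytes >= small_limit or not seg.readable or not seg.committed:
            break  # already-large or unusable segment ends the run
        if run and seg.schema_version != run[0].schema_version:
            break  # schema-version boundary ends the run
        run.append(seg)
    run.reverse()  # return in manifest order (oldest first)
    return run


def swap_tail(manifest: List[Segment], selected: List[Segment],
              compacted: Segment) -> List[Segment]:
    """Replace the selected suffix only if it is still the manifest tail."""
    if not selected or manifest[-len(selected):] != selected:
        return manifest  # tail changed: discard the compacted output
    return manifest[:-len(selected)] + [compacted]
```

The suffix comparison in `swap_tail` plays the role of a compare-and-swap: a flush that appended a segment during compaction makes the check fail, and the existing manifest is returned unchanged.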
Manual Flush & Compaction
S3 User-Data Example
Create an S3 storage and bind a USER table to it:
For MinIO-compatible local S3 setup, see /docs/server/integrations/minio.
For user u_42, cold-tier objects are stored under the user template:
Illustrative manifest excerpt:
Per-User Storage Isolation
Path layout is controlled by:
- `storage.shared_tables_template` (default: `{namespace}/{tableName}`)
- `storage.user_tables_template` (default: `{namespace}/{tableName}/{userId}`)
Each user’s cold-tier data can still live in a separate directory. This enables:
- Trivial data export — just copy the user’s directory
- Instant deletion — remove the directory for GDPR compliance
- Independent scaling — no cross-user interference
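As a rough illustration of how the default templates listed above expand, the sketch below substitutes placeholders with Python's `str.format`. `render_path` is a hypothetical helper, and the namespace `app` and table `messages` in the usage note are made-up values; only the template strings come from this page.

```python
SHARED_TEMPLATE = "{namespace}/{tableName}"
USER_TEMPLATE = "{namespace}/{tableName}/{userId}"


def render_path(template: str, namespace: str, table_name: str, user_id: str = "") -> str:
    """Expand a storage path template; unused placeholders are simply ignored."""
    if "{userId}" in template and not user_id:
        raise ValueError("user templates require a user id")
    return template.format(namespace=namespace, tableName=table_name, userId=user_id)
```

For example, `render_path(USER_TEMPLATE, "app", "messages", "u_42")` yields `app/messages/u_42`, so all of that user's cold-tier objects share one directory prefix that can be copied or removed as a unit.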
For the manifest-layer details behind this flow, see /docs/server/architecture/manifests.