Storage Tiers
KalamDB uses a dual-tier storage architecture that balances write speed with query efficiency and long-term retention.
Hot Tier (RocksDB)
The hot tier handles all incoming writes with sub-millisecond latency using RocksDB column families.
Characteristics:
- ⚡ Sub-millisecond write latency
- Organized as column families per table
- Optimized for point lookups and recent data
- Data is buffered here before flushing to cold tier
Cold Tier (Parquet)
Flushed data is written to Apache Parquet files for efficient analytical queries and long-term storage.
Characteristics:
- 📊 Columnar format for efficient analytics
- High compression ratios
- Each segment tracked in `manifest.json`
- Supports multiple storage backends (local, S3, Azure, GCS)
End-to-End Tier Flow
Flush Policy
Tables are configured with a flush policy that determines when data moves from hot to cold tier:

```sql
CREATE TABLE app.messages (
  id BIGINT PRIMARY KEY DEFAULT SNOWFLAKE_ID(),
  content TEXT NOT NULL,
  created_at TIMESTAMP DEFAULT NOW()
) WITH (
  TYPE = 'USER',
  FLUSH_POLICY = 'rows:1000,interval:60'
);
```

| Policy | Description |
|---|---|
| `rows:N` | Flush after N rows accumulated |
| `interval:N` | Flush every N seconds |
| `rows:N,interval:N` | Flush on whichever threshold is hit first |
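
The "whichever threshold is hit first" rule can be illustrated with a small Python sketch. This is not KalamDB's parser: `FlushPolicy`, `parse_flush_policy`, and `should_flush` are hypothetical names, and the validation rules (positive integers, no duplicate keys) are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FlushPolicy:
    """Parsed form of a policy string such as 'rows:1000,interval:60'."""
    rows: Optional[int] = None      # flush after this many buffered rows
    interval: Optional[int] = None  # flush after this many seconds


def parse_flush_policy(text: str) -> FlushPolicy:
    """Parse 'rows:N', 'interval:N', or 'rows:N,interval:N' (assumed grammar)."""
    values = {}
    for part in text.split(","):
        key, sep, raw = part.strip().partition(":")
        if not sep or key not in ("rows", "interval"):
            raise ValueError(f"unrecognized flush policy part: {part!r}")
        if key in values:
            raise ValueError(f"duplicate flush policy key: {key!r}")
        number = int(raw)  # raises ValueError on non-numeric input
        if number <= 0:
            raise ValueError(f"{key} must be positive, got {number}")
        values[key] = number
    return FlushPolicy(**values)


def should_flush(policy: FlushPolicy, buffered_rows: int,
                 seconds_since_flush: float) -> bool:
    """Return True as soon as either configured threshold is reached."""
    rows_hit = policy.rows is not None and buffered_rows >= policy.rows
    time_hit = (policy.interval is not None
                and seconds_since_flush >= policy.interval)
    return rows_hit or time_hit
```

With `rows:1000,interval:60`, a table that receives 1000 rows in ten seconds flushes on the row threshold, while a quiet table with a handful of rows still flushes once 60 seconds have passed.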
How Flushing Works (Engine Path)
The flush flow is job-driven and designed to be crash-safe:
- DML writes land in RocksDB first (hot tier).
- Table providers mark the manifest cache entry as `pending_write`.
- `STORAGE FLUSH TABLE` or `STORAGE FLUSH ALL` creates background flush jobs (`system.jobs`).
- The flush executor performs the actual migration in leader phase (cluster mode).
- For `USER` and `SHARED` tables, the flush job scans hot rows, keeps latest versions, and filters tombstones from cold output.
- Parquet is written to a temp object (`batch-N.parquet.tmp`) and then atomically renamed to `batch-N.parquet`.
- Manifest metadata is updated (segment stats, sequence range, schema version), then persisted.
- Flushed hot rows are removed from RocksDB and partition compaction runs to reclaim space.
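
The "keep latest versions, filter tombstones" step in the list above can be sketched in Python. The row shape is an assumption: `HotRow`, its `seq` write-sequence field, and its `deleted` tombstone flag are hypothetical and only stand in for whatever versioning metadata the engine actually keeps.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, List


@dataclass(frozen=True)
class HotRow:
    key: int           # primary key value
    seq: int           # assumed: higher value means a newer version
    deleted: bool      # assumed: True marks a tombstone (delete marker)
    value: Any = None  # row payload, irrelevant to the selection logic


def rows_for_cold_output(hot_rows: Iterable[HotRow]) -> List[HotRow]:
    """Pick the newest version per key, then drop keys whose newest
    version is a tombstone. Result is ordered by primary key."""
    latest: Dict[int, HotRow] = {}
    for row in hot_rows:
        current = latest.get(row.key)
        if current is None or row.seq > current.seq:
            latest[row.key] = row
    survivors = [row for row in latest.values() if not row.deleted]
    return sorted(survivors, key=lambda row: row.key)
```

Filtering tombstones only after choosing the newest version matters: a key that was deleted and never rewritten must not resurface through one of its older, non-deleted versions.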
Notes:
- `STREAM` tables are not part of this flush path.
- `STORAGE FLUSH` is asynchronous; monitor status via `system.jobs`.
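
The temp-then-rename step of the flush path can be illustrated for a local filesystem. This is a minimal sketch, not KalamDB's implementation: `write_segment_atomically` is a hypothetical helper, and the payload is arbitrary bytes standing in for Parquet output. Object stores such as S3 have no rename primitive, so a remote backend would presumably need a different commit mechanism.

```python
import os
from pathlib import Path


def write_segment_atomically(directory: Path, index: int, payload: bytes) -> Path:
    """Write payload to batch-<index>.parquet.tmp, then rename it into place.

    Readers see either no batch-<index>.parquet at all or the complete
    file, never a partially written one.
    """
    final_path = directory / f"batch-{index}.parquet"
    temp_path = directory / f"batch-{index}.parquet.tmp"

    with open(temp_path, "wb") as handle:
        handle.write(payload)
        handle.flush()
        os.fsync(handle.fileno())  # push bytes to disk before the rename

    # os.replace is atomic when source and target are on the same filesystem.
    os.replace(temp_path, final_path)
    return final_path
```

If the process crashes before `os.replace`, only a leftover `.tmp` file remains, which a later run can overwrite or clean up without affecting committed segments.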
Flush State Machine
Manual Flush & Compaction

```sql
-- Flush a specific table
STORAGE FLUSH TABLE myapp.messages;

-- Flush all tables in a namespace
STORAGE FLUSH ALL IN myapp;

-- Compact cold storage
STORAGE COMPACT TABLE myapp.messages;

-- Check storage health
STORAGE CHECK local EXTENDED;
```

S3 User-Data Example
For a MinIO-compatible local S3 setup, see MinIO (S3-Compatible).
Create an S3 storage and bind a `USER` table to it:

```sql
CREATE STORAGE s3_prod
  TYPE 's3'
  BUCKET 'my-kalamdb-prod'
  REGION 'us-east-1'
  USER_TABLES_TEMPLATE 'users/{namespace}/{tableName}/{userId}'
  SHARED_TABLES_TEMPLATE 'shared/{namespace}/{tableName}';

CREATE TABLE chat.messages (
  id BIGINT PRIMARY KEY DEFAULT SNOWFLAKE_ID(),
  conversation_id BIGINT NOT NULL,
  content TEXT NOT NULL,
  created_at TIMESTAMP DEFAULT NOW()
) WITH (
  TYPE = 'USER',
  STORAGE_ID = 's3_prod',
  FLUSH_POLICY = 'rows:1000,interval:60'
);
```

For user `u_42`, cold-tier objects are stored under the user template:

```
s3://my-kalamdb-prod/users/chat/messages/u_42/manifest.json
s3://my-kalamdb-prod/users/chat/messages/u_42/batch-0.parquet
s3://my-kalamdb-prod/users/chat/messages/u_42/batch-1.parquet
```

Illustrative manifest excerpt:

```json
{
  "segments": [
    {
      "id": "batch-1.parquet",
      "path": "batch-1.parquet",
      "row_count": 1000,
      "size_bytes": 184320,
      "schema_version": 1,
      "status": "committed"
    }
  ],
  "last_sequence_number": 1
}
```

Per-User Storage Isolation

```
data/storage/
├── <namespace>/<tableName>/manifest.json            # shared table path template
├── <namespace>/<tableName>/batch-<index>.parquet
└── <namespace>/<tableName>/<userId>/manifest.json   # user table path template
```

Path layout is controlled by:
- `storage.shared_tables_template` (default: `{namespace}/{tableName}`)
- `storage.user_tables_template` (default: `{namespace}/{tableName}/{userId}`)
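
To make the templates concrete, the sketch below expands them with Python's `str.format`. `render_storage_path` is a hypothetical helper, and the check that rejects empty values, `/`, `.`, and `..` is an assumption added so that one field cannot escape into another directory. How KalamDB itself substitutes and validates placeholders is not described here.

```python
def render_storage_path(template: str, **fields: str) -> str:
    """Expand {namespace}, {tableName}, {userId} style placeholders.

    Raises KeyError if the template names a field that was not supplied.
    """
    for name, value in fields.items():
        # Assumed safety rule: a single field must be one plain path segment.
        if not value or "/" in value or value in (".", ".."):
            raise ValueError(f"invalid value for {name}: {value!r}")
    return template.format(**fields)


# Default user-table template, using names from the S3 example.
user_path = render_storage_path(
    "{namespace}/{tableName}/{userId}",
    namespace="chat", tableName="messages", userId="u_42",
)

# Default shared-table template has no per-user segment.
shared_path = render_storage_path(
    "{namespace}/{tableName}",
    namespace="chat", tableName="messages",
)
```

Here `user_path` is `chat/messages/u_42` and `shared_path` is `chat/messages`.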
Each user’s cold-tier data can still live in a separate directory. This enables:
- Trivial data export — just copy the user’s directory
- Instant deletion — remove the directory for GDPR compliance
- Independent scaling — no cross-user interference
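
For the local backend, export and deletion reduce to ordinary directory operations. The sketch below assumes the default user template (`{namespace}/{tableName}/{userId}`) under a storage root such as `data/storage`. The helper names are hypothetical, and the sketch touches only cold-tier files: rows still buffered in the hot tier would need a flush first, and any cleanup KalamDB performs on its own metadata is outside its scope.

```python
import shutil
from pathlib import Path


def user_dir(storage_root: Path, namespace: str, table: str, user_id: str) -> Path:
    """Directory holding one user's cold-tier files (default template assumed)."""
    return storage_root / namespace / table / user_id


def export_user_data(storage_root: Path, namespace: str, table: str,
                     user_id: str, destination: Path) -> Path:
    """Copy the user's manifest and Parquet segments to a new directory."""
    source = user_dir(storage_root, namespace, table, user_id)
    shutil.copytree(source, destination)  # destination must not exist yet
    return destination


def delete_user_data(storage_root: Path, namespace: str, table: str,
                     user_id: str) -> bool:
    """Remove the user's cold-tier directory. Returns False if it is absent."""
    target = user_dir(storage_root, namespace, table, user_id)
    if not target.is_dir():
        return False
    shutil.rmtree(target)
    return True
```

Because each user has a separate directory, neither operation reads or rewrites files belonging to other users of the same table.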