
Storage Tiers

KalamDB uses a dual-tier storage architecture that keeps the write path hot in RocksDB while moving queryable historical state into immutable Parquet segments.

This page describes how USER and SHARED tables move from hot to cold storage. For the table-type rules that decide whether a table can flush at all, see /docs/server/architecture/table-types.

Hot Tier (RocksDB)

The hot tier handles incoming writes in RocksDB-backed table stores.

Characteristics:

  • ⚡ Sub-millisecond write latency
  • Optimized for point lookups, recent writes, and write-heavy workloads
  • Holds the newest live rows and the newest delete tombstones
  • Feeds the flush scheduler through manifest scopes marked pending_write

Cold Tier (Parquet)

Flushed data is written to Apache Parquet files for efficient scans, pruning, and long-term storage.

Characteristics:

  • 📊 Columnar format for efficient analytics
  • High compression ratios
  • Each segment is tracked in manifest.json and in a hot manifest copy
  • Supports multiple storage backends (local, S3, Azure, GCS)

Parquet Compression

Table COMPRESSION is a cold-tier Parquet setting for USER and SHARED tables. KalamDB currently supports three values:

| Value | Meaning | Typical use |
| --- | --- | --- |
| none | Writes uncompressed Parquet pages. | Debugging, CPU-constrained flushes, or data already compressed at the value level |
| snappy | Fast Parquet compression and decompression with low CPU overhead. This is the default. | General workloads and latency-sensitive flushes |
| zstd | Zstandard Parquet compression at level 1 for better storage density with modest extra CPU. | Larger historical datasets or storage-cost-sensitive tables |

The selected codec is applied when USER and SHARED table scopes write batch-N.parquet.tmp and when tail compaction writes compact-<uuid>.parquet.tmp. After the atomic rename, readers discover the codec from the Parquet footer, so no manifest-side decompression setting is needed. This setting does not affect RocksDB hot storage, WebSocket gzip, backups, or stream table log files. STREAM tables do not accept table Parquet compression because current stream storage does not flush to Parquet cold segments.
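The rules above can be summarized as a small validation routine. This is a minimal sketch, not KalamDB's actual code: the function name `resolve_compression`, the constants, and the case-insensitive matching are assumptions made for illustration. Only the three codec names, the `snappy` default, and the USER/SHARED restriction come from this page.

```python
from typing import Optional

# Values taken from the table above; the names of these constants are hypothetical.
VALID_CODECS = {"none", "snappy", "zstd"}
DEFAULT_CODEC = "snappy"
PARQUET_TABLE_TYPES = {"USER", "SHARED"}


def resolve_compression(table_type: str, requested: Optional[str] = None) -> Optional[str]:
    """Return the Parquet codec for a table, or None if the table never writes Parquet."""
    table_type = table_type.upper()
    if table_type not in PARQUET_TABLE_TYPES:
        # STREAM tables do not flush to Parquet, so a codec makes no sense for them.
        if requested is not None:
            raise ValueError(f"{table_type} tables do not accept COMPRESSION")
        return None
    if requested is None:
        return DEFAULT_CODEC
    codec = requested.lower()  # assumption: matching is case-insensitive
    if codec not in VALID_CODECS:
        raise ValueError(f"unsupported COMPRESSION value: {requested!r}")
    return codec
```

Because readers take the codec from each Parquet footer, a routine like this only matters on the write side.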

End-to-End Tier Flow

Flush Policy

Tables are configured with a flush policy that determines when data moves from hot to cold tier:

SQL
CREATE TABLE app.messages (
  id BIGINT PRIMARY KEY DEFAULT SNOWFLAKE_ID(),
  content TEXT NOT NULL,
  created_at TIMESTAMP DEFAULT NOW()
) WITH (
  TYPE = 'USER',
  FLUSH_POLICY = 'rows:1000,interval:60'
);

| Policy | Description |
| --- | --- |
| rows:N | Flush after N rows have accumulated |
| interval:N | Flush every N seconds |
| rows:N,interval:N | Flush on whichever threshold is hit first |
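To make the "whichever threshold is hit first" rule concrete, here is a minimal sketch of parsing a policy string and evaluating it. The `FlushPolicy` class, `parse_flush_policy`, and the error handling are hypothetical and do not describe KalamDB's internal parser. Only the `rows:N` and `interval:N` syntax comes from the table above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FlushPolicy:
    rows: Optional[int] = None           # flush once this many rows are pending
    interval_secs: Optional[int] = None  # flush once this many seconds have passed

    def should_flush(self, pending_rows: int, secs_since_last_flush: float) -> bool:
        """True as soon as either configured threshold is reached."""
        hit_rows = self.rows is not None and pending_rows >= self.rows
        hit_interval = (
            self.interval_secs is not None
            and secs_since_last_flush >= self.interval_secs
        )
        return hit_rows or hit_interval


def parse_flush_policy(text: str) -> FlushPolicy:
    """Parse strings such as 'rows:1000', 'interval:60', or 'rows:1000,interval:60'."""
    policy = FlushPolicy()
    for part in text.split(","):
        key, sep, value = part.strip().partition(":")
        if not sep or not value.isdecimal() or int(value) <= 0:
            raise ValueError(f"invalid flush policy term: {part!r}")
        if key == "rows":
            policy.rows = int(value)
        elif key == "interval":
            policy.interval_secs = int(value)
        else:
            raise ValueError(f"unknown flush policy key: {key!r}")
    return policy
```

With `'rows:1000,interval:60'`, a scope holding 10 pending rows still becomes eligible after 60 seconds, and a scope that reaches 1000 rows becomes eligible immediately.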

How Flushing Works (Current Engine Path)

The flush flow is job-driven and crash-safe:

  1. DML writes land in RocksDB first.
  2. The manifest scope is marked pending_write.
  3. STORAGE FLUSH TABLE or STORAGE FLUSH ALL creates background flush jobs in system.jobs.
  4. The flush executor resolves the target scope and scans hot rows in bounded batches.
  5. The latest _seq wins per primary key. Latest tombstones stay hot instead of being written to Parquet.
  6. FlushScopeWriter writes batch-N.parquet.tmp.
  7. The temp file is atomically renamed to batch-N.parquet.
  8. Manifest metadata for the new segment is persisted through the manifest service.
  9. Flushed hot rows are deleted from RocksDB in bounded cleanup batches.
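Steps 5 through 7 can be sketched as follows. This is an illustration under stated assumptions rather than the engine's code: the row fields `pk` and `deleted` are hypothetical names, and JSON stands in for real Parquet encoding so the example runs with only the standard library. The `_seq` field, the `batch-N.parquet.tmp` naming, and the rename step come from the list above.

```python
import json
import os


def latest_versions(rows):
    """Keep only the row version with the highest _seq for each primary key."""
    winners = {}
    for row in rows:
        current = winners.get(row["pk"])
        if current is None or row["_seq"] > current["_seq"]:
            winners[row["pk"]] = row
    return winners


def split_flushable(winners):
    """Live winners go to cold storage; winning tombstones stay in the hot tier."""
    to_cold = [r for r in winners.values() if not r["deleted"]]
    stay_hot = [r for r in winners.values() if r["deleted"]]
    return to_cold, stay_hot


def write_segment(directory, index, rows):
    """Write batch-<index>.parquet.tmp, sync it, then rename it into place."""
    final_path = os.path.join(directory, f"batch-{index}.parquet")
    tmp_path = final_path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump(rows, f)  # stand-in for Parquet encoding (assumption)
        f.flush()
        os.fsync(f.fileno())
    # os.replace is atomic on POSIX when both paths share a filesystem,
    # so readers see either no segment or a complete one.
    os.replace(tmp_path, final_path)
    return final_path
```

Writing to a temporary name first means a crash mid-write leaves only a `.tmp` file behind, which never appears in the manifest.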

Notes:

  • STREAM and SYSTEM tables are not part of this flush-to-Parquet path.
  • STORAGE FLUSH is asynchronous; monitor progress via system.jobs.
  • Shared scopes flush once per table scope. User scopes flush once per (table, user) scope.

Flush State Machine

Post-Flush Small-Segment Compaction

KalamDB has an optional post-flush tail-compaction path, configured under [flush.compaction]:

TOML
[flush.compaction]
enabled = false
min_eligible_segments = 5
max_segments_per_run = 8
user_max_segment_rows = 10000
shared_max_segment_rows = 25000

This path is separate from the flush critical section. After a successful flush, the leader may enqueue a segment_compact job for the scopes that actually wrote Parquet files.

The compactor:

  1. walks the manifest from newest to oldest
  2. selects only the trailing run of small, committed, same-schema segments
  3. stops when it hits an already-large segment, a non-readable segment, or a schema-version boundary
  4. rewrites only the latest MVCC winners into compact-<uuid>.parquet
  5. swaps the manifest tail only if the selected suffix is still unchanged

If a newer flush changed the manifest tail while compaction was running, the compacted output is discarded and the existing manifest remains authoritative.
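The selection and swap rules can be sketched as below. This is a simplified model, not the actual compactor. It assumes that "small" means `row_count` at or below the configured maximum, that a non-readable segment is any segment whose `status` is not `committed`, and that segments are plain dictionaries ordered oldest to newest. The function names are hypothetical.

```python
def select_tail_run(segments, max_segment_rows, min_eligible, max_per_run):
    """Return the trailing run of small, committed, same-schema segments.

    `segments` is ordered oldest to newest. The result keeps that order and
    is empty when fewer than `min_eligible` segments qualify.
    """
    run = []
    schema = None
    for seg in reversed(segments):  # walk newest to oldest
        if len(run) == max_per_run:
            break
        if seg["status"] != "committed":  # assumption: models "non-readable"
            break
        if seg["row_count"] > max_segment_rows:  # already-large segment
            break
        if schema is not None and seg["schema_version"] != schema:
            break  # schema-version boundary
        schema = seg["schema_version"]
        run.append(seg)
    run.reverse()
    return run if len(run) >= min_eligible else []


def swap_tail(manifest_segments, selected, compacted):
    """Replace the selected suffix with one compacted segment, if still unchanged."""
    n = len(selected)
    if n == 0 or manifest_segments[-n:] != selected:
        return False  # a newer flush changed the tail; discard the output
    manifest_segments[-n:] = [compacted]
    return True
```

The compare-then-swap check is what lets compaction run outside the flush critical section: a concurrent flush never waits, and a stale compaction result is simply dropped.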

Only USER and SHARED tables participate in this mechanism.

Manual Flush & Compaction

SQL
-- Flush a specific table
STORAGE FLUSH TABLE myapp.messages;

-- Flush all tables in a namespace
STORAGE FLUSH ALL IN myapp;

-- Compact cold storage
STORAGE COMPACT TABLE myapp.messages;

-- Check storage health
STORAGE CHECK local EXTENDED;

S3 User-Data Example

For MinIO-compatible local S3 setup, see /docs/server/integrations/minio.

Create an S3 storage and bind a USER table to it:

SQL
CREATE STORAGE s3_prod
  TYPE 's3'
  BUCKET 'my-kalamdb-prod'
  REGION 'us-east-1'
  USER_TABLES_TEMPLATE 'users/{namespace}/{tableName}/{userId}'
  SHARED_TABLES_TEMPLATE 'shared/{namespace}/{tableName}';

CREATE TABLE chat.messages (
  id BIGINT PRIMARY KEY DEFAULT SNOWFLAKE_ID(),
  conversation_id BIGINT NOT NULL,
  content TEXT NOT NULL,
  created_at TIMESTAMP DEFAULT NOW()
) WITH (
  TYPE = 'USER',
  STORAGE_ID = 's3_prod',
  FLUSH_POLICY = 'rows:1000,interval:60'
);

For user u_42, cold-tier objects are stored under the user template:

TEXT
s3://my-kalamdb-prod/users/chat/messages/u_42/manifest.json
s3://my-kalamdb-prod/users/chat/messages/u_42/batch-0.parquet
s3://my-kalamdb-prod/users/chat/messages/u_42/batch-1.parquet

Illustrative manifest excerpt:

JSON
{
  "segments": [
    {
      "id": "batch-1.parquet",
      "path": "batch-1.parquet",
      "row_count": 1000,
      "size_bytes": 184320,
      "schema_version": 1,
      "status": "committed"
    }
  ],
  "last_sequence_number": 1
}

Per-User Storage Isolation

TEXT
data/storage/
├── <namespace>/<tableName>/manifest.json                   # shared table path template
├── <namespace>/<tableName>/batch-<index>.parquet
└── <namespace>/<tableName>/<userId>/manifest.json          # user table path template

Path layout is controlled by:

  • storage.shared_tables_template (default: {namespace}/{tableName})
  • storage.user_tables_template (default: {namespace}/{tableName}/{userId})
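As an illustration of how such a template resolves to a concrete path, the sketch below substitutes `{placeholder}` names with values. The `render_template` helper is hypothetical and says nothing about how KalamDB validates or escapes these values. The placeholder names and templates are the ones shown on this page.

```python
import re


def render_template(template: str, **values: str) -> str:
    """Replace each {name} placeholder with the matching keyword argument."""
    def substitute(match):
        name = match.group(1)
        if name not in values:
            raise KeyError(f"no value supplied for placeholder {{{name}}}")
        return str(values[name])

    return re.sub(r"\{(\w+)\}", substitute, template)


# Default user template, then the S3 example's custom user template.
default_path = render_template(
    "{namespace}/{tableName}/{userId}",
    namespace="chat", tableName="messages", userId="u_42",
)
s3_path = render_template(
    "users/{namespace}/{tableName}/{userId}",
    namespace="chat", tableName="messages", userId="u_42",
)
```

Here `default_path` is `chat/messages/u_42` and `s3_path` is `users/chat/messages/u_42`, which matches the object prefix in the S3 example.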

With the default user template, each user’s cold-tier data lives in its own directory. This enables:

  • Trivial data export — just copy the user’s directory
  • Instant deletion — remove the directory for GDPR compliance
  • Independent scaling — no cross-user interference

For the manifest-layer details behind this flow, see /docs/server/architecture/manifests.
