
Storage Tiers

KalamDB uses a dual-tier storage architecture that balances write speed with query efficiency and long-term retention.

Hot Tier (RocksDB)

The hot tier handles all incoming writes with sub-millisecond latency using RocksDB column families.

Characteristics:

  • ⚡ Sub-millisecond write latency
  • Organized as column families per table
  • Optimized for point lookups and recent data
  • Data is buffered here before flushing to cold tier

Cold Tier (Parquet)

Flushed data is written to Apache Parquet files for efficient analytical queries and long-term storage.

Characteristics:

  • 📊 Columnar format for efficient analytics
  • High compression ratios
  • Each segment tracked in manifest.json
  • Supports multiple storage backends (local, S3, Azure, GCS)

End-to-End Tier Flow

Flush Policy

Tables are configured with a flush policy that determines when data moves from hot to cold tier:

```sql
CREATE TABLE app.messages (
  id BIGINT PRIMARY KEY DEFAULT SNOWFLAKE_ID(),
  content TEXT NOT NULL,
  created_at TIMESTAMP DEFAULT NOW()
) WITH (
  TYPE = 'USER',
  FLUSH_POLICY = 'rows:1000,interval:60'
);
```

| Policy | Description |
| --- | --- |
| `rows:N` | Flush after N rows have accumulated |
| `interval:N` | Flush every N seconds |
| `rows:N,interval:N` | Flush on whichever threshold is hit first |
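The combined `rows:N,interval:N` trigger can be sketched as follows. This is an illustrative model only (the `FlushPolicy` class is hypothetical, not KalamDB's engine code): a flush becomes due as soon as either the row count or the elapsed time crosses its threshold.

```python
import time

class FlushPolicy:
    """Illustrative sketch of the rows:N,interval:N flush trigger."""

    def __init__(self, policy: str):
        # Parse e.g. "rows:1000,interval:60" into thresholds.
        self.max_rows = None
        self.max_seconds = None
        for part in policy.split(","):
            kind, value = part.split(":")
            if kind == "rows":
                self.max_rows = int(value)
            elif kind == "interval":
                self.max_seconds = int(value)
        self.pending_rows = 0
        self.last_flush = time.monotonic()

    def record_write(self, rows: int = 1) -> bool:
        """Return True when either threshold is hit, whichever comes first."""
        self.pending_rows += rows
        due_by_rows = self.max_rows is not None and self.pending_rows >= self.max_rows
        due_by_time = (self.max_seconds is not None
                       and time.monotonic() - self.last_flush >= self.max_seconds)
        return due_by_rows or due_by_time

    def reset(self) -> None:
        """Called after a flush completes."""
        self.pending_rows = 0
        self.last_flush = time.monotonic()

policy = FlushPolicy("rows:1000,interval:60")
for _ in range(999):
    assert not policy.record_write()
assert policy.record_write()  # the 1000th row trips the rows threshold
```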

How Flushing Works (Engine Path)

The flush flow is job-driven and designed to be crash-safe:

  1. DML writes land in RocksDB first (hot tier).
  2. Table providers mark the manifest cache entry as pending_write.
  3. STORAGE FLUSH TABLE or STORAGE FLUSH ALL creates background flush jobs (system.jobs).
  4. The flush executor performs the actual migration in leader phase (cluster mode).
  5. For USER and SHARED tables, the flush job scans hot rows, keeps latest versions, and filters tombstones from cold output.
  6. Parquet is written to a temp object (batch-N.parquet.tmp) and then atomically renamed to batch-N.parquet.
  7. Manifest metadata is updated (segment stats, sequence range, schema version), then persisted.
  8. Flushed hot rows are removed from RocksDB and partition compaction runs to reclaim space.
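Steps 6 and 7 above rely on the classic write-to-temp-then-rename pattern for crash safety. A minimal sketch (the `flush_segment` helper is hypothetical; filenames and manifest fields follow this page):

```python
import json
import os

def flush_segment(table_dir: str, batch_index: int, parquet_bytes: bytes,
                  row_count: int, schema_version: int) -> str:
    """Crash-safe segment flush: write batch-N.parquet.tmp, fsync,
    atomically rename to batch-N.parquet, then commit the manifest."""
    final_name = f"batch-{batch_index}.parquet"
    tmp_path = os.path.join(table_dir, final_name + ".tmp")
    final_path = os.path.join(table_dir, final_name)

    # 1. Write to a temp object; a crash here leaves only a .tmp file behind.
    with open(tmp_path, "wb") as f:
        f.write(parquet_bytes)
        f.flush()
        os.fsync(f.fileno())

    # 2. Atomic rename: readers never observe a partially written segment.
    os.replace(tmp_path, final_path)

    # 3. Update manifest metadata only after the segment is durable.
    manifest_path = os.path.join(table_dir, "manifest.json")
    manifest = {"segments": []}
    if os.path.exists(manifest_path):
        with open(manifest_path) as f:
            manifest = json.load(f)
    manifest["segments"].append({
        "id": final_name,
        "path": final_name,
        "row_count": row_count,
        "size_bytes": len(parquet_bytes),
        "schema_version": schema_version,
        "status": "committed",
    })
    with open(manifest_path, "w") as f:
        json.dump(manifest, f)
    return final_path
```

If the process crashes before the rename, only an orphaned `.tmp` object remains and the manifest never references the incomplete segment.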

Notes:

  • STREAM tables are not part of this flush path.
  • STORAGE FLUSH is asynchronous; monitor status via system.jobs.

Flush State Machine

Manual Flush & Compaction

```sql
-- Flush a specific table
STORAGE FLUSH TABLE myapp.messages;

-- Flush all tables in a namespace
STORAGE FLUSH ALL IN myapp;

-- Compact cold storage
STORAGE COMPACT TABLE myapp.messages;

-- Check storage health
STORAGE CHECK local EXTENDED;
```

S3 User-Data Example

Create an S3 storage and bind a USER table to it:

For MinIO-compatible local S3 setup, see MinIO (S3-Compatible).

```sql
CREATE STORAGE s3_prod
  TYPE 's3'
  BUCKET 'my-kalamdb-prod'
  REGION 'us-east-1'
  USER_TABLES_TEMPLATE 'users/{namespace}/{tableName}/{userId}'
  SHARED_TABLES_TEMPLATE 'shared/{namespace}/{tableName}';

CREATE TABLE chat.messages (
  id BIGINT PRIMARY KEY DEFAULT SNOWFLAKE_ID(),
  conversation_id BIGINT NOT NULL,
  content TEXT NOT NULL,
  created_at TIMESTAMP DEFAULT NOW()
) WITH (
  TYPE = 'USER',
  STORAGE_ID = 's3_prod',
  FLUSH_POLICY = 'rows:1000,interval:60'
);
```

For user u_42, cold-tier objects are stored under the user template:

```
s3://my-kalamdb-prod/users/chat/messages/u_42/manifest.json
s3://my-kalamdb-prod/users/chat/messages/u_42/batch-0.parquet
s3://my-kalamdb-prod/users/chat/messages/u_42/batch-1.parquet
```

Illustrative manifest excerpt:

```json
{
  "segments": [
    {
      "id": "batch-1.parquet",
      "path": "batch-1.parquet",
      "row_count": 1000,
      "size_bytes": 184320,
      "schema_version": 1,
      "status": "committed"
    }
  ],
  "last_sequence_number": 1
}
```
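Tools can read this manifest directly. A small sketch that loads the excerpt above and totals the rows in committed segments (the assumption that only `"committed"` segments are query-visible is ours, not stated by the engine):

```python
import json

# The manifest excerpt from this page, verbatim.
manifest_json = """
{ "segments": [ { "id": "batch-1.parquet", "path": "batch-1.parquet",
  "row_count": 1000, "size_bytes": 184320, "schema_version": 1,
  "status": "committed" } ], "last_sequence_number": 1 }
"""

manifest = json.loads(manifest_json)
# Assumption: only committed segments count toward visible cold-tier data.
committed = [s for s in manifest["segments"] if s["status"] == "committed"]
total_rows = sum(s["row_count"] for s in committed)
print(total_rows)  # → 1000
```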

Per-User Storage Isolation

```
data/storage/
├── <namespace>/<tableName>/manifest.json            # shared table path template
├── <namespace>/<tableName>/batch-<index>.parquet
└── <namespace>/<tableName>/<userId>/manifest.json   # user table path template
```

Path layout is controlled by:

  • storage.shared_tables_template (default: {namespace}/{tableName})
  • storage.user_tables_template (default: {namespace}/{tableName}/{userId})
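These templates use simple placeholder substitution. A sketch of how the default templates expand (the `render_path` helper is hypothetical; the placeholder names are from this page):

```python
# Default templates from the storage configuration.
SHARED_TABLES_TEMPLATE = "{namespace}/{tableName}"
USER_TABLES_TEMPLATE = "{namespace}/{tableName}/{userId}"

def render_path(template: str, **params: str) -> str:
    """Expand a path template; raises KeyError if a placeholder is missing."""
    return template.format(**params)

print(render_path(SHARED_TABLES_TEMPLATE, namespace="chat", tableName="messages"))
# → chat/messages
print(render_path(USER_TABLES_TEMPLATE,
                  namespace="chat", tableName="messages", userId="u_42"))
# → chat/messages/u_42
```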

Because the default user template ends in {userId}, each user’s cold-tier data lives in its own directory. This enables:

  • Trivial data export — just copy the user’s directory
  • Instant deletion — remove the directory for GDPR compliance
  • Independent scaling — no cross-user interference
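With per-user directories, export and erasure reduce to filesystem operations. A minimal sketch under the default user template (the helper functions are hypothetical, shown for local storage only):

```python
import os
import shutil

def user_dir(root: str, namespace: str, table: str, user_id: str) -> str:
    # Default user_tables_template: {namespace}/{tableName}/{userId}
    return os.path.join(root, namespace, table, user_id)

def export_user(root: str, namespace: str, table: str, user_id: str, dest: str) -> None:
    """Export = copy the user's directory (segments plus manifest.json)."""
    shutil.copytree(user_dir(root, namespace, table, user_id), dest)

def delete_user(root: str, namespace: str, table: str, user_id: str) -> None:
    """Erasure = remove the user's directory; no other user's data is touched."""
    shutil.rmtree(user_dir(root, namespace, table, user_id))
```

On object stores such as S3, the equivalent is copying or deleting every object under the user's key prefix.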