
Storage Tiers

KalamDB uses a dual-tier storage architecture that balances write speed with query efficiency and long-term retention.

Hot Tier (RocksDB)

The hot tier handles all incoming writes with sub-millisecond latency using RocksDB column families.

Characteristics:

  • ⚡ Sub-millisecond write latency
  • Organized as column families per table
  • Optimized for point lookups and recent data
  • Data is buffered here before flushing to cold tier

Cold Tier (Parquet)

Flushed data is written to Apache Parquet files for efficient analytical queries and long-term storage.

Characteristics:

  • 📊 Columnar format for efficient analytics
  • High compression ratios
  • Each segment tracked in manifest.json
  • Supports multiple storage backends (local, S3, Azure, GCS)

End-to-End Tier Flow

Flush Policy

Tables are configured with a flush policy that determines when data moves from hot to cold tier:

```sql
CREATE TABLE app.messages (
  id BIGINT PRIMARY KEY DEFAULT SNOWFLAKE_ID(),
  content TEXT NOT NULL,
  created_at TIMESTAMP DEFAULT NOW()
) WITH (
  TYPE = 'USER',
  FLUSH_POLICY = 'rows:1000,interval:60'
);
```

| Policy | Description |
| --- | --- |
| `rows:N` | Flush after N rows have accumulated |
| `interval:N` | Flush every N seconds |
| `rows:N,interval:N` | Flush on whichever threshold is hit first |
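The combined `rows:N,interval:N` trigger can be sketched as follows. This is an illustrative model only (the `FlushPolicy` class is hypothetical, not KalamDB's engine code): a flush becomes due as soon as either the row count or the elapsed time crosses its threshold.

```python
import time

class FlushPolicy:
    """Illustrative sketch of the rows:N,interval:N flush trigger."""

    def __init__(self, policy: str):
        # Parse e.g. "rows:1000,interval:60" into thresholds.
        self.max_rows = None
        self.max_seconds = None
        for part in policy.split(","):
            kind, value = part.split(":")
            if kind == "rows":
                self.max_rows = int(value)
            elif kind == "interval":
                self.max_seconds = int(value)
        self.pending_rows = 0
        self.last_flush = time.monotonic()

    def record_write(self, rows: int = 1) -> bool:
        """Return True when either threshold is hit, whichever comes first."""
        self.pending_rows += rows
        due_by_rows = self.max_rows is not None and self.pending_rows >= self.max_rows
        due_by_time = (self.max_seconds is not None
                       and time.monotonic() - self.last_flush >= self.max_seconds)
        return due_by_rows or due_by_time

    def reset(self) -> None:
        """Called after a flush completes."""
        self.pending_rows = 0
        self.last_flush = time.monotonic()

policy = FlushPolicy("rows:1000,interval:60")
for _ in range(999):
    assert not policy.record_write()
assert policy.record_write()  # the 1000th row trips the rows threshold
```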

How Flushing Works (Engine Path)

The flush flow is job-driven and designed to be crash-safe:

  1. DML writes land in RocksDB first (hot tier).
  2. Table providers mark the manifest cache entry as pending_write.
  3. STORAGE FLUSH TABLE or STORAGE FLUSH ALL creates background flush jobs (system.jobs).
  4. The flush executor performs the actual migration in leader phase (cluster mode).
  5. For USER and SHARED tables, the flush job scans hot rows, keeps latest versions, and filters tombstones from cold output.
  6. Parquet is written to a temp object (batch-N.parquet.tmp) and then atomically renamed to batch-N.parquet.
  7. Manifest metadata is updated (segment stats, sequence range, schema version), then persisted.
  8. Flushed hot rows are removed from RocksDB and partition compaction runs to reclaim space.
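Steps 6 and 7 above rely on the classic write-to-temp-then-rename pattern for crash safety. A minimal sketch (the `flush_segment` helper is hypothetical; filenames and manifest fields follow this page):

```python
import json
import os

def flush_segment(table_dir: str, batch_index: int, parquet_bytes: bytes,
                  row_count: int, schema_version: int) -> str:
    """Crash-safe segment flush: write batch-N.parquet.tmp, fsync,
    atomically rename to batch-N.parquet, then commit the manifest."""
    final_name = f"batch-{batch_index}.parquet"
    tmp_path = os.path.join(table_dir, final_name + ".tmp")
    final_path = os.path.join(table_dir, final_name)

    # 1. Write to a temp object; a crash here leaves only a .tmp file behind.
    with open(tmp_path, "wb") as f:
        f.write(parquet_bytes)
        f.flush()
        os.fsync(f.fileno())

    # 2. Atomic rename: readers never observe a partially written segment.
    os.replace(tmp_path, final_path)

    # 3. Update manifest metadata only after the segment is durable.
    manifest_path = os.path.join(table_dir, "manifest.json")
    manifest = {"segments": []}
    if os.path.exists(manifest_path):
        with open(manifest_path) as f:
            manifest = json.load(f)
    manifest["segments"].append({
        "id": final_name,
        "path": final_name,
        "row_count": row_count,
        "size_bytes": len(parquet_bytes),
        "schema_version": schema_version,
        "status": "committed",
    })
    with open(manifest_path, "w") as f:
        json.dump(manifest, f)
    return final_path
```

If the process crashes before the rename, only an orphaned `.tmp` object remains and the manifest never references the incomplete segment.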

Notes:

  • STREAM tables are not part of this flush path.
  • STORAGE FLUSH is asynchronous; monitor status via system.jobs.

Flush State Machine

Manual Flush & Compaction

```sql
-- Flush a specific table
STORAGE FLUSH TABLE myapp.messages;

-- Flush all tables in a namespace
STORAGE FLUSH ALL IN myapp;

-- Compact cold storage
STORAGE COMPACT TABLE myapp.messages;

-- Check storage health
STORAGE CHECK local EXTENDED;
```

S3 User-Data Example

Create an S3 storage and bind a USER table to it:

For MinIO-compatible local S3 setup, see MinIO (S3-Compatible).

```sql
CREATE STORAGE s3_prod
  TYPE 's3'
  BUCKET 'my-kalamdb-prod'
  REGION 'us-east-1'
  USER_TABLES_TEMPLATE 'users/{namespace}/{tableName}/{userId}'
  SHARED_TABLES_TEMPLATE 'shared/{namespace}/{tableName}';

CREATE TABLE chat.messages (
  id BIGINT PRIMARY KEY DEFAULT SNOWFLAKE_ID(),
  conversation_id BIGINT NOT NULL,
  content TEXT NOT NULL,
  created_at TIMESTAMP DEFAULT NOW()
) WITH (
  TYPE = 'USER',
  STORAGE_ID = 's3_prod',
  FLUSH_POLICY = 'rows:1000,interval:60'
);
```

For user u_42, cold-tier objects are stored under the user template:

```
s3://my-kalamdb-prod/users/chat/messages/u_42/manifest.json
s3://my-kalamdb-prod/users/chat/messages/u_42/batch-0.parquet
s3://my-kalamdb-prod/users/chat/messages/u_42/batch-1.parquet
```

Illustrative manifest excerpt:

```json
{
  "segments": [
    {
      "id": "batch-1.parquet",
      "path": "batch-1.parquet",
      "row_count": 1000,
      "size_bytes": 184320,
      "schema_version": 1,
      "status": "committed"
    }
  ],
  "last_sequence_number": 1
}
```
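Tools can read this manifest directly. A small sketch that loads the excerpt above and totals the rows in committed segments (the assumption that only `"committed"` segments are query-visible is ours, not stated by the engine):

```python
import json

# The manifest excerpt from this page, verbatim.
manifest_json = """
{ "segments": [ { "id": "batch-1.parquet", "path": "batch-1.parquet",
  "row_count": 1000, "size_bytes": 184320, "schema_version": 1,
  "status": "committed" } ], "last_sequence_number": 1 }
"""

manifest = json.loads(manifest_json)
# Assumption: only committed segments count toward visible cold-tier data.
committed = [s for s in manifest["segments"] if s["status"] == "committed"]
total_rows = sum(s["row_count"] for s in committed)
print(total_rows)  # → 1000
```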

Per-User Storage Isolation

```
data/storage/
├── <namespace>/<tableName>/manifest.json            # shared table path template
├── <namespace>/<tableName>/batch-<index>.parquet
└── <namespace>/<tableName>/<userId>/manifest.json   # user table path template
```

Path layout is controlled by:

  • storage.shared_tables_template (default: {namespace}/{tableName})
  • storage.user_tables_template (default: {namespace}/{tableName}/{userId})
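These templates use simple placeholder substitution. A sketch of how the default templates expand (the `render_path` helper is hypothetical; the placeholder names are from this page):

```python
# Default templates from the storage configuration.
SHARED_TABLES_TEMPLATE = "{namespace}/{tableName}"
USER_TABLES_TEMPLATE = "{namespace}/{tableName}/{userId}"

def render_path(template: str, **params: str) -> str:
    """Expand a path template; raises KeyError if a placeholder is missing."""
    return template.format(**params)

print(render_path(SHARED_TABLES_TEMPLATE, namespace="chat", tableName="messages"))
# → chat/messages
print(render_path(USER_TABLES_TEMPLATE,
                  namespace="chat", tableName="messages", userId="u_42"))
# → chat/messages/u_42
```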

Because the default user template ends in {userId}, each user’s cold-tier data lives in its own directory. This enables:

  • Trivial data export — just copy the user’s directory
  • Instant deletion — remove the directory for GDPR compliance
  • Independent scaling — no cross-user interference
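With per-user directories, export and erasure reduce to filesystem operations. A minimal sketch under the default user template (the helper functions are hypothetical, shown for local storage only):

```python
import os
import shutil

def user_dir(root: str, namespace: str, table: str, user_id: str) -> str:
    # Default user_tables_template: {namespace}/{tableName}/{userId}
    return os.path.join(root, namespace, table, user_id)

def export_user(root: str, namespace: str, table: str, user_id: str, dest: str) -> None:
    """Export = copy the user's directory (segments plus manifest.json)."""
    shutil.copytree(user_dir(root, namespace, table, user_id), dest)

def delete_user(root: str, namespace: str, table: str, user_id: str) -> None:
    """Erasure = remove the user's directory; no other user's data is touched."""
    shutil.rmtree(user_dir(root, namespace, table, user_id))
```

On object stores such as S3, the equivalent is copying or deleting every object under the user's key prefix.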