Skip to Content
ArchitectureStream Storage

Stream Storage Architecture

KalamDB stream storage is the runtime path behind STREAM tables. It is optimized for short-lived, user-scoped rows such as presence, typing indicators, cursor positions, and transient events.

Unlike USER and SHARED tables, stream rows do not flush into Parquet or participate in the manifest-driven cold tier. The engine keeps stream data in append-only log files sized for fast TTL eviction.

For the SQL creation model and table-type rules, see /docs/server/architecture/table-types.

Stream Storage File Layout

With the default data directory, each stream table gets a table-scoped base path under:

text snippetTEXT
data/streams/<namespace>/<table>

Inside that base path, rows are written into TTL-sized window directories:

text snippetTEXT
data/streams/<namespace>/<table>/w<window_start_ms>-<duration_ms>/<shard>/<user_id>.log

Example:

text snippetTEXT
data/streams/chat/typing_events/├── w1748343000000-60000/│   ├── shard-0/alice.log│   └── shard-2/bob.log└── w1748343060000-60000/    └── shard-0/alice.log

The layout is intentionally simple:

  • one immutable time window directory per retention bucket
  • one shard subdirectory to distribute active writers
  • one <user_id>.log file per user inside that shard/window

KalamDB uses <user_id>.log directly instead of <user_id>/records.log because the validated user_id alphabet is already filename-safe, so the extra directory level would add metadata churn without adding useful structure.

Why the Window Name Encodes Time

Each window directory name encodes two values:

  • window_start_ms
  • duration_ms

That lets cleanup decide whether a whole window is expired without opening any log file.

If window_end <= ttl_cutoff, the entire directory can be removed in one operation. This avoids a row scan during normal retention and keeps cleanup proportional to the number of expired windows, not the number of rows.

Bucket Sizing From TTL_SECONDS

The bucket size is derived from the table’s TTL_SECONDS value:

TTL_SECONDSWindow size
<= 15 minutesminute windows
<= 1 dayhour windows
<= 1 weekday windows
<= 30 daysweek windows
> 30 daysmonth windows

Short-lived streams use minute windows so eviction stays close to the actual retention period. Longer-lived streams use coarser windows to avoid creating too many tiny files.

Write Path

When a row is inserted or deleted in a STREAM table:

  1. the effective user_id selects the per-user stream scope
  2. the row timestamp selects the current time window
  3. the user is routed into a shard directory
  4. the record is appended to <user_id>.log as a length-prefixed flexbuffers frame

The file-backed stream store keeps a bounded writer cache:

  • 64 KiB buffered writes per open segment
  • up to 256 cached segment writers per store
  • cold writers flushed and closed when the cache exceeds the cap

The stream log does not call fsync on every append. Stream tables are intended for ephemeral state, so the design favors low write latency and relies on the OS page cache for writeback.

Stream Storage Cleanup

TTL eviction is window-based, not row-by-row.

The stream eviction job computes a cutoff timestamp and then:

  1. scans window directory names under the table’s stream base path
  2. identifies windows whose computed window_end is older than the cutoff
  3. closes any cached writers under those directories
  4. removes the expired window directories

This is why the time-window layout matters: cleanup is mostly directory unlink work instead of file parsing or per-row filtering.

The pre-validation path also benefits from this layout. has_logs_before() can stop as soon as it finds one expired window directory, and the job executor runs that filesystem work on Tokio’s blocking pool so async workers are not pinned by directory traversal or removal.

Read Path

Reads remain user-scoped.

To serve a range or latest query, the stream store:

  1. computes the shard for the target user_id
  2. lists only that user’s candidate window files
  3. filters windows by overlap with the requested time range
  4. reads only those candidate log files
  5. applies delete records so latest state wins for each stream row ID

This keeps reads narrow even when many users share the same stream table.

What Stream Storage Is Not

Stream storage is deliberately separate from the Parquet cold tier used by USER and SHARED tables.

  • no Parquet flush path
  • no manifest-driven segment lifecycle
  • no long-term analytic storage guarantees
  • no file-column support

If you need durable historical records or analytics, use a USER or SHARED table and the cold storage flow documented in /docs/server/architecture/storage-tiers.

If you need short-lived realtime state that should expire automatically, STREAM storage is the intended path.

Last updated on