
Clustering & High Availability

KalamDB uses a Multi-Raft architecture to scale reads, writes, and subscriptions across nodes while keeping data strongly replicated per shard.

How the Cluster Is Structured

KalamDB does not run a single Raft group for everything. It runs multiple groups:

  • Meta: cluster/system metadata
  • DataUserShard(N): user-data shards
  • DataSharedShard(N): shared-data shards

Each group has its own leader and followers. This means leadership is distributed across the cluster, not centralized to a single node for all workloads.

Sharding is handled by a router that hashes user/table identifiers to shard groups, so users are naturally distributed across shards. This is how KalamDB scales tenant load and isolates shard-level hotspots.
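A minimal sketch of this hash-based routing, assuming a simple hash-modulo scheme (`shard_for` is a hypothetical helper; KalamDB's actual router may use a different hash or placement strategy):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Map a user/table identifier to one of `num_shards` Raft data groups.
/// Illustrative only: the real router may hash differently.
fn shard_for(key: &str, num_shards: u64) -> u64 {
    let mut h = DefaultHasher::new();
    key.hash(&mut h);
    h.finish() % num_shards
}

fn main() {
    let shard = shard_for("user_42", 8);
    // The same key always routes to the same shard group,
    // so a tenant's data and subscriptions stay on one group.
    assert_eq!(shard, shard_for("user_42", 8));
    assert!(shard < 8);
    println!("user_42 -> DataUserShard({})", shard);
}
```

Because routing is deterministic per key, hotspots are confined to the shard group that owns the hot tenant rather than spreading across the cluster.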

High Concurrent Connection Model

KalamDB accepts WebSocket clients on any node and manages them in a lock-free connection registry:

  • ConnectionsManager uses DashMap indexes for fast concurrent connection/subscription access.
  • Connection lifecycle, authentication timeout, heartbeat timeout, and subscription routing are all centralized there.
  • The default concurrent connection guard is high (10,000), with configurable limits.

This allows high fan-in of realtime clients while keeping routing overhead low.
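The registry pattern above can be sketched as follows. This is a simplified stand-in: the real ConnectionsManager uses lock-free DashMap indexes, while this sketch uses `RwLock<HashMap>` from the standard library for brevity, and the field and method names are illustrative:

```rust
use std::collections::HashMap;
use std::sync::RwLock;

/// Simplified stand-in for a connection registry with a connection guard.
struct ConnectionsManager {
    max_connections: usize,
    // conn_id -> subscription topics attached to that connection
    connections: RwLock<HashMap<u64, Vec<String>>>,
}

impl ConnectionsManager {
    fn new(max_connections: usize) -> Self {
        Self { max_connections, connections: RwLock::new(HashMap::new()) }
    }

    /// Register a new connection, enforcing the concurrent-connection limit.
    fn register(&self, conn_id: u64) -> Result<(), &'static str> {
        let mut map = self.connections.write().unwrap();
        if map.len() >= self.max_connections {
            return Err("connection limit reached");
        }
        map.insert(conn_id, Vec::new());
        Ok(())
    }

    /// Attach a subscription to an existing connection.
    fn subscribe(&self, conn_id: u64, topic: &str) -> bool {
        let mut map = self.connections.write().unwrap();
        match map.get_mut(&conn_id) {
            Some(subs) => { subs.push(topic.to_string()); true }
            None => false,
        }
    }
}

fn main() {
    let mgr = ConnectionsManager::new(2);
    assert!(mgr.register(1).is_ok());
    assert!(mgr.register(2).is_ok());
    assert!(mgr.register(3).is_err()); // the guard rejects the overflow
    assert!(mgr.subscribe(1, "orders"));
}
```

Centralizing lifecycle, timeouts, and subscription routing in one registry keeps per-message dispatch to a couple of map lookups.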

Leader/Follower Access Pattern

Clients can connect to either leader or follower nodes.

  • Reads/subscription bootstrap can be served on the node the client is connected to.
  • Writes are always leader-authoritative per Raft group.
  • If a request lands on a follower for a write path, KalamDB forwards it to the current leader.
  • If leadership changes mid-request, the typed NotLeader error carries the new leader's identity, and the request is retried against that leader.

This gives you flexible client routing without sacrificing consistency for writes.
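The forward-and-retry flow above can be sketched like this. All names here (`WriteError::NotLeader`, `apply_write`, `leader_hint`) are hypothetical, not KalamDB's actual API; the point is the typed-error redirect loop:

```rust
/// Illustrative typed error: a follower rejects a write and hints at the leader.
#[derive(Debug)]
enum WriteError {
    NotLeader { leader_hint: usize },
}

struct Node { id: usize, leader: usize }

impl Node {
    fn apply_write(&self, _payload: &str) -> Result<(), WriteError> {
        if self.id == self.leader {
            Ok(()) // leader-authoritative commit path
        } else {
            Err(WriteError::NotLeader { leader_hint: self.leader })
        }
    }
}

/// Retry through the hinted leader, bounded so a flapping cluster cannot loop forever.
fn write_with_retry(nodes: &[Node], mut target: usize, payload: &str) -> Result<usize, ()> {
    for _ in 0..nodes.len() {
        match nodes[target].apply_write(payload) {
            Ok(()) => return Ok(target),
            Err(WriteError::NotLeader { leader_hint }) => target = leader_hint,
        }
    }
    Err(())
}

fn main() {
    let nodes = vec![
        Node { id: 0, leader: 2 },
        Node { id: 1, leader: 2 },
        Node { id: 2, leader: 2 },
    ];
    // A write landing on follower 0 is redirected to leader 2 and succeeds there.
    assert_eq!(write_with_retry(&nodes, 0, "INSERT ..."), Ok(2));
}
```

Because the redirect is driven by a typed error rather than a generic failure, clients and internal forwarders can distinguish "wrong node" from a real write failure.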

Notification Forwarding Across Nodes

Live subscribers do not need to be connected to the leader node for their shard.

When data changes:

  1. The shard leader processes the change notification.
  2. The leader broadcasts forwarded notifications to follower nodes over cluster RPC.
  3. Each follower dispatches to subscriptions connected locally on that follower.

This is the key reason you can attach users to any node and still receive correct realtime updates for their shard data.
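The three steps above can be sketched as a leader-side fan-out with local dispatch on each node. Types here are illustrative, not KalamDB's RPC layer:

```rust
use std::collections::HashMap;

/// Per-node view: which subscriber connections are attached locally, by topic.
struct NodeSubs {
    local: HashMap<String, Vec<u64>>, // topic -> locally connected subscriber ids
}

impl NodeSubs {
    /// Step 3: each node delivers only to its own attached subscribers.
    fn dispatch(&self, topic: &str) -> Vec<u64> {
        self.local.get(topic).cloned().unwrap_or_default()
    }
}

/// Steps 1-2: the shard leader fans the change event out to every node
/// (including itself), standing in for the cluster RPC broadcast.
fn broadcast(nodes: &[NodeSubs], topic: &str) -> Vec<u64> {
    nodes.iter().flat_map(|n| n.dispatch(topic)).collect()
}

fn main() {
    let mut a = HashMap::new();
    a.insert("orders".to_string(), vec![1]);
    let mut b = HashMap::new();
    b.insert("orders".to_string(), vec![7, 9]);
    let nodes = vec![NodeSubs { local: a }, NodeSubs { local: b }];
    // Subscribers on both nodes receive the update, regardless of which
    // node holds shard leadership.
    assert_eq!(broadcast(&nodes, "orders"), vec![1, 7, 9]);
}
```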

Sharding + HA in Practice

KalamDB shards users across multiple Raft data groups. Each shard group has its own:

  • leader election
  • replication log
  • follower set
  • failover boundary

Operationally, this gives you:

  • better parallelism (many active leaders across groups)
  • shard-level fault isolation
  • horizontal scaling by increasing cluster nodes and shard count
  • high availability from follower replicas per shard

When a node fails, Raft elects new leaders for affected groups and clients can reconnect to any healthy node.

Cluster Commands

```sql
-- List cluster nodes
CLUSTER LIST;

-- Check cluster health
CLUSTER STATUS;

-- Create a snapshot
CLUSTER SNAPSHOT;

-- Trigger a leader election
CLUSTER TRIGGER ELECTION;

-- Transfer leadership to a specific node
CLUSTER TRANSFER LEADER 2;

-- Step down as leader
CLUSTER STEPDOWN;

-- Purge old log entries
CLUSTER PURGE --UPTO 1000;
```

Configuration Knobs

In cluster mode, tune shard topology from cluster config:

  • cluster.user_shards
  • cluster.shared_shards

These values determine how many independent Raft data groups exist for user and shared workloads. See the cluster configuration reference for the full list of settings.

Deploying a 3-Node Cluster

The quickest way to start a cluster is with Docker:

```shell
# Export the secret so docker-compose can substitute it into the compose file;
# a one-shot VAR=... prefix on the pipeline would only reach the first command.
export KALAMDB_JWT_SECRET="$(openssl rand -base64 32)"

curl -sSL https://raw.githubusercontent.com/jamals86/KalamDB/main/docker/run/cluster/docker-compose.yml \
  | docker-compose -f - up -d
```

Backup & Restore

```sql
-- Backup a namespace
BACKUP DATABASE myapp TO '/backups/myapp-2026-02-18';

-- Restore from backup
RESTORE DATABASE myapp FROM '/backups/myapp-2026-02-18';

-- View backup history
SHOW BACKUPS FOR DATABASE myapp;
```