Clustering & High Availability
KalamDB uses a Multi-Raft architecture to scale reads, writes, and subscriptions across nodes while keeping data strongly replicated per shard.
How the Cluster Is Structured
KalamDB does not run a single Raft group for everything. It runs multiple groups:
- Meta: cluster/system metadata
- DataUserShard(N): user-data shards
- DataSharedShard(N): shared-data shards
Each group has its own leader and followers. This means leadership is distributed across the cluster, not centralized to a single node for all workloads.
Sharding is handled by a router that hashes user/table identifiers to shard groups, so users are naturally distributed across shards. This is how KalamDB scales tenant load and isolates shard-level hotspots.
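A minimal sketch of this style of hash routing, assuming a stable hash of the user ID is mapped to one of N user-shard groups (the function name and the exact hashing scheme are illustrative, not KalamDB's actual router API):

```python
import hashlib

def route_to_shard(user_id: str, num_user_shards: int) -> str:
    """Map a user ID to a stable shard group name.

    A stable hash means the same user always lands on the same shard
    group, which is what distributes tenants and isolates hotspots.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    shard = int.from_bytes(digest[:8], "big") % num_user_shards
    return f"DataUserShard({shard})"

# The same user always routes to the same group:
assert route_to_shard("alice", 4) == route_to_shard("alice", 4)
```

Because routing is deterministic, any node can compute a request's target shard group locally, without consulting a central directory.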
High Concurrent Connection Model
KalamDB accepts WebSocket clients on any node and manages them in a lock-free connection registry:
- ConnectionsManager uses DashMap indexes for fast concurrent connection/subscription access.
- Connection lifecycle, authentication timeout, heartbeat timeout, and subscription routing are all centralized there.
- The default concurrent connection guard is high (10,000), with configurable limits.
This allows high fan-in of realtime clients while keeping routing overhead low.
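To illustrate the connection-guard behavior, here is a toy registry sketch. The real registry is a lock-free Rust DashMap; this Python stand-in uses a plain lock and invented method names purely to show the limit semantics:

```python
import threading

class ConnectionRegistry:
    """Toy stand-in for a concurrent connection registry with a
    configurable max-connection guard (default 10,000 per the docs)."""

    def __init__(self, max_connections: int = 10_000):
        self._lock = threading.Lock()
        self._connections: dict[str, object] = {}
        self._max = max_connections

    def register(self, conn_id: str, conn: object) -> bool:
        with self._lock:
            if len(self._connections) >= self._max:
                return False  # guard tripped: reject the new client
            self._connections[conn_id] = conn
            return True

    def unregister(self, conn_id: str) -> None:
        with self._lock:
            self._connections.pop(conn_id, None)

reg = ConnectionRegistry(max_connections=2)
assert reg.register("c1", object())
assert reg.register("c2", object())
assert not reg.register("c3", object())  # over the limit, rejected
```

The guard rejects new clients rather than degrading existing ones, which keeps routing overhead predictable under fan-in spikes.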
Leader/Follower Access Pattern
Clients can connect to either leader or follower nodes.
- Reads/subscription bootstrap can be served on the node the client is connected to.
- Writes are always leader-authoritative per Raft group.
- If a request lands on a follower for a write path, KalamDB forwards it to the current leader.
- If leadership changed mid-request, typed NotLeader handling retries through the known leader.
This gives you flexible client routing without sacrificing consistency for writes.
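The follower-forwarding behavior can be sketched as a small redirect loop. The NotLeader error type appears in the text above; the node map and function names here are hypothetical stand-ins for cluster RPC:

```python
class NotLeader(Exception):
    """Raised when a write hits a non-leader; carries a leader hint."""
    def __init__(self, leader: str):
        super().__init__(f"not leader; try {leader}")
        self.leader = leader

def write_with_redirect(nodes, start: str, payload: str, max_hops: int = 3) -> str:
    """Send a write to `start`, following NotLeader hints to the leader."""
    target = start
    for _ in range(max_hops):
        try:
            return nodes[target](payload)  # stands in for an RPC call
        except NotLeader as e:
            target = e.leader  # retry through the known leader
    raise RuntimeError("leadership unstable; giving up")

def follower(payload):
    raise NotLeader("n2")  # this node knows n2 is the current leader

cluster = {"n1": follower, "n2": lambda p: f"committed:{p}"}
assert write_with_redirect(cluster, "n1", "x") == "committed:x"
```

Bounding the hop count matters: during an election, leadership hints can be stale, so the client eventually fails fast instead of looping.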
Notification Forwarding Across Nodes
Live subscribers do not need to be connected to the leader node for their shard.
When data changes:
- The shard leader processes the change notification.
- The leader broadcasts forwarded notifications to follower nodes over cluster RPC.
- Each follower dispatches to subscriptions connected locally on that follower.
This is the key reason you can attach users to any node and still receive correct realtime updates for their shard data.
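The fan-out path above can be sketched as follows. Each node tracks only its locally connected subscribers, and the leader forwards notifications to followers; the class and method names are illustrative, not KalamDB's internals:

```python
class Node:
    """Toy node: tracks subscribers connected locally to this node."""
    def __init__(self, name: str):
        self.name = name
        self.local_subs: dict[str, list] = {}  # topic -> list of inboxes

    def subscribe(self, topic: str) -> list:
        inbox: list = []
        self.local_subs.setdefault(topic, []).append(inbox)
        return inbox

    def dispatch(self, topic: str, event: str) -> None:
        for inbox in self.local_subs.get(topic, []):
            inbox.append(event)

def leader_broadcast(leader: Node, followers: list, topic: str, event: str) -> None:
    """Leader delivers locally, then forwards the notification to each
    follower, which dispatches to its own locally connected clients."""
    leader.dispatch(topic, event)
    for f in followers:  # stands in for cluster RPC
        f.dispatch(topic, event)

n1, n2 = Node("n1"), Node("n2")
inbox = n2.subscribe("users/alice")  # client attached to a follower
leader_broadcast(n1, [n2], "users/alice", "row-updated")
assert inbox == ["row-updated"]
```

Note that subscription state stays local to each node; only the notification crosses the network, so adding subscribers on followers does not add state to the leader.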
Sharding + HA in Practice
KalamDB shards users across multiple Raft data groups. Each shard group has its own:
- leader election
- replication log
- follower set
- failover boundary
Operationally, this gives you:
- better parallelism (many active leaders across groups)
- shard-level fault isolation
- horizontal scaling by increasing cluster nodes and shard count
- high availability from follower replicas per shard
When a node fails, Raft elects new leaders for affected groups and clients can reconnect to any healthy node.
Cluster Commands
-- List cluster nodes
CLUSTER LIST;
-- Check cluster health
CLUSTER STATUS;
-- Create a snapshot
CLUSTER SNAPSHOT;
-- Trigger a leader election
CLUSTER TRIGGER ELECTION;
-- Transfer leadership to a specific node
CLUSTER TRANSFER LEADER 2;
-- Step down as leader
CLUSTER STEPDOWN;
-- Purge old log entries
CLUSTER PURGE --UPTO 1000;

Configuration Knobs
In cluster mode, tune shard topology from cluster config:
- cluster.user_shards
- cluster.shared_shards
These values determine how many independent Raft data groups exist for user and shared workloads. To view all the configurations for the Cluster setup:
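As an illustrative sketch only (the exact file format, section names, and defaults may differ in your deployment), these knobs might appear in a cluster config file like:

```toml
[cluster]
user_shards = 4     # number of independent Raft groups for user data
shared_shards = 2   # number of independent Raft groups for shared data
```

Raising these counts increases leader parallelism and narrows failover boundaries, at the cost of more Raft groups to replicate and heartbeat.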
Deploying a 3-Node Cluster
The quickest way to start a cluster is with Docker:
KALAMDB_JWT_SECRET="$(openssl rand -base64 32)" \
curl -sSL https://raw.githubusercontent.com/jamals86/KalamDB/main/docker/run/cluster/docker-compose.yml | docker-compose -f - up -d

Backup & Restore
-- Backup a namespace
BACKUP DATABASE myapp TO '/backups/myapp-2026-02-18';
-- Restore from backup
RESTORE DATABASE myapp FROM '/backups/myapp-2026-02-18';
-- View backup history
SHOW BACKUPS FOR DATABASE myapp;