Why Object Storage Is Becoming the New Database Layer

Hot data still needs fast indexes. Cold data needs cheap columnar storage. The future database needs both.

For a long time, databases were built around a pretty simple assumption:

Keep the data inside the database, index it, and make queries fast.

That model worked really well when applications had smaller datasets, storage was expensive, and most systems only cared about the latest operational state.

But modern applications are different.

They generate events, logs, files, messages, user activity, analytics data, embeddings, audit trails, historical snapshots, and now increasingly huge amounts of AI-generated content. Modern AI systems produce not only final outputs, but also tool-calling traces, reasoning logs, agent workflows, intermediate results, retrieval context, evaluation data, conversation histories, and many other artifacts that companies may want to keep for debugging, compliance, analytics, personalization, or future model improvements.

Honestly, the amount of AI-related data can sometimes become bigger than the actual application data itself.

Some of this data must be very fast.

Some of it only needs to be cheap, durable, and queryable when needed.

In my view, as systems grow, teams often find themselves leaning toward one of three architectural approaches.

The first is the traditional approach that many of us have used for years.

The second is the common modern approach of combining multiple databases, each serving a different type of workload.

The third is a newer approach that treats hot and cold data differently from day one.

As CTO of NAZDAQ, I have seen this challenge repeatedly across organizations scaling their operational and analytical systems. In many cases, the solution is not simply adding more database capacity, but implementing a data synchronization and lifecycle strategy that keeps operational workloads fast while making historical and analytical data accessible where it is needed. This is one of the reasons solutions such as NAZDAQ's CloudSuite Data Sync have become increasingly important.

A common evolution often looks like this:

Start with one database.
Data grows.
Add an in-memory cache.
Add read replicas.
Partition tables.
Maybe shard tables to distribute load and storage.
Add specialized databases for search, documents, vectors, or analytics.
Clean historical data or move it into an archive to keep operational datasets small.
Replicate data into object storage or a warehouse for analytics.
Continue scaling each layer independently.

It works. In fact, many successful companies still run this way today.

But there are now multiple valid paths, and I think it is worth looking at them side by side.

Three Architectural Approaches We Will Explore in This Article

In this article, I will walk through three common ways teams build and scale modern data systems, examining the strengths, trade-offs, and challenges of each approach as data volumes and application requirements grow.

Option 1: Keep Everything in One Relational Database

This is the most traditional approach.

In this model, the relational database is the center of everything. Application data, historical data, and most operational queries all remain in the same system for as long as possible.

As traffic grows, you usually add supporting systems around it.

Application
    |
    v
Database (all data)
    |
    +--> Cache
    |
    +--> Read Replicas
    |
    +--> Analytics Replication
              |
              v
    Object Storage / Warehouse

This approach is easy to understand and easy to start with.

A relational database gives you transactions, consistency, mature tooling, SQL, backups, operational familiarity, and strong developer productivity.

This is still the default choice for many systems, and honestly there is nothing wrong with it.

Option 2: Use Multiple Databases for Different Workloads

The second approach is what many teams eventually move toward.

Instead of forcing one relational database to do everything, different databases are introduced for different needs.

For example:

a relational database for transactional data
a document database for flexible records
a search engine for full-text search
a key-value store or Redis for caching
a vector database or vector index for embeddings
a warehouse or analytical system for reporting

This approach is sometimes called polyglot persistence.

The main idea is simple: use the right database for the right job.

That can work very well.

It allows each workload to use a storage engine that fits its access pattern better.

But it also introduces new complexity.

Now you need to think about synchronization, consistency between systems, duplicate data pipelines, operational overhead, multiple query models, multiple backup strategies, and more integration work between databases.

Option 3: Keep Only Hot Data Hot

The third approach starts with a different assumption:

Not all data deserves to stay in operational storage forever.

Recent and active data stays in the database.

Historical data gets archived into cheap columnar storage such as Parquet files on object storage.

The database remains smaller and focused on operational workloads.

Historical data is still queryable without occupying expensive hot storage.

To be clear, this pattern can be implemented with almost any database today. Many teams already build custom archival pipelines that move older data into object storage, data lakes, or analytical systems while keeping recent data in their operational database.

The challenge is that most traditional databases do not provide this behavior natively. Developers typically need to write their own application logic, background jobs, retention policies, synchronization processes, query routing, and retrieval mechanisms to make hot and cold data work together seamlessly.
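To give a feel for the custom work described above, here is a minimal sketch of an archival job in Python. It rests on several assumptions: a hypothetical `events` table in SQLite stands in for the operational database, a gzip-compressed JSON Lines file stands in for Parquet on object storage, and the `archive_older_than` helper and its parameters are made up for illustration. A real pipeline would also need retries, scheduling, and query routing.

```python
import gzip
import json
import os
import sqlite3
import tempfile


def archive_older_than(conn, cutoff_ts, archive_path):
    """Move rows with created_at < cutoff_ts out of the hot table.

    Assumption: events(id, user_id, created_at, payload) is a hypothetical
    table, and a gzip JSON Lines file stands in for Parquet on object storage.
    """
    rows = conn.execute(
        "SELECT id, user_id, created_at, payload FROM events "
        "WHERE created_at < ? ORDER BY created_at",
        (cutoff_ts,),
    ).fetchall()
    if not rows:
        return 0

    # Write the cold copy first: a crash before the delete leaves
    # duplicates (recoverable) instead of losing data.
    with gzip.open(archive_path, "at", encoding="utf-8") as f:
        for id_, user_id, created_at, payload in rows:
            record = {"id": id_, "user_id": user_id,
                      "created_at": created_at, "payload": payload}
            f.write(json.dumps(record) + "\n")

    # Delete exactly the rows that were archived, in one transaction.
    with conn:
        conn.executemany("DELETE FROM events WHERE id = ?",
                         [(row[0],) for row in rows])
    return len(rows)


# Demo with made-up data: five events, created_at = 100, 200, ..., 500.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, user_id INTEGER, "
             "created_at INTEGER, payload TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?)",
                 [(i, 123, i * 100, f"event {i}") for i in range(1, 6)])

archive_path = os.path.join(tempfile.mkdtemp(), "events-archive.jsonl.gz")
moved = archive_older_than(conn, cutoff_ts=300, archive_path=archive_path)

with gzip.open(archive_path, "rt", encoding="utf-8") as f:
    archived = [json.loads(line) for line in f]
```

Writing the cold copy before deleting from the hot table is a deliberate ordering choice in this sketch: the failure mode becomes a duplicate row rather than a missing one.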

This is one of the areas where KalamDB aims to be different. Instead of treating hot and cold storage as separate systems that developers must manually coordinate, the goal is to make data lifecycle management a native capability of the database itself. Hot data can remain in fast operational storage, while historical data can be automatically archived into columnar formats on object storage and still remain accessible through SQL.

This is the model I find especially interesting today because it aligns with where many systems appear to be heading.

Comparing the Three Approaches

Option 1: One Relational Database

Pros	Cons
Very simple to start with	Database grows continuously
Strong transactional guarantees	Indexes become very large
Mature tooling and SQL ecosystem	Backups become larger and slower
Familiar to most teams	Compaction and maintenance become more expensive
Easy operational model early on	Often requires adding caches and replicas later
Historical data stays directly available	Analytics usually requires replication into another system
Good developer productivity	Eventually multiple supporting systems still appear

Option 2: Multiple Databases for Different Workloads

Pros	Cons
Each workload can use the best-fit database	Higher operational complexity
Good fit for diverse data types	Data synchronization becomes harder
Better flexibility for search, documents, vectors, etc.	Consistency between systems becomes more difficult
Can improve performance for specialized workloads	Teams need expertise across multiple systems
Often a practical step for growing platforms	Querying across systems becomes harder
Lets teams scale different services independently	More moving parts, backup policies, and failure modes
Useful when application needs are already very diverse	Can create silos between operational and analytical data

This architecture is very common today.

And to be clear, it is not a bad architecture.

For many companies, it is the most practical way to scale beyond a single relational system.

But it also tends to increase the number of systems that developers and operators need to understand.

Option 3: Hot/Cold Data Architecture

Pros	Cons
Smaller indexes from day one	Requires lifecycle management
Lower operational database costs	More architectural planning upfront
Faster operational queries	Historical queries may be slower
Better separation between operational and analytical workloads	Requires archive and retrieval mechanisms
Object storage costs are significantly lower	Teams must understand multiple storage layers
Easier long-term scaling	More moving parts initially
Historical data remains queryable	Not every workload benefits equally
Designed for growth from the beginning	Requires good tooling and automation

The biggest advantage, at least in my opinion, is that you are preparing for scale from the start instead of continuously adding new systems later.

Hot Data Should Stay Small and Fast

Hot data is the data your application needs right now.

For example:

SELECT * FROM orders WHERE status = 'pending';

SELECT * FROM messages
WHERE room_id = 42
ORDER BY created_at DESC
LIMIT 50;

SELECT * FROM users WHERE id = 123;

This data needs low latency.

It needs fast indexes.

It may power APIs, dashboards, real-time subscriptions, user sessions, and application screens.

But if we keep everything in the hot layer forever, eventually the hot layer becomes heavy.

Indexes grow.

Compaction becomes more expensive.

Backups become larger.

Queries touch more metadata.

Operational costs increase.

A smaller index is usually a faster index.

So one of the simplest ways to make hot storage faster is not only optimizing the storage engine itself, but reducing what the storage engine actually needs to keep hot.

Keep recent and active data in fast storage.

Move old, inactive, or analytical data somewhere else.

Sounds simple, but it has a surprisingly big impact.

Cold Data Is Not Dead Data

Cold data is not useless data.

It just has a different access pattern.

Examples:

old user activity
archived messages
historical orders
logs
analytical snapshots
embeddings
columnar datasets

This data may still be extremely valuable.

But it does not need to sit inside the most expensive and latency-sensitive part of the database forever.

This is why object storage and columnar formats are becoming more important.

Object storage gives us cheap, durable, distributed storage.

Columnar formats like Parquet give us efficient scans, compression, and much better performance for analytical queries.
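To make the scan and compression benefits a bit more concrete, the sketch below contrasts a row layout with a column layout in plain Python and applies run-length encoding to one column. It is a toy illustration of the general idea only: Parquet's real encodings (dictionary encoding, an RLE and bit-packing hybrid, page-level compression) are more elaborate, and the sample order data is made up for the example.

```python
from itertools import groupby

# Row layout: each record keeps all of its fields together.
rows = [
    {"order_id": 1, "status": "shipped", "amount": 20},
    {"order_id": 2, "status": "shipped", "amount": 35},
    {"order_id": 3, "status": "shipped", "amount": 15},
    {"order_id": 4, "status": "returned", "amount": 50},
    {"order_id": 5, "status": "returned", "amount": 10},
]

# Column layout: one contiguous list per field.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


def run_length_decode(pairs):
    """Expand (value, count) pairs back into the original list."""
    return [value for value, count in pairs for _ in range(count)]


# An aggregate over one field only has to touch that field's list.
total_amount = sum(columns["amount"])

# Repetitive columns shrink well: five status values become two pairs.
encoded_status = run_length_encode(columns["status"])
```

The same two properties, reading only the columns a query needs and encoding similar values together, are what make columnar files a good fit for historical scans.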

Together, they are changing the shape of the database.

Instead of saying:

The database owns all data forever.

Newer systems are moving toward:

The database manages hot data directly, and cold data lives in open storage that can still be queried.

This is actually a pretty big shift if you think about it.

Object Storage Is Becoming a Database Building Block

Colin Breck's post "Predicting the Future of Distributed Systems" makes a very important point:

Object storage is no longer just a place for backups, archives, or batch data.

It is becoming a core building block for distributed systems.

The reason is simple.

Object storage is mature, durable, widely available, and relatively cheap.

It also gives systems a common storage layer that can be shared across services, clouds, and tools.

That does not mean every query should directly hit object storage.

That would be way too slow for many operational workloads.

The smarter pattern looks more like this:

Example: AI agent conversations for a specific user

User: Sarah (user_id = 123)

Hot path:
Keep the latest 10 conversations in fast storage
active indexes on user_id and conversation_id
low-latency reads and writes
immediate access for the AI agent and application

Example:
Conversation #91 (today)
Conversation #90
...
Conversation #82

Cold path:
Older conversations (#81 and earlier)
compressed Parquet files
object storage
archived conversation history
cheaper long-term storage

Example:
Conversation #81
Conversation #80
...
Conversation #1

Query layer:
hides the complexity from developers

Example query:

SELECT * FROM conversations
WHERE user_id = 123;

The system automatically reads recent conversations from hot storage and older conversations from compressed columnar files in object storage without the application needing to know where the data lives.
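A rough sketch of what such a query layer does internally might look like the following. Everything here is an assumption made for illustration: an in-memory SQLite table plays the hot store, a plain list of dicts plays rows already decoded from cold Parquet files, conversation ids are assumed to increase over time, and `query_conversations` is a hypothetical helper rather than any real database's API.

```python
import sqlite3


def query_conversations(hot_conn, cold_rows, user_id):
    """Return one user's conversations from both tiers, newest first.

    If a row appears in both tiers (for example, mid-archive), the hot
    copy wins so the caller never sees duplicates.
    """
    hot = hot_conn.execute(
        "SELECT id, user_id, title FROM conversations WHERE user_id = ?",
        (user_id,),
    ).fetchall()
    merged = {
        row[0]: {"id": row[0], "user_id": row[1], "title": row[2], "tier": "hot"}
        for row in hot
    }
    for row in cold_rows:
        if row["user_id"] == user_id and row["id"] not in merged:
            merged[row["id"]] = {**row, "tier": "cold"}
    # Assumption: ids increase over time, so sorting by id means newest first.
    return sorted(merged.values(), key=lambda r: r["id"], reverse=True)


# Hot tier: the latest 10 conversations (#82 to #91) for user 123.
hot_conn = sqlite3.connect(":memory:")
hot_conn.execute(
    "CREATE TABLE conversations (id INTEGER PRIMARY KEY, user_id INTEGER, title TEXT)"
)
hot_conn.executemany(
    "INSERT INTO conversations VALUES (?, ?, ?)",
    [(i, 123, f"Conversation #{i}") for i in range(82, 92)],
)

# Cold tier: #1 to #82 (82 overlaps with hot on purpose) plus another user's row.
cold_rows = [{"id": i, "user_id": 123, "title": f"Conversation #{i}"}
             for i in range(1, 83)]
cold_rows.append({"id": 500, "user_id": 999, "title": "someone else"})

results = query_conversations(hot_conn, cold_rows, user_id=123)
```

The application calls one function and gets one result set; which tier each row came from is only visible here because the sketch tags it for demonstration.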

This is probably the most important lesson.

Object storage is not replacing every part of the database.

It is becoming the durable and scalable foundation behind many systems.

The database still needs caching, indexing, metadata, query planning, and execution.

But the storage foundation is changing.

The Database Is Becoming a Coordinator

In older architectures, the database was basically the box where all important data lived.

In newer architectures, the database increasingly becomes a coordinator across multiple storage layers.

One layer is optimized for fast operational access.

Another layer is optimized for cheap, durable, long-term storage.

Another layer may be optimized for analytical scans.

Another may hold files, vectors, or columnar data.

The developer should not need to care where every byte lives.

They should still be able to write:

SELECT * FROM user_events
WHERE user_id = 123;

The system can decide:

recent events come from hot storage
older events come from Parquet/object storage
metadata comes from the database
large payloads come from files
vector data can live near the historical dataset
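One simple way a coordinator could make the first two decisions in that list is by comparing the query's time range against a retention cutoff. The sketch below assumes a single cutoff timestamp separates the tiers (rows at or after it are hot, older rows are cold); `TimeRange` and `plan_tiers` are hypothetical names, and a real planner would also consult file-level metadata and statistics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimeRange:
    start: int  # inclusive, e.g. epoch seconds
    end: int    # exclusive


def plan_tiers(query, hot_cutoff):
    """Return the storage tiers a query over `query` has to read.

    Assumption: rows with timestamp >= hot_cutoff live in hot storage,
    and everything older has been archived to cold storage.
    """
    if query.start >= query.end:
        return []  # empty range, nothing to read
    tiers = []
    if query.end > hot_cutoff:
        tiers.append("hot")
    if query.start < hot_cutoff:
        tiers.append("cold")
    return tiers


recent_only = plan_tiers(TimeRange(100, 200), hot_cutoff=50)
history_only = plan_tiers(TimeRange(0, 50), hot_cutoff=50)
spanning = plan_tiers(TimeRange(10, 100), hot_cutoff=50)
```

Queries that stay on the recent side of the cutoff never pay the cost of touching object storage, which is the property that keeps the operational path fast.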

This is where the architecture becomes powerful.

Not because it uses more technologies.

But because it uses each storage type for what it is actually good at.

Why This Solves Multiple Problems

In Arabic, we say:

killing two birds with one stone.

That is what hot/cold storage can do when designed properly.

In reality, it often solves far more than two problems at once.

1. Faster Operational Databases

When historical data leaves the hot path, indexes become smaller.

Smaller indexes are easier to keep in memory, faster to scan, and cheaper to maintain.
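A back-of-the-envelope calculation shows why this matters. The numbers below are purely hypothetical: 16 bytes per index entry, pages that are 70% full on average, 2 billion total rows of which 50 million are hot, and 8 GiB of memory available for the index. Real index sizes depend on the engine, key types, and overhead that this estimate ignores.

```python
def estimated_index_bytes(row_count, entry_bytes=16, fill_factor=0.7):
    """Rough index size: total entry bytes divided by average page fill.

    Assumptions: fixed-size entries (key plus row pointer) and no
    per-page or per-level overhead.
    """
    return int(row_count * entry_bytes / fill_factor)


def fits_in_memory(row_count, memory_bytes, **kwargs):
    """True if the estimated index size is within the memory budget."""
    return estimated_index_bytes(row_count, **kwargs) <= memory_bytes


GIB = 1024 ** 3
all_rows = 2_000_000_000   # hypothetical: every row ever written
hot_rows = 50_000_000      # hypothetical: only recent, active rows

all_fits = fits_in_memory(all_rows, memory_bytes=8 * GIB)
hot_fits = fits_in_memory(hot_rows, memory_bytes=8 * GIB)
```

Under these assumptions the full index comes to roughly 40 GiB or more and cannot stay resident, while the hot-only index is on the order of 1 GiB and fits comfortably.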

2. Lower Storage Costs

Object storage is dramatically cheaper than keeping years of historical data inside operational databases. Columnar formats such as Parquet also compress data extremely well, often reducing storage size significantly compared to keeping the same data in hot operational storage.

3. Better Analytics

Historical data naturally fits columnar formats such as Parquet.

Analytical workloads become more efficient without impacting operational traffic, and most modern analytics tools already support these columnar formats natively.

4. Simpler Scaling

Instead of continuously adding caches, replicas, and specialized systems later, the architecture is already prepared for growth.

5. Better Separation of Concerns

Operational workloads stay operational.

Analytical workloads stay analytical.

Each storage layer focuses on what it does best.

This Pattern Is Already Showing Up in Successful Systems

This direction is not only theoretical.

Some successful modern databases are already moving toward architectures where local disk is no longer the center of the system.

ClickHouse Cloud is one example.

ClickHouse has been moving toward separation of storage and compute.

In this model, object storage can hold shared table data, while compute nodes can scale more independently.

Neon is another strong example.

Neon separates Postgres compute from storage.

Its architecture uses object storage as the durable long-term layer, while keeping latency-sensitive work close to compute using RAM, local NVMe, and pageservers.

That teaches a very important lesson:

Object storage is excellent for durability, scale, and cost — but the hot query path still needs smart serving layers.

So the future is not:

Put everything directly on S3 and hope queries are fast.

The better architecture is:

Use object storage as the durable foundation.
Use hot storage and cache for low latency.
Use columnar formats for analytical scans.
Use SQL to hide the complexity from developers.
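As one small piece of that picture, here is a sketch of a read-through LRU cache sitting in front of a slow object-store fetch. It is an illustration under assumptions, not how ClickHouse or Neon implement their serving layers: the `ReadThroughCache` class is hypothetical, and a plain dictionary stands in for the object store.

```python
from collections import OrderedDict


class ReadThroughCache:
    """Serve repeated reads from memory; fall back to `fetch` on a miss."""

    def __init__(self, fetch, capacity):
        self._fetch = fetch          # callable(key) -> value, the slow path
        self._capacity = capacity    # maximum number of cached entries
        self._entries = OrderedDict()
        self.misses = 0

    def get(self, key):
        if key in self._entries:
            # Mark as most recently used.
            self._entries.move_to_end(key)
            return self._entries[key]
        self.misses += 1
        value = self._fetch(key)
        self._entries[key] = value
        if len(self._entries) > self._capacity:
            # Evict the least recently used entry.
            self._entries.popitem(last=False)
        return value


# A dict stands in for object storage (assumption for the demo).
fake_object_store = {"a": b"block-a", "b": b"block-b", "c": b"block-c"}
cache = ReadThroughCache(fetch=fake_object_store.__getitem__, capacity=2)

reads = [cache.get(key) for key in ["a", "a", "b", "c", "a"]]
```

In this run the second read of "a" is served from memory, but the final read misses again because "a" was evicted when "c" arrived, which is exactly the trade-off a serving layer has to size for.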

Why Apache DataFusion Matters for KalamDB

KalamDB uses Apache DataFusion as its query engine to provide a unified SQL layer across both hot and cold storage.

DataFusion is a Rust-based query engine built around Apache Arrow.

It supports SQL, DataFrame APIs, Parquet, CSV, JSON, Avro, object-store access, query optimization, and extension points.

This fits naturally with KalamDB's architecture, where hot data lives in RocksDB and historical data is archived into Parquet files.

From a developer's perspective, there is no need to query different storage systems separately. You simply write a SQL query, and KalamDB automatically determines where the data lives and how to retrieve it.

A query may read:

only hot data from RocksDB
only cold data from Parquet files
a combination of both hot and cold data
indexes and embeddings alongside stored records

DataFusion provides the execution and optimization layer that allows KalamDB to unify these sources into a single query result.

This allows KalamDB to focus on:

hot operational storage
real-time SQL subscriptions
metadata management
data movement between hot and cold tiers
developer experience

DataFusion, in turn, handles query planning, optimization, and execution across the different storage layers.

Personally, I think this is one of the reasons DataFusion has become so interesting recently.

This Is Not Only About Analytics

Hot/cold architecture is often discussed as an analytics pattern.

But it is becoming useful for operational systems too.

Imagine a chat application.

Recent messages need to be fast:

SELECT * FROM messages
WHERE room_id = 10
ORDER BY created_at DESC
LIMIT 50;

But messages from three years ago do not need to sit in the same hot index.

They can be moved to cold storage.

The application still supports search, history, export, analytics, and compliance.

But the hot database stays small and fast.

The user experience improves, and infrastructure costs go down.
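For the chat case, one way to keep the common path cheap is to fill a page from hot storage first and only reach into cold storage when the hot tier runs out. The sketch below assumes every cold message is older than every hot message in the same room; `latest_messages` and the `fetch_cold` callback are hypothetical names, and plain lists stand in for both tiers.

```python
def latest_messages(hot_rows, fetch_cold, room_id, limit=50):
    """Return up to `limit` messages for a room, newest first.

    Assumption: cold messages are always older than hot ones, so cold
    storage is only consulted when the hot tier cannot fill the page.
    """
    page = sorted(
        (m for m in hot_rows if m["room_id"] == room_id),
        key=lambda m: m["created_at"],
        reverse=True,
    )[:limit]
    if len(page) < limit:
        before = page[-1]["created_at"] if page else None
        page += fetch_cold(room_id, before, limit - len(page))
    return page


# Made-up data: three recent messages in hot storage, five older ones in cold.
hot_rows = [{"room_id": 10, "created_at": ts} for ts in (100, 101, 102)]
cold_rows = [{"room_id": 10, "created_at": ts} for ts in (1, 2, 3, 4, 5)]
cold_reads = []


def fetch_cold(room_id, before, limit):
    """Stand-in for reading archived messages from object storage."""
    cold_reads.append((room_id, before, limit))
    matches = [
        m for m in cold_rows
        if m["room_id"] == room_id and (before is None or m["created_at"] < before)
    ]
    return sorted(matches, key=lambda m: m["created_at"], reverse=True)[:limit]


small_page = latest_messages(hot_rows, fetch_cold, room_id=10, limit=2)
cold_reads_after_small = len(cold_reads)
large_page = latest_messages(hot_rows, fetch_cold, room_id=10, limit=5)
```

Opening a room and loading its latest messages stays entirely on the hot path; only scrolling far back into history triggers a slower cold read.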

Another example is user activity.

The last few days of activity may be needed for real-time features.

But older activity can move to Parquet files in object storage.

The system can still query it later for reports, personalization, debugging, or audits.

The database should help manage this lifecycle automatically instead of forcing developers to build everything themselves.

Where KalamDB Fits

KalamDB is being built around this direction.

The goal is pretty simple:

Keep hot data fast.

Archive historical data into cheap columnar storage.

Allow SQL to work across both.

Support real-time applications without forcing all data to remain hot forever.

In this model, RocksDB can be used for operational storage while older data moves into Parquet files on object storage.

Apache DataFusion provides the query layer.

Hot path:
recent rows
active indexes
real-time subscriptions
low-latency queries

Cold path:
archived rows
historical events
columnar files
cheaper object storage
analytical reads

Query layer:
Apache DataFusion
SQL
open formats
optimizer

Developers should not need to think about two completely different worlds.

They should simply use SQL and let the system handle the details.

At least that's the idea.

[Figure: KalamDB hot/cold storage, with flush to object storage]

Final Thought

The main point of this article is not that relational databases are going away, or that every system should immediately adopt a hot/cold architecture.

Relational databases remain one of the most successful and productive technologies ever created. For many applications, keeping everything in a single database is still the right choice. And for organizations with diverse workloads, a multi-database architecture can be a very practical way to scale.

What I think is changing is our understanding of where data should live over time.

As datasets continue to grow, and as AI systems generate increasing amounts of operational, analytical, and historical data, treating all data as equally hot becomes harder to justify. The cost of storing, indexing, backing up, and maintaining years of historical information inside operational databases keeps increasing, while object storage and open columnar formats continue to become more capable and more economical.

That is why I believe object storage is evolving from a backup destination into a foundational layer of modern data systems.

The database is no longer just a place where data lives. Increasingly, it becomes a coordinator across storage layers, balancing low-latency operational access with scalable and cost-effective long-term storage.

Whether you choose a single relational database, a collection of specialized databases, or a hot/cold architecture, the same principle applies:

Hot data should stay fast.

Cold data should stay cheap.

Both should remain accessible.

And the systems that make that balance easiest for developers will likely define the next generation of database architecture.

References

Colin Breck — “Predicting the Future of Distributed Systems”
https://blog.colinbreck.com/predicting-the-future-of-distributed-systems/
Colin Breck — “Object Storage and In-Process Databases are Changing Distributed Systems”
https://blog.colinbreck.com/object-storage-and-in-process-databases-are-changing-distributed-systems/
ClickHouse Docs — “Separation of Storage and Compute”
https://clickhouse.com/docs/guides/separation-storage-compute
Neon Docs — “Neon’s Lakebase Architecture”
https://neon.com/docs/introduction/architecture-overview
Apache DataFusion Docs — “Introduction”
https://datafusion.apache.org/user-guide/introduction.html
DuckDB Docs — “Parquet Files”
https://duckdb.org/docs/current/data/parquet/overview
Apache Iceberg Docs
https://iceberg.apache.org/
NAZDAQ Data Sync solution
https://www.nazdaq-it.com/solutions/infor-cloudsuite/cloudsuite-data-sync
