Why Object Storage Is Becoming the New Database Layer

Modern apps produce huge amounts of data, but not all of it should stay hot forever. This post explores why object storage, Parquet, and hot/cold architecture are becoming key to keeping databases fast, scalable, and cost-efficient.

Hot data still needs fast indexes. Cold data needs cheap columnar storage. The future database needs both.

For a long time, databases were built around a pretty simple assumption:

Keep the data inside the database, index it, and make queries fast.

That model worked really well when applications had smaller datasets, storage was expensive, and most systems only cared about the latest operational state.

But modern applications are different.

They generate events, logs, files, messages, user activity, analytics data, embeddings, audit trails, historical snapshots, and now increasingly huge amounts of AI-generated content. Modern AI systems produce not only final outputs, but also tool-calling traces, reasoning logs, agent workflows, intermediate results, retrieval context, evaluation data, conversation histories, and many other artifacts that companies may want to keep for debugging, compliance, analytics, personalization, or future model improvements.

Honestly, the amount of AI-related data can sometimes become bigger than the actual application data itself.

Some of this data must be very fast.

Some of it only needs to be cheap, durable, and queryable when needed.

In my view, as systems grow, teams often find themselves leaning toward one of three architectural approaches.

The first is the traditional approach that many of us have used for years.

The second is the common modern approach of combining multiple databases, each serving a different type of workload.

The third is a newer approach that treats hot and cold data differently from day one.

As CTO of NAZDAQ, I have seen this challenge repeatedly across organizations scaling their operational and analytical systems. In many cases, the solution is not simply adding more database capacity, but implementing a data synchronization and lifecycle strategy that keeps operational workloads fast while making historical and analytical data accessible where it is needed. This is one of the reasons solutions such as NAZDAQ's CloudSuite Data Sync have become increasingly important.

A common evolution often looks like this:

  1. Start with one database.
  2. Data grows.
  3. Add an in-memory cache.
  4. Add read replicas.
  5. Partition tables.
  6. Maybe shard tables to distribute load and storage.
  7. Add specialized databases for search, documents, vectors, or analytics.
  8. Clean historical data or move it into an archive to keep operational datasets small.
  9. Replicate data into object storage or a warehouse for analytics.
  10. Continue scaling each layer independently.

It works. In fact, many successful companies still run this way today.

But there are now multiple valid paths, and I think it is worth looking at them side by side.


Three Architectural Approaches We Will Explore in This Article

In this article, I will walk through three common ways teams build and scale modern data systems, examining the strengths, trade-offs, and challenges of each approach as data volumes and application requirements grow.

Option 1: Keep Everything in One Relational Database

This is the most traditional approach.

In this model, the relational database is the center of everything. Application data, historical data, and most operational queries all remain in the same system for as long as possible.

As traffic grows, you usually add supporting systems around it.

Application
    |
    v
Database (all data)
    |
    +--> Cache
    |
    +--> Read Replicas
    |
    +--> Analytics Replication
              |
              v
         Object Storage / Warehouse


This approach is easy to understand and easy to start with.

A relational database gives you transactions, consistency, mature tooling, SQL, backups, operational familiarity, and strong developer productivity.

This is still the default choice for many systems, and honestly there is nothing wrong with it.

Option 2: Use Multiple Databases for Different Workloads

The second approach is what many teams eventually move toward.

Instead of forcing one relational database to do everything, different databases are introduced for different needs.

For example:

  • a relational database for transactional data
  • a document database for flexible records
  • a search engine for full-text search
  • a key-value store or Redis for caching
  • a vector database or vector index for embeddings
  • a warehouse or analytical system for reporting

It often looks something like this:

[Figure: combining multiple types of databases]

This approach is sometimes called polyglot persistence.

The main idea is simple: use the right database for the right job.

That can work very well.

It allows each workload to use a storage engine that fits its access pattern better.

But it also introduces new complexity.

Now you need to think about synchronization, consistency between systems, duplicate data pipelines, operational overhead, multiple query models, multiple backup strategies, and more integration work between databases.

Option 3: Keep Only Hot Data Hot

The third approach starts with a different assumption:

Not all data deserves to stay in operational storage forever.

[Figure: keep only hot data hot]


Recent and active data stays in the database.

Historical data gets archived into cheap columnar storage such as Parquet files on object storage.

The database remains smaller and focused on operational workloads.

Historical data is still queryable without occupying expensive hot storage.

To be clear, this pattern can be implemented with almost any database today. Many teams already build custom archival pipelines that move older data into object storage, data lakes, or analytical systems while keeping recent data in their operational database.

The challenge is that most traditional databases do not provide this behavior natively. Developers typically need to write their own application logic, background jobs, retention policies, synchronization processes, query routing, and retrieval mechanisms to make hot and cold data work together seamlessly.
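To make that hand-written work concrete, here is a minimal sketch of the kind of archival job a team might build themselves. It is only an illustration under stated assumptions: the `events` table, the `archive_old_rows` helper, and the in-memory `cold_store` list are all hypothetical, SQLite stands in for the operational database, and a column-oriented dict stands in for a Parquet file on object storage.

```python
import sqlite3


def archive_old_rows(conn, cold_store, cutoff):
    """Copy rows older than `cutoff` into cold_store, then delete them from hot storage."""
    rows = conn.execute(
        "SELECT id, user_id, created_at, payload FROM events "
        "WHERE created_at < ? ORDER BY created_at",
        (cutoff,),
    ).fetchall()
    if not rows:
        return 0

    # Store the batch column by column, roughly the way a columnar file would.
    batch = {
        "id": [r[0] for r in rows],
        "user_id": [r[1] for r in rows],
        "created_at": [r[2] for r in rows],
        "payload": [r[3] for r in rows],
    }
    cold_store.append(batch)  # stand-in for "write a Parquet file to object storage"

    # Delete only the rows that were actually archived, in one transaction.
    with conn:
        conn.executemany("DELETE FROM events WHERE id = ?", [(i,) for i in batch["id"]])
    return len(rows)


# Hypothetical sample data: two old events and one recent event.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (id INTEGER PRIMARY KEY, user_id INTEGER, created_at TEXT, payload TEXT)"
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?)",
    [
        (1, 123, "2023-01-15", "signup"),
        (2, 123, "2023-06-01", "login"),
        (3, 456, "2025-02-10", "purchase"),
    ],
)
conn.commit()

cold_store = []
moved = archive_old_rows(conn, cold_store, "2024-01-01")
remaining = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(moved, remaining)
```

Even this toy version has to decide on a cutoff, write the cold copy before deleting, and delete exactly what it archived. A real pipeline would also need scheduling, retries, and a way to query the archived batches, which is the work the paragraph above describes.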

This is one of the areas where KalamDB aims to be different. Instead of treating hot and cold storage as separate systems that developers must manually coordinate, the goal is to make data lifecycle management a native capability of the database itself. Hot data can remain in fast operational storage, while historical data can be automatically archived into columnar formats on object storage and still remain accessible through SQL.

This is the model I find especially interesting today because it aligns with where many systems appear to be heading.


Comparing the Three Approaches

Option 1: One Relational Database

Pros                                     | Cons
-----------------------------------------|-----------------------------------------------------------
Very simple to start with                | Database grows continuously
Strong transactional guarantees          | Indexes become very large
Mature tooling and SQL ecosystem         | Backups become larger and slower
Familiar to most teams                   | Compaction and maintenance become more expensive
Easy operational model early on          | Often requires adding caches and replicas later
Historical data stays directly available | Analytics usually requires replication into another system
Good developer productivity              | Eventually multiple supporting systems still appear

Option 2: Multiple Databases for Different Workloads

Pros                                                    | Cons
--------------------------------------------------------|------------------------------------------------------
Each workload can use the best-fit database             | Higher operational complexity
Good fit for diverse data types                         | Data synchronization becomes harder
Better flexibility for search, documents, vectors, etc. | Consistency between systems becomes more difficult
Can improve performance for specialized workloads       | Teams need expertise across multiple systems
Often a practical step for growing platforms            | Querying across systems becomes harder
Lets teams scale different services independently       | More moving parts, backup policies, and failure modes
Useful when application needs are already very diverse  | Can create silos between operational and analytical data

This architecture is very common today.

And to be clear, it is not a bad architecture.

For many companies, it is the most practical way to scale beyond a single relational system.

But it also tends to increase the number of systems that developers and operators need to understand.

Option 3: Hot/Cold Data Architecture

Pros                                                           | Cons
---------------------------------------------------------------|----------------------------------------------
Smaller indexes from day one                                   | Requires lifecycle management
Lower operational database costs                               | More architectural planning upfront
Faster operational queries                                     | Historical queries may be slower
Better separation between operational and analytical workloads | Requires archive and retrieval mechanisms
Object storage costs are significantly lower                   | Teams must understand multiple storage layers
Easier long-term scaling                                       | More moving parts initially
Historical data remains queryable                              | Not every workload benefits equally
Designed for growth from the beginning                         | Requires good tooling and automation

The biggest advantage, at least in my opinion, is that you are preparing for scale from the start instead of continuously adding new systems later.


Hot Data Should Stay Small and Fast

Hot data is the data your application needs right now.

For example:

    SELECT * FROM orders WHERE status = 'pending';

    SELECT * FROM messages
    WHERE room_id = 42
    ORDER BY created_at DESC
    LIMIT 50;

    SELECT * FROM users WHERE id = 123;


This data needs low latency.

It needs fast indexes.

It may power APIs, dashboards, real-time subscriptions, user sessions, and application screens.

But if we keep everything in the hot layer forever, eventually the hot layer becomes heavy.

Indexes grow.

Compaction becomes more expensive.

Backups become larger.

Queries touch more metadata.

Operational costs increase.

A smaller index is usually a faster index.

So one of the simplest ways to make hot storage faster is not only optimizing the storage engine itself, but reducing what the storage engine actually needs to keep hot.

Keep recent and active data in fast storage.

Move old, inactive, or analytical data somewhere else.

Sounds simple, but it has a surprisingly big impact.


Cold Data Is Not Dead Data

Cold data is not useless data.

It just has a different access pattern.

Examples:

  • old user activity
  • archived messages
  • historical orders
  • logs
  • analytical snapshots
  • embeddings
  • columnar datasets

This data may still be extremely valuable.

But it does not need to sit inside the most expensive and latency-sensitive part of the database forever.

This is why object storage and columnar formats are becoming more important.

Object storage gives us cheap, durable, distributed storage.

Columnar formats like Parquet give us efficient scans, compression, and much better performance for analytical queries.
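A tiny sketch can show why a columnar layout helps. This is a deliberately simplified illustration, not how Parquet is actually implemented: real Parquet writers combine several encodings and compression codecs. The sample rows and the helper names `to_columns` and `run_length_encode` are made up for the example.

```python
from itertools import groupby

# Hypothetical order rows, as an operational row store would hold them.
rows = [
    {"order_id": 1, "status": "shipped", "amount": 20},
    {"order_id": 2, "status": "shipped", "amount": 35},
    {"order_id": 3, "status": "shipped", "amount": 12},
    {"order_id": 4, "status": "returned", "amount": 35},
]


def to_columns(rows):
    """Pivot a list of row dicts into a dict of column lists."""
    return {name: [row[name] for row in rows] for name in rows[0]}


def run_length_encode(values):
    """Collapse consecutive repeated values into [value, count] pairs."""
    return [[value, len(list(group))] for value, group in groupby(values)]


columns = to_columns(rows)

# Repeated values sit next to each other in a column, so they compress well.
encoded_status = run_length_encode(columns["status"])

# An analytical aggregate only needs to read the one column it uses.
total_amount = sum(columns["amount"])

print(encoded_status, total_amount)
```

The two effects on display are the ones that matter for cold data: similar values stored together shrink under simple encodings, and a scan over one column never has to read the others.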

Together, they are changing the shape of the database.

Instead of saying:

The database owns all data forever.

Newer systems are moving toward:

The database manages hot data directly, and cold data lives in open storage that can still be queried.

This is actually a pretty big shift if you think about it.


Object Storage Is Becoming a Database Building Block

Colin Breck's post "Predicting the Future of Distributed Systems" makes a very important point:

Object storage is no longer just a place for backups, archives, or batch data.

It is becoming a core building block for distributed systems.

The reason is simple.

Object storage is mature, durable, widely available, and relatively cheap.

It also gives systems a common storage layer that can be shared across services, clouds, and tools.

That does not mean every query should directly hit object storage.

That would be way too slow for many operational workloads.

The smarter pattern looks more like this:

Example: AI agent conversations for a specific user

User: Sarah (user_id = 123)

Hot path:

  • keep the latest 10 conversations in fast storage
  • active indexes on user_id and conversation_id
  • low-latency reads and writes
  • immediate access for the AI agent and application

Example:

  • Conversation #91 (today)
  • Conversation #90
  • ...
  • Conversation #82

Cold path:

  • older conversations (#81 and earlier)
  • compressed Parquet files
  • object storage
  • archived conversation history
  • cheaper long-term storage

Example:

  • Conversation #81
  • Conversation #80
  • ...
  • Conversation #1

Query layer:

  • hides the complexity from developers

Example query:

    SELECT * FROM conversations WHERE user_id = 123;


The system automatically reads recent conversations from hot storage and older conversations from compressed columnar files in object storage without the application needing to know where the data lives.
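As a rough sketch of what such a query layer does behind that single SELECT, the function below merges both tiers for one user. Everything here is an assumption for illustration: `query_conversations`, the list of hot rows, and the column-oriented cold batches are hypothetical stand-ins, not any particular database's implementation.

```python
def query_conversations(user_id, hot_rows, cold_batches):
    """Return all conversations for a user, newest first, across both tiers."""
    merged = {}

    # Read cold batches first so a hot copy of the same row wins,
    # for example when a row was archived but not yet removed from hot storage.
    for batch in cold_batches:
        for conv_id, owner, title in zip(
            batch["conversation_id"], batch["user_id"], batch["title"]
        ):
            if owner == user_id:
                merged[conv_id] = {"conversation_id": conv_id, "title": title, "tier": "cold"}

    for row in hot_rows:
        if row["user_id"] == user_id:
            merged[row["conversation_id"]] = {
                "conversation_id": row["conversation_id"],
                "title": row["title"],
                "tier": "hot",
            }

    return sorted(merged.values(), key=lambda r: r["conversation_id"], reverse=True)


# Hypothetical data: recent conversations are hot, older ones are in a cold batch.
hot_rows = [
    {"conversation_id": 91, "user_id": 123, "title": "Trip planning"},
    {"conversation_id": 90, "user_id": 123, "title": "Invoice question"},
    {"conversation_id": 17, "user_id": 999, "title": "Someone else"},
]
cold_batches = [
    {
        "conversation_id": [80, 81, 90],
        "user_id": [123, 123, 123],
        "title": ["Old A", "Old B", "Invoice question (stale copy)"],
    },
]

result = query_conversations(123, hot_rows, cold_batches)
print([r["conversation_id"] for r in result])
```

The caller passes a user ID and gets one ordered result. Which tier each row came from, and how duplicates across tiers are resolved, stays inside the query layer.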

This is probably the most important lesson.

Object storage is not replacing every part of the database.

It is becoming the durable and scalable foundation behind many systems.

The database still needs caching, indexing, metadata, query planning, and execution.

But the storage foundation is changing.


The Database Is Becoming a Coordinator

In older architectures, the database was basically the box where all important data lived.

In newer architectures, the database increasingly becomes a coordinator across multiple storage layers.

One layer is optimized for fast operational access.

Another layer is optimized for cheap, durable, long-term storage.

Another layer may be optimized for analytical scans.

Another may hold files, vectors, or columnar data.

The developer should not need to care where every byte lives.

They should still be able to write:

    SELECT * FROM user_events WHERE user_id = 123;


The system can decide:

  • recent events come from hot storage
  • older events come from Parquet/object storage
  • metadata comes from the database
  • large payloads come from files
  • vector data can live near the historical dataset

This is where the architecture becomes powerful.

Not because it uses more technologies.

But because it uses each storage type for what it is actually good at.
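One way a coordinator can make that decision is to keep a small catalog of cold files with their time ranges, and use it to skip files a query cannot match. The sketch below assumes such a catalog exists. The `ColdFile` record, the object keys, the `hot_since` boundary, and `plan_read` are all hypothetical names for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColdFile:
    path: str    # hypothetical object-storage key
    min_ts: str  # earliest event date in the file (ISO date string)
    max_ts: str  # latest event date in the file


def plan_read(manifest, hot_since, start, end):
    """Decide which tiers and files a query over [start, end] needs to touch."""
    # Hot storage only holds events from hot_since onward.
    read_hot = end >= hot_since
    # Keep only the cold files whose time range overlaps the query range.
    cold_files = [f.path for f in manifest if f.min_ts <= end and f.max_ts >= start]
    return read_hot, cold_files


manifest = [
    ColdFile("events/2023-h1.parquet", "2023-01-01", "2023-06-30"),
    ColdFile("events/2023-h2.parquet", "2023-07-01", "2023-12-31"),
    ColdFile("events/2024-h1.parquet", "2024-01-01", "2024-06-30"),
]
hot_since = "2024-07-01"

historical = plan_read(manifest, hot_since, "2023-08-01", "2023-09-30")
spanning = plan_read(manifest, hot_since, "2024-06-01", "2024-08-01")
recent = plan_read(manifest, hot_since, "2024-09-01", "2024-09-30")
print(historical, spanning, recent)
```

With this kind of metadata held in the database, a purely recent query never touches object storage, and a historical query opens only the files that can contain matching rows.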


Why This Solves Multiple Problems

In Arabic, we have a saying that means:

killing two birds with one stone.

That is what hot/cold storage can do when designed properly.

In reality, it often solves far more than two problems at once.

1. Faster Operational Databases

When historical data leaves the hot path, indexes become smaller.

Smaller indexes are easier to keep in memory, faster to scan, and cheaper to maintain.
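A back-of-the-envelope calculation shows the effect. The numbers below are assumptions chosen only to illustrate the arithmetic: 32 bytes per index entry, 5 billion rows in total, and 50 million rows that are actually hot. Real index sizes depend on the engine, key types, and fill factor.

```python
def index_size_gib(row_count, bytes_per_entry):
    """Rough index size in GiB: one fixed-size entry per row, ignoring overhead."""
    return row_count * bytes_per_entry / 2**30


BYTES_PER_ENTRY = 32        # assumed average size of one index entry
ALL_ROWS = 5_000_000_000    # assumed total rows if nothing is ever archived
HOT_ROWS = 50_000_000       # assumed rows that are recent and active

full_index = index_size_gib(ALL_ROWS, BYTES_PER_ENTRY)
hot_index = index_size_gib(HOT_ROWS, BYTES_PER_ENTRY)

print(f"index over all rows: {full_index:.1f} GiB")
print(f"index over hot rows: {hot_index:.2f} GiB")
```

Under these assumptions the index over everything is roughly 149 GiB, which will not fit in memory on most machines, while the index over hot rows is about 1.5 GiB and easily can.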

2. Lower Storage Costs

Per gigabyte, object storage is typically far cheaper than the provisioned storage behind an operational database, so keeping years of history there costs much less. Columnar formats such as Parquet also compress well, because values from the same column share a type and often repeat, which usually makes the archived copy smaller than the same rows held in hot operational storage.

3. Better Analytics

Historical data naturally fits columnar formats such as Parquet.

Analytical workloads become more efficient without impacting operational traffic, and most modern analytics tools already support these columnar formats natively.

4. Simpler Scaling

Instead of continuously adding caches, replicas, and specialized systems later, the architecture is already prepared for growth.

5. Better Separation of Concerns

Operational workloads stay operational.

Analytical workloads stay analytical.

Each storage layer focuses on what it does best.


This Pattern Is Already Showing Up in Successful Systems

This direction is not only theoretical.

Some successful modern databases are already moving toward architectures where local disk is no longer the center of the system.

ClickHouse Cloud is one example.

ClickHouse has been moving toward separation of storage and compute.

In this model, object storage can hold shared table data, while compute nodes can scale more independently.

Neon is another strong example.

Neon separates Postgres compute from storage.

Its architecture uses object storage as the durable long-term layer, while keeping latency-sensitive work close to compute using RAM, local NVMe, and pageservers.

That teaches a very important lesson:

Object storage is excellent for durability, scale, and cost — but the hot query path still needs smart serving layers.

So the future is not:

Put everything directly on S3 and hope queries are fast.

The better architecture is:

Use object storage as the durable foundation.
Use hot storage and cache for low latency.
Use columnar formats for analytical scans.
Use SQL to hide the complexity from developers.

Why Apache DataFusion Matters for KalamDB

KalamDB uses Apache DataFusion as its query engine to provide a unified SQL layer across both hot and cold storage.

DataFusion is a Rust-based query engine built around Apache Arrow.

It supports SQL, DataFrame APIs, Parquet, CSV, JSON, Avro, object-store access, query optimization, and extension points.

This fits naturally with KalamDB's architecture, where hot data lives in RocksDB and historical data is archived into Parquet files.

From a developer's perspective, there is no need to query different storage systems separately. You simply write a SQL query, and KalamDB automatically determines where the data lives and how to retrieve it.

A query may read:

  • only hot data from RocksDB
  • only cold data from Parquet files
  • a combination of both hot and cold data
  • indexes and embeddings alongside stored records

DataFusion provides the execution and optimization layer that allows KalamDB to unify these sources into a single query result.

This allows KalamDB to focus on:

  • hot operational storage
  • real-time SQL subscriptions
  • metadata management
  • data movement between hot and cold tiers
  • developer experience

DataFusion, meanwhile, handles query planning, optimization, and execution across the different storage layers.

Personally, I think this is one of the reasons DataFusion has become so interesting recently.


This Is Not Only About Analytics

Hot/cold architecture is often discussed as an analytics pattern.

But it is becoming useful for operational systems too.

Imagine a chat application.

Recent messages need to be fast:

    SELECT * FROM messages
    WHERE room_id = 10
    ORDER BY created_at DESC
    LIMIT 50;


But messages from three years ago do not need to sit in the same hot index.

They can be moved to cold storage.

The application still supports search, history, export, analytics, and compliance.

But the hot database stays small and fast.

The user experience improves, and infrastructure costs go down.
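Here is a minimal sketch of how a chat history read could behave under this split. It assumes every cold message is older than every hot message in the same room. The `latest_messages` function, the integer timestamps, and the two in-memory lists are hypothetical stand-ins for the hot index and the cold archive.

```python
def latest_messages(room_id, limit, hot, cold, before=None):
    """Return up to `limit` messages for a room, newest first.

    Reads the hot tier first and only touches the cold tier when the hot
    tier cannot fill the page. Assumes cold messages are older than hot ones.
    """

    def matches(message):
        return message["room_id"] == room_id and (
            before is None or message["created_at"] < before
        )

    def newest_first(messages):
        return sorted(
            (m for m in messages if matches(m)),
            key=lambda m: m["created_at"],
            reverse=True,
        )

    page = newest_first(hot)[:limit]
    touched_cold = False
    if len(page) < limit:
        # Scrolling far enough back: fill the rest of the page from the archive.
        touched_cold = True
        page += newest_first(cold)[: limit - len(page)]
    return page, touched_cold


# Hypothetical data: timestamps are plain integers for readability.
hot = [{"room_id": 10, "created_at": t, "text": f"msg {t}"} for t in (3, 4, 5)]
cold = [{"room_id": 10, "created_at": t, "text": f"msg {t}"} for t in (1, 2)]

first_page, first_used_cold = latest_messages(10, 2, hot, cold)
older_page, older_used_cold = latest_messages(10, 2, hot, cold, before=4)
print([m["created_at"] for m in first_page], first_used_cold)
print([m["created_at"] for m in older_page], older_used_cold)
```

Opening a room is served entirely from hot storage. Only a user who scrolls back past the hot window pays the cost of a cold read, and the application code calling `latest_messages` does not change.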

Another example is user activity.

The last few days of activity may be needed for real-time features.

But older activity can move to Parquet files in object storage.

The system can still query it later for reports, personalization, debugging, or audits.

The database should help manage this lifecycle automatically instead of forcing developers to build everything themselves.


Where KalamDB Fits

KalamDB is being built around this direction.

The goal is pretty simple:

Keep hot data fast.

Archive historical data into cheap columnar storage.

Allow SQL to work across both.

Support real-time applications without forcing all data to remain hot forever.

In this model, RocksDB can be used for operational storage while older data moves into Parquet files on object storage.

Apache DataFusion provides the query layer.

Hot path:

  • recent rows
  • active indexes
  • real-time subscriptions
  • low-latency queries

Cold path:

  • archived rows
  • historical events
  • columnar files
  • cheaper object storage
  • analytical reads

Query layer:

  • Apache DataFusion
  • SQL
  • open formats
  • optimizer

Developers should not need to think about two completely different worlds.

They should simply use SQL and let the system handle the details.

At least that's the idea.

[Figure: KalamDB hot/cold storage, flushing hot data to object storage]


Final Thought

The main point of this article is not that relational databases are going away, or that every system should immediately adopt a hot/cold architecture.

Relational databases remain one of the most successful and productive technologies ever created. For many applications, keeping everything in a single database is still the right choice. And for organizations with diverse workloads, a multi-database architecture can be a very practical way to scale.

What I think is changing is our understanding of where data should live over time.

As datasets continue to grow, and as AI systems generate increasing amounts of operational, analytical, and historical data, treating all data as equally hot becomes harder to justify. The cost of storing, indexing, backing up, and maintaining years of historical information inside operational databases keeps increasing, while object storage and open columnar formats continue to become more capable and more economical.

That is why I believe object storage is evolving from a backup destination into a foundational layer of modern data systems.

The database is no longer just a place where data lives. Increasingly, it becomes a coordinator across storage layers, balancing low-latency operational access with scalable and cost-effective long-term storage.

Whether you choose a single relational database, a collection of specialized databases, or a hot/cold architecture, the same principle applies:

Hot data should stay fast.

Cold data should stay cheap.

Both should remain accessible.

And the systems that make that balance easiest for developers will likely define the next generation of database architecture.


References