53 changes: 52 additions & 1 deletion blog/2025-09-07-how-to-set-up-postgres-apache-iceberg.mdx
@@ -11,10 +11,18 @@ image: /img/blog/cover/postgres-apache-iceberg.webp

Ever wanted to run high-performance analytics on your PostgreSQL data without overloading your production database or breaking your budget? **PostgreSQL to Apache Iceberg replication** is quickly becoming the go-to solution for modern data teams looking to build scalable, cost-effective analytics pipelines.

This comprehensive guide will walk you through everything you need to know about setting up real-time CDC replication from PostgreSQL to Iceberg, including best practices, common pitfalls, and a detailed step-by-step implementation using OLake. Whether you're building a modern data lakehouse architecture or optimizing your existing analytics workflows, this tutorial covers all the essential components.
This comprehensive guide will walk you through everything you need to know about setting up real-time CDC replication from PostgreSQL to Iceberg, including best practices, common pitfalls, and a detailed step-by-step implementation using [OLake](https://olake.io/docs/intro). Whether you're building a modern data lakehouse architecture or optimizing your existing analytics workflows, this tutorial covers all the essential components.

![OLake stream selection UI with Full Refresh + CDC mode for dz-stag-users table](/img/blog/2025/12/lakehouse-image.webp)

## Key Takeaways

- **Protect Production Performance**: Offload heavy analytical queries to Iceberg tables, keeping your PostgreSQL database responsive for application traffic
- **Real-time Logical Replication**: PostgreSQL WAL-based [CDC](https://olake.io/docs/understanding/cdc) streams changes to Iceberg with sub-second latency for up-to-date analytics
- **50-75% Cost Reduction**: Organizations report dramatic savings by moving analytics from expensive PostgreSQL RDS to cost-effective S3 + Iceberg architecture
- **Open Format Flexibility**: Store data once and query with any [engine](https://olake.io/iceberg/query-engine/intro) (Trino, Spark, DuckDB, Athena) - switch tools without data migration
- **Enterprise-Ready Reliability**: OLake handles [schema evolution](https://olake.io/docs/understanding/schema-evolution), CDC recovery, and state management automatically for production deployments

## Why PostgreSQL to Iceberg Replication is Essential for Modern Data Teams

### Unlock Scalable Real-Time Analytics Without Production Impact
@@ -324,4 +332,47 @@ With OLake, you gain access to:
- Production-ready monitoring and management capabilities for enterprise deployments

The combination of PostgreSQL's reliability as an operational database and Apache Iceberg's analytical capabilities creates a powerful foundation for data-driven decision making. Whether you're building real-time dashboards, implementing advanced analytics, or developing machine learning pipelines, this replication strategy provides the scalability and flexibility modern organizations require.

## Frequently Asked Questions

### What's the difference between PostgreSQL and Apache Iceberg?

PostgreSQL is an OLTP database designed for transactional application workloads with fast row-based operations. Apache Iceberg is an open table format optimized for large-scale analytics with columnar storage, built for data lakes rather than operational databases.

### How does PostgreSQL logical replication work?

PostgreSQL writes all changes to a Write-Ahead Log (WAL). Logical replication reads this WAL using replication slots and publications, streaming INSERT, UPDATE, and DELETE operations to downstream systems like Iceberg in real-time without impacting database performance.
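The slot mechanics can be sketched with a toy model (illustrative Python, not a real replication client): the WAL is an append-only log, and a replication slot is just a durable read position that advances as the downstream consumer acknowledges changes.

```python
# Toy model of logical replication: the WAL is an append-only list of
# change records; a "slot" is a persistent read position into it.
# All names here are illustrative, not a real PostgreSQL client API.
from dataclasses import dataclass, field

@dataclass
class ToyWal:
    records: list = field(default_factory=list)

    def write(self, op, row):
        self.records.append((op, row))

@dataclass
class ToySlot:
    wal: ToyWal
    pos: int = 0  # confirmed position: WAL before this point is reclaimable

    def stream(self):
        """Yield changes the downstream has not consumed yet."""
        while self.pos < len(self.wal.records):
            yield self.wal.records[self.pos]
            self.pos += 1  # acknowledge, letting old WAL be recycled

wal = ToyWal()
slot = ToySlot(wal)
wal.write("INSERT", {"id": 1, "name": "ada"})
wal.write("UPDATE", {"id": 1, "name": "ada lovelace"})

downstream = {}  # stand-in for the Iceberg table, keyed by primary key
for op, row in slot.stream():
    if op in ("INSERT", "UPDATE"):
        downstream[row["id"]] = row
    elif op == "DELETE":
        downstream.pop(row["id"], None)
```

After the loop, `downstream` holds the latest version of row 1 and the slot position sits at the end of the WAL, which is exactly why an idle consumer forces PostgreSQL to retain WAL (see the question on WAL bloat below).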

### Do I need PostgreSQL superuser privileges for CDC?

No! While superuser simplifies setup, you only need specific privileges: the REPLICATION attribute and SELECT access on the tables you want to replicate. Cloud providers like AWS RDS and Google Cloud SQL support logical replication with limited-privilege accounts.

### Can I replicate PostgreSQL without enabling logical replication?

Yes! OLake offers JDBC-based Full Refresh and Bookmark-based Incremental sync modes. If you can't modify WAL settings or create replication slots, you can still replicate data using standard PostgreSQL credentials with timestamp-based incremental updates.

### How does OLake handle PostgreSQL schema changes?

OLake automatically detects [schema evolution](https://olake.io/docs/understanding/schema-evolution). When you add, drop, or modify columns in PostgreSQL, these changes propagate to Iceberg tables without breaking your pipeline. The state management ensures schema and data stay synchronized.

### What happens if my PostgreSQL WAL fills up?

Proper replication slot monitoring is crucial. If OLake falls behind, PostgreSQL retains WAL files until they're consumed. OLake provides lag monitoring and automatic recovery to prevent WAL bloat, but you should set appropriate WAL retention limits.
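The lag math behind that monitoring is simple once you decode LSNs: an LSN like `0/16B3748` is two hex halves of a 64-bit byte offset, and lag is just the difference between the server's current WAL position and the slot's confirmed flush position (the same arithmetic as `pg_wal_lsn_diff`). A minimal sketch:

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert a PostgreSQL LSN like '0/16B3748' to an absolute byte offset.
    The part before the slash is the high 32 bits, the part after the low 32."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)

def replication_lag_bytes(current_wal_lsn: str, confirmed_flush_lsn: str) -> int:
    # Same arithmetic as pg_wal_lsn_diff(current, confirmed) in SQL
    return lsn_to_bytes(current_wal_lsn) - lsn_to_bytes(confirmed_flush_lsn)

# e.g. comparing pg_current_wal_lsn() with the slot's confirmed_flush_lsn
lag = replication_lag_bytes("0/16B3748", "0/16B0000")  # 14152 bytes behind
```

Alert when this number grows steadily: it means WAL is accumulating faster than the consumer drains it.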

### How do I handle large PostgreSQL databases for initial load?

OLake uses intelligent chunking strategies (CTID-based or batch splits) to load data in parallel without locking tables. A 1TB PostgreSQL database typically loads in 4-8 hours depending on network and storage performance, and the process can be paused/resumed.
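The CTID chunking idea is easy to picture (this is an illustrative sketch of the technique, not OLake's actual implementation): split the table's heap pages into disjoint CTID ranges so parallel workers can each scan `WHERE ctid >= lower AND ctid < upper` without coordination or locks.

```python
def ctid_chunks(total_pages: int, pages_per_chunk: int):
    """Split a table's heap pages into non-overlapping CTID ranges.
    Each range can be scanned by an independent worker, e.g.
    SELECT * FROM t WHERE ctid >= '(0,0)' AND ctid < '(1000,0)'."""
    for start in range(0, total_pages, pages_per_chunk):
        end = min(start + pages_per_chunk, total_pages)
        yield (f"'({start},0)'", f"'({end},0)'")

# A table with 2500 heap pages split into chunks of 1000 pages each
chunks = list(ctid_chunks(total_pages=2500, pages_per_chunk=1000))
```

Here `chunks` contains three disjoint ranges covering pages 0-999, 1000-1999, and 2000-2499, so three workers can load the table in parallel.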

### What query engines work with PostgreSQL-sourced Iceberg tables?

Any Iceberg-compatible engine: [Apache Spark](https://olake.io/iceberg/query-engine/spark) for batch processing, [Trino](https://olake.io/iceberg/query-engine/trino)/[Presto](https://olake.io/iceberg/query-engine/presto) for interactive queries, [DuckDB](https://olake.io/iceberg/query-engine/duckdb) for fast analytical workloads, [AWS Athena](https://olake.io/iceberg/query-engine/athena) for serverless SQL, [Snowflake](https://olake.io/iceberg/query-engine/snowflake), [Databricks](https://olake.io/iceberg/query-engine/databricks), and many others - all querying the same data.

### Can I replicate specific PostgreSQL tables or schemas?

Yes! OLake lets you select specific tables, schemas, or even filter rows using SQL WHERE clauses. This selective replication reduces storage costs and improves query performance by replicating only the data you need for analytics.

### What's the cost comparison between PostgreSQL RDS and Iceberg on S3?

PostgreSQL RDS storage costs ~$0.115/GB/month plus compute charges that run 24/7. Iceberg on S3 costs ~$0.023/GB/month (5x cheaper) with compute costs only when querying. Organizations typically save 50-75% on analytics infrastructure.
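Plugging those list prices into a quick back-of-the-envelope calculation (storage only; compute savings come on top and vary by workload, and the 1 TB figure is a hypothetical dataset size):

```python
def monthly_storage_cost(gb: float, price_per_gb: float) -> float:
    return gb * price_per_gb

data_gb = 1_000  # hypothetical 1 TB analytics dataset
rds = monthly_storage_cost(data_gb, 0.115)  # RDS storage, $/GB/month
s3 = monthly_storage_cost(data_gb, 0.023)   # S3 Standard, $/GB/month

saving_pct = 100 * (1 - s3 / rds)
print(f"RDS: ${rds:.0f}/mo, S3: ${s3:.0f}/mo, storage saving: {saving_pct:.0f}%")
```

Storage alone comes out to $115/month on RDS versus $23/month on S3, an 80% reduction before you count the always-on RDS compute you no longer need for analytics.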

<BlogCTA/>
62 changes: 56 additions & 6 deletions blog/2025-09-09-mysql-to-apache-iceberg-replication.mdx
@@ -11,11 +11,19 @@ image: /img/blog/cover/setup-sql-iceberg.webp

**MySQL** powers countless production applications as a reliable operational database. But when it comes to analytics at scale, running heavy queries directly on MySQL can quickly become expensive, slow, and disruptive to transactional workloads.

That's where **Apache Iceberg** comes in. By replicating MySQL data into Iceberg tables, you can unlock a modern, open-format data lakehouse that supports real-time analytics, schema evolution, partitioning, and time travel queries all without burdening your source database.
That's where **[Apache Iceberg](https://olake.io/iceberg/intro)** comes in. By replicating MySQL data into Iceberg tables, you can unlock a modern, open-format data lakehouse that supports real-time analytics, schema evolution, partitioning, and time travel queries, all without burdening your source database.

Apache Iceberg is more than an average table format and it's designed for large-scale, cost-effective analytics. With native support for ACID transactions, seamless schema evolution, and compatibility with engines like Trino, Spark, and DuckDB, it's ideal for modern data lakehouses.
[Apache Iceberg](https://olake.io/iceberg/intro) is more than just another table format: it's designed for large-scale, cost-effective analytics. With native support for ACID transactions, seamless schema evolution, and compatibility with [query engines](https://olake.io/iceberg/query-engine/intro) like Trino, Spark, and DuckDB, it's ideal for modern data lakehouses.

In this comprehensive guide, we'll walk through setting up a real-time pipeline from MySQL to Apache Iceberg using OLake, covering both UI and CLI approaches. We'll explore why companies like Netflix, Natural Intelligence, and Memed have successfully migrated to Iceberg architectures, achieving dramatic performance improvements and cost savings.
In this comprehensive guide, we'll walk through setting up a real-time pipeline from MySQL to Apache Iceberg using [OLake](https://olake.io/docs/intro), covering both UI and CLI approaches. We'll explore why companies like Netflix, Natural Intelligence, and Memed have successfully migrated to Iceberg architectures, achieving dramatic performance improvements and cost savings.

## Key Takeaways

- **Offload Analytics from Production**: Replicate MySQL to Iceberg to run heavy analytical queries without impacting your production database performance
- **Real-time Data Sync**: [CDC](https://olake.io/docs/understanding/cdc) via binlogs keeps Iceberg tables up-to-date with sub-second latency for real-time dashboards and reporting
- **Massive Cost Savings**: Companies like Netflix achieved 25% cost reduction and Memed saw 60x faster ETL processing times
- **Open Format Freedom**: Store data once in S3 and query with any engine (Trino, Spark, DuckDB) - no vendor lock-in
- **Enterprise Features Built-in**: Get automatic [schema evolution](https://olake.io/docs/understanding/schema-evolution), ACID transactions, time travel, and [partitioning](https://olake.io/docs/understanding/iceberg-partitioning) without complex engineering

## The Growing Problem: Why MySQL Analytics Hit Performance Walls

@@ -142,11 +150,11 @@ Before starting your MySQL to Apache Iceberg replication, ensure you have the fo
- Appropriate binlog retention settings

**Destination Catalog for Iceberg:**
- AWS Glue + S3 (recommended for this guide)
- [AWS Glue](https://olake.io/docs/connectors/glue-catalog) + S3 (recommended for this guide)
- Hive Metastore + HDFS/MinIO (alternative)
- Other supported catalogs (Nessie, Polaris, Unity)
- Other [supported catalogs](https://olake.io/docs/writers/iceberg/catalog/intro) (Nessie, Polaris, Unity)

**Optional Query Engine**: Athena/Trino/Presto or Spark SQL for result validation
**Optional Query Engine**: [Athena](https://olake.io/iceberg/query-engine/athena)/[Trino](https://olake.io/iceberg/query-engine/trino)/[Presto](https://olake.io/iceberg/query-engine/presto) or [Spark](https://olake.io/iceberg/query-engine/spark) SQL for result validation

For comprehensive MySQL setup details, follow this documentation: [MySQL Connector Setup](https://olake.io/docs/connectors/mysql)
For AWS Glue catalog quick setup: [Glue Catalog Configuration](https://olake.io/docs/connectors/glue-catalog)
@@ -396,6 +404,48 @@ Start your MySQL to Apache Iceberg migration today and unlock the full analytica

As the data landscape continues evolving toward open, cloud-native architectures, organizations embracing Apache Iceberg lakehouse patterns position themselves for scalable growth while maintaining operational excellence. The question isn't whether to migrate from MySQL analytics, it's how quickly you can implement this transformation to stay competitive in today's data-driven economy.

## Frequently Asked Questions

### What is the difference between MySQL and Apache Iceberg?

MySQL is an OLTP (Online Transaction Processing) database designed for handling live application transactions with fast reads and writes. Apache Iceberg is an open table format designed for large-scale analytics on data lakes, optimized for complex queries and petabyte-scale data storage.

### How does CDC (Change Data Capture) work with MySQL?

CDC tracks changes in MySQL by reading the binary log (binlog), which records every insert, update, and delete operation. OLake connects to the binlog and streams these changes in real-time to your Iceberg tables without impacting production performance.

### Can I replicate MySQL to Iceberg without CDC?

Yes! OLake offers JDBC-based Full Refresh and Bookmark-based Incremental sync modes. If you don't have permissions to enable binlogs, you can start syncing immediately with standard MySQL credentials.

### What happens to my MySQL schema changes?

OLake automatically handles [schema evolution](https://olake.io/docs/understanding/schema-evolution). When you add, drop, or modify columns in MySQL, these changes are detected and propagated to your Iceberg tables without breaking your pipeline.

### How much does it cost to store data in Iceberg vs MySQL?

Iceberg storage on S3 costs approximately $0.023 per GB/month, compared to MySQL RDS storage at $0.115 per GB/month - that's 5x cheaper. Plus, you separate compute from storage, so you only pay for queries when you run them.

### What query engines can I use with Iceberg tables?

Apache Iceberg is an open format compatible with: [Trino](https://olake.io/iceberg/query-engine/trino), [Presto](https://olake.io/iceberg/query-engine/presto), [Apache Spark](https://olake.io/iceberg/query-engine/spark), [DuckDB](https://olake.io/iceberg/query-engine/duckdb), [AWS Athena](https://olake.io/iceberg/query-engine/athena), [Snowflake](https://olake.io/iceberg/query-engine/snowflake), [Databricks](https://olake.io/iceberg/query-engine/databricks), and many others. You can switch engines anytime without rewriting data.

### How do I handle partitioning for optimal query performance?

Choose partition columns based on your query patterns: use timestamp fields (created_at, updated_at) for time-series queries, or dimensional fields (customer_id, region) for lookup queries. OLake supports regex-based [partitioning configuration](https://olake.io/docs/understanding/iceberg-partitioning).
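Under the hood, Iceberg's `day()` transform maps a timestamp to days since the Unix epoch, which is why rows from the same calendar day cluster into the same partition file. A simplified sketch of that transform:

```python
from datetime import date

EPOCH = date(1970, 1, 1)

def iceberg_day_partition(ts: date) -> int:
    """Simplified view of Iceberg's day() partition transform:
    the partition value is the number of days since the Unix epoch."""
    return (ts - EPOCH).days

# All rows with created_at on 2025-01-01 share partition value 20089
value = iceberg_day_partition(date(2025, 1, 1))
```

Because the transform is derived from the column, queries filtering on `created_at` automatically prune partitions without you ever writing the partition value yourself.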

### Is the initial full load safe for large MySQL databases?

Yes! OLake uses primary key-based chunking to load data in batches without locking your MySQL tables. The process runs in parallel and can be paused/resumed if needed.

### What happens if my replication pipeline fails?

OLake maintains a state.json file that tracks replication progress. If the pipeline fails, it automatically resumes from the last successfully processed position, ensuring no data loss.
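The resume pattern can be sketched in a few lines (an illustrative checkpoint file, not OLake's actual `state.json` format): persist the position after each processed event, and skip everything before it on restart.

```python
import json
import os
import tempfile

def load_state(path):
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"position": 0}

def save_state(path, state):
    # Write-then-rename so a crash never leaves a half-written state file
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)

def sync(events, path):
    state = load_state(path)
    applied = []
    for i, ev in enumerate(events):
        if i < state["position"]:
            continue  # already processed in an earlier run
        applied.append(ev)
        state["position"] = i + 1
        save_state(path, state)
    return applied

path = os.path.join(tempfile.mkdtemp(), "state.json")
first = sync(["e1", "e2"], path)         # fresh run: processes both events
second = sync(["e1", "e2", "e3"], path)  # restart: resumes, processes only "e3"
```

The write-then-rename step matters: the checkpoint must be updated atomically, or a crash mid-write would corrupt the very file that makes recovery possible.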

### Can I query both MySQL and Iceberg simultaneously?

Absolutely! Your MySQL database continues serving production traffic while Iceberg handles analytics. This separation ensures operational workloads never compete with analytical queries for resources.

Happy syncing! 🧊🐘

<BlogCTA/>
52 changes: 51 additions & 1 deletion blog/2025-09-10-how-to-set-up-mongodb-apache-iceberg.mdx
@@ -15,7 +15,15 @@ That's where **Apache Iceberg** comes in. By replicating MongoDB data into Icebe

Apache Iceberg is designed for large-scale, cost-effective analytics with native support for ACID transactions, seamless schema evolution, and compatibility with engines like Trino, Spark, and DuckDB. It's the perfect complement to MongoDB's operational strengths.

In this comprehensive guide, we'll walk through setting up a real-time pipeline from MongoDB to Apache Iceberg using OLake, covering both UI and CLI approaches. We'll explore why companies are successfully migrating to Iceberg architectures, achieving dramatic performance improvements and cost savings.
In this comprehensive guide, we'll walk through setting up a real-time pipeline from MongoDB to Apache Iceberg using [OLake](https://olake.io/docs/intro), covering both UI and CLI approaches. We'll explore why companies are successfully migrating to Iceberg architectures, achieving dramatic performance improvements and cost savings.

## Key Takeaways

- **Solve MongoDB Analytics Bottlenecks**: Run complex aggregations and joins on Iceberg without slowing down your MongoDB production workloads
- **Real-time Change Streams**: MongoDB [Change Streams](https://olake.io/docs/understanding/cdc) provide millisecond-latency CDC to keep Iceberg tables continuously synchronized
- **Handle Flexible Schemas**: OLake automatically manages MongoDB's dynamic [schema evolution](https://olake.io/docs/understanding/schema-evolution), converting BSON documents to Iceberg-compatible structures
- **Petabyte-Scale Analytics**: Query terabytes or petabytes of data using columnar storage on S3, with costs 5x lower than operational MongoDB
- **Multi-Engine Freedom**: Access your MongoDB data through [Trino](https://olake.io/iceberg/query-engine/trino), [Spark](https://olake.io/iceberg/query-engine/spark), [DuckDB](https://olake.io/iceberg/query-engine/duckdb), or [Athena](https://olake.io/iceberg/query-engine/athena) using standard SQL - no MongoDB query language required

## The Growing Problem: Why MongoDB Analytics Hit Performance Walls

@@ -343,6 +351,48 @@ The combination of MongoDB's operational flexibility and Iceberg's analytical ca

As the data landscape continues evolving toward open, cloud-native architectures, organizations embracing Apache Iceberg lakehouse patterns position themselves for scalable growth while maintaining operational excellence. The question isn't whether to migrate from MongoDB analytics, it's how quickly you can implement this transformation to stay competitive in today's data-driven economy.

## Frequently Asked Questions

### Why can't I just run analytics directly on MongoDB?

MongoDB is optimized for operational workloads with fast document reads/writes. Complex analytical queries (aggregations, joins, large scans) consume significant resources and slow down production applications. Replicating to Iceberg separates analytics from operations, keeping both performant.

### How does MongoDB Change Streams work for CDC?

Change Streams tap into MongoDB's oplog (operation log) to capture every insert, update, and delete in real-time. OLake reads these changes continuously and applies them to Iceberg tables without impacting MongoDB performance or requiring application changes.

### Do I need a MongoDB replica set for replication?

For real-time CDC with Change Streams, yes - MongoDB requires replica set mode. However, OLake also offers JDBC-based Full Refresh and Bookmark-based Incremental modes that work with standalone MongoDB instances if you have permission limitations.

### How does OLake handle MongoDB's flexible schemas?

MongoDB documents in the same collection can have different fields. OLake automatically detects [schema changes](https://olake.io/docs/understanding/schema-evolution) and evolves your Iceberg tables accordingly, adding new columns when new fields appear while maintaining backward compatibility.

### What happens to nested MongoDB documents in Iceberg?

OLake intelligently flattens nested BSON structures into Iceberg-compatible schemas. Complex nested objects become structured columns in Iceberg tables, making them queryable with standard SQL rather than MongoDB's aggregation framework.
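One common flattening strategy (shown here as an illustrative sketch, not necessarily OLake's exact mapping) turns nested objects into dotted column names while leaving arrays alone:

```python
def flatten(doc, prefix=""):
    """Flatten nested dicts into dotted column names.
    Lists are kept as-is here; a real pipeline might map them to
    Iceberg list types or explode them into child tables instead."""
    out = {}
    for key, value in doc.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            out.update(flatten(value, name + "."))
        else:
            out[name] = value
    return out

doc = {
    "_id": 7,
    "user": {"name": "ada", "address": {"city": "London"}},
    "tags": ["a", "b"],
}
row = flatten(doc)
```

The nested document becomes a flat row with columns `_id`, `user.name`, `user.address.city`, and `tags`, directly queryable with `SELECT "user.address.city" FROM ...` in a SQL engine.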

### Can I filter which MongoDB collections to replicate?

Yes! OLake allows you to select specific collections and even apply MongoDB aggregation pipeline filters to replicate only the data you need, reducing storage costs and improving query performance.

### How long does the initial MongoDB to Iceberg load take?

Initial load time depends on your data volume and MongoDB performance. OLake processes collections in parallel and can be paused/resumed. For example, a 500GB MongoDB database typically loads in 2-4 hours depending on network and storage speed.

### What's the difference between Change Streams and binlog CDC?

Change Streams is MongoDB's native change tracking mechanism (similar to MySQL binlogs). It provides a stream of document-level changes that OLake captures and applies to Iceberg tables in real-time.

### Can I query both MongoDB and Iceberg simultaneously?

Absolutely! MongoDB continues serving your application traffic while Iceberg handles analytics. This architecture ensures your operational database never competes with analytical workloads for resources.

### How much does Iceberg storage cost compared to MongoDB?

S3 storage for Iceberg costs ~$0.023/GB/month compared to MongoDB Atlas storage at ~$0.25/GB/month (10x cheaper). Plus, Iceberg's columnar format compresses better, and you only pay for compute when running queries.

Happy syncing!

<BlogCTA/>