ncwhh/dataflow-jdbc-connection-pool

Limiting JDBC Connections in Dataflow DoFn

This example demonstrates how to avoid creating too many database connections in a Google Cloud Dataflow job by using a singleton JDBC connection pool per worker.

When scaling Dataflow, each worker may create multiple DoFn instances. If each DoFn opens its own JDBC connection, the database can quickly become overloaded. This project shows how to share a limited pool of JDBC connections across all DoFn instances on the same worker.

🔑 Key Idea

  • Use a Java singleton to ensure that only one connection pool exists per worker VM.
  • Each DoFn borrows connections from this shared pool.
  • Upper bound on total connections:
    maxNumWorkers × maximumPoolSize

This keeps database load bounded and predictable, even as your Dataflow job scales out.
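The budget calculation above is simple enough to check directly. A minimal sketch with hypothetical values (the class and values below are illustrative, not part of the repo):

```java
// Worked example of the connection ceiling: a job capped at 10 workers,
// each worker's pool capped at 5 connections, can never open more than
// 50 database connections in total.
public class ConnectionBudget {
    public static int maxConnections(int maxNumWorkers, int maximumPoolSize) {
        return maxNumWorkers * maximumPoolSize;
    }

    public static void main(String[] args) {
        // maxNumWorkers comes from the Dataflow --maxNumWorkers option,
        // maximumPoolSize from the pool configuration (e.g. HikariCP).
        System.out.println(maxConnections(10, 5)); // prints 50
    }
}
```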

🛠️ How It Works

  1. Singleton Connection Pool
    Implemented in JdbcClient.java.
    A static singleton manages JDBC connections, initialized once per worker.

  2. DoFn Usage
    Demonstrated in JdbcDoFn.java.
    The DoFn retrieves connections from the pool instead of opening new ones.

  3. Connection Limits
    The maximum number of concurrent DB connections is controlled by:

    • maxNumWorkers (Dataflow setting)
    • maximumPoolSize (connection pool config)
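The three steps above can be sketched with a JDK-only stand-in. The real JdbcClient.java manages actual JDBC connections (typically via a pool such as HikariCP); here, plain strings stand in for connections so the example runs without a database, and the class and method names are illustrative rather than the repo's API:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Singleton connection pool, one instance per worker JVM. Class
// initialization is JVM-wide and thread-safe, so every DoFn instance
// on the same worker shares the same pool.
public final class WorkerPool {
    private static final int MAXIMUM_POOL_SIZE = 5; // per-worker cap

    private static final WorkerPool INSTANCE = new WorkerPool();

    private final BlockingQueue<String> connections =
            new ArrayBlockingQueue<>(MAXIMUM_POOL_SIZE);

    private WorkerPool() {
        for (int i = 0; i < MAXIMUM_POOL_SIZE; i++) {
            // Real code would open JDBC connections here.
            connections.add("connection-" + i);
        }
    }

    public static WorkerPool getInstance() {
        return INSTANCE;
    }

    // A DoFn borrows a connection in its @ProcessElement method;
    // this blocks once all MAXIMUM_POOL_SIZE connections are in use.
    public String borrow() {
        try {
            return connections.take();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException("Interrupted while waiting for a connection", e);
        }
    }

    // The DoFn must return the connection when done (finally block).
    public void release(String connection) {
        connections.add(connection);
    }
}
```

A DoFn would call `WorkerPool.getInstance().borrow()`, run its query, and `release()` in a finally block. With a real pool like HikariCP, borrow and release map onto `DataSource.getConnection()` and `Connection.close()`.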

⚠️ When to Use This Pattern

✅ Use it when:

  • Your pipeline performs lookups or element-wise operations on a database (e.g., enriching records with reference data).
  • You need to scale safely without overwhelming the DB.
  • You cannot use bulk export/import tools for your use case.

❌ Avoid it when:

  • You need to read/write entire tables — in that case, prefer JdbcIO, which is designed for bulk operations.

👉 This pattern does not replace JdbcIO.
It complements it, and is specifically meant for per-element database access in a DoFn where direct JDBC calls are unavoidable.

ℹ️ Limitations

Depending on how Dataflow distributes work across workers, the connection limit may not be fully utilized: not every worker necessarily executes the part of the pipeline that requires a JDBC connection.
