This example demonstrates how to avoid creating too many database connections in a Google Cloud Dataflow job by using a singleton JDBC connection pool per worker.
When scaling Dataflow, each worker may create multiple `DoFn` instances. If each `DoFn` opens its own JDBC connection, the database can quickly become overloaded. This project shows how to share a limited pool of JDBC connections across all `DoFn` instances on the same worker.
- Use a Java singleton to ensure that only one connection pool exists per worker VM.
- Each `DoFn` borrows connections from this shared pool.
- Total number of connections = `maxNumWorkers × maximumPoolSize`.

This guarantees predictable database load, even when scaling your Dataflow job.
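
As an illustration, a per-worker singleton pool might look like the sketch below. This is not the repository's actual `JdbcClient.java`; it assumes HikariCP as the pool implementation, and the JDBC URL and credentials are made up:

```java
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;
import java.sql.Connection;
import java.sql.SQLException;

/** Hypothetical sketch: one connection pool per worker JVM. */
public final class JdbcClient {

  // One pool per worker VM; volatile + double-checked locking gives safe lazy init.
  private static volatile HikariDataSource pool;

  private JdbcClient() {}

  /** Lazily creates the pool on first use; all DoFn instances on the worker share it. */
  public static Connection getConnection() throws SQLException {
    if (pool == null) {
      synchronized (JdbcClient.class) {
        if (pool == null) {
          HikariConfig config = new HikariConfig();
          // Hypothetical connection details; replace with your own.
          config.setJdbcUrl("jdbc:postgresql://db-host:5432/mydb");
          config.setUsername("user");
          config.setPassword("secret");
          // Caps connections per worker; total DB load = maxNumWorkers × maximumPoolSize.
          config.setMaximumPoolSize(5);
          pool = new HikariDataSource(config);
        }
      }
    }
    return pool.getConnection();
  }
}
```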
- **Singleton Connection Pool**
  Implemented in `JdbcClient.java`. A static singleton manages JDBC connections, initialized once per worker.
- **DoFn Usage**
  Demonstrated in `JdbcDoFn.java`. The `DoFn` retrieves connections from the pool instead of opening new ones.
- **Connection Limits**
  The maximum number of concurrent DB connections is controlled by:
  - `maxNumWorkers` (Dataflow setting)
  - `maximumPoolSize` (connection pool config)
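
For illustration, such a `DoFn` might borrow a connection per element as sketched below. This is not the repository's exact `JdbcDoFn.java`; the query, table, and column names are hypothetical:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import org.apache.beam.sdk.transforms.DoFn;

/** Hypothetical sketch: enriches each element via the shared per-worker pool. */
public class JdbcDoFn extends DoFn<String, String> {

  @ProcessElement
  public void processElement(@Element String id, OutputReceiver<String> out) throws Exception {
    // For a pooled connection, close() returns it to the pool rather than closing it.
    try (Connection conn = JdbcClient.getConnection();
        PreparedStatement stmt =
            conn.prepareStatement("SELECT name FROM customers WHERE id = ?")) {
      stmt.setString(1, id);
      try (ResultSet rs = stmt.executeQuery()) {
        out.output(rs.next() ? id + "," + rs.getString("name") : id + ",unknown");
      }
    }
  }
}
```

With this setup, running the pipeline with `--maxNumWorkers=10` and a `maximumPoolSize` of 5 caps the database at 10 × 5 = 50 concurrent connections, no matter how many `DoFn` instances Dataflow creates.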
✅ Use it when:
- Your pipeline performs lookups or element-wise operations on a database (e.g., enriching records with reference data).
- You need to scale safely without overwhelming the DB.
- You cannot use bulk export/import tools for your use case.
❌ Avoid it when:
- You need to read or write entire tables; in that case, prefer `JdbcIO`, which is designed for bulk operations.
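
For contrast, a bulk read with `JdbcIO` looks roughly like the snippet below. It is a sketch that assumes an existing `pipeline` object, a Postgres driver, and a hypothetical `customers` table:

```java
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.jdbc.JdbcIO;

// Bulk table read: JdbcIO manages its own connections and is the right tool here.
pipeline.apply(
    "ReadCustomers",
    JdbcIO.<String>read()
        .withDataSourceConfiguration(
            JdbcIO.DataSourceConfiguration.create(
                "org.postgresql.Driver", "jdbc:postgresql://db-host:5432/mydb"))
        .withQuery("SELECT name FROM customers")
        .withRowMapper(rs -> rs.getString("name"))
        .withCoder(StringUtf8Coder.of()));
```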
👉 This pattern does not replace `JdbcIO`. It complements it, and is specifically meant for per-element database access in a `DoFn` where direct JDBC calls are unavoidable.
Depending on how Dataflow distributes work across workers, the maximum number of connections might not be fully utilized, because not every worker is necessarily involved in executing the part of the pipeline that requires a JDBC connection.