
Fix Postgres OOM by limiting fetch size #758


Merged: 8 commits merged into confluentinc:5.0.x on Jan 15, 2020

Conversation

gharris1727 (Contributor)

This PR adds an integration test with an embedded postgres instance.
There are three tests:

  1. Verify that the test setup produces OOM exceptions with large batch sizes.
  2. Verify that the test setup does not produce OOM exceptions with small batch sizes.
  3. Verify that tables are not left locked when autoCommit is disabled.

Test 3 addresses a side-effect of autoCommit that was called out in #466, where read-only queries with autoCommit disabled would leave open locks on the database. This test verifies that the JdbcSourceTask does not leave open locks from any operations.
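
Roughly, the lock check in that third test can be sketched like this (the JDBC URL, table name, and setup details below are illustrative placeholders, not the exact test code): after the task polls, a separate connection queries pg_locks for any locks still held on the table.

import static org.junit.Assert.assertEquals;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import org.junit.Test;

public class NoLingeringLocksSketch {

  // Placeholders for the embedded-postgres URL and the table under test.
  private static final String JDBC_URL = "jdbc:postgresql://localhost:5432/postgres";
  private static final String TABLE = "test_table";

  @Test
  public void taskDoesNotLeaveTableLocked() throws Exception {
    // ... start the JdbcSourceTask and call task.poll() here ...

    // Then verify from a separate connection that no locks remain on the table.
    try (Connection conn = DriverManager.getConnection(JDBC_URL);
         PreparedStatement stmt = conn.prepareStatement(
             "SELECT count(*) FROM pg_locks l"
                 + " JOIN pg_class c ON l.relation = c.oid"
                 + " WHERE c.relname = ?")) {
      stmt.setString(1, TABLE);
      try (ResultSet rs = stmt.executeQuery()) {
        rs.next();
        assertEquals(0, rs.getInt(1)); // no locks left behind by the source task
      }
    }
  }
}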

#466 specifically observed that the TableMonitorThread would leave open locks if auto-commit was disabled on that connection. This PR circumvents that issue by leaving the TableMonitorThread connection as-is, with autoCommit enabled. There are no tests covering the Connector or TableMonitorThread, because the behavior of those classes is unaffected by these changes. A connector test would also have some extra complication as the TableMonitorThread is asynchronous.

Fixes #34.

@gharris1727 requested a review from a team December 5, 2019 23:10
@gharris1727 changed the base branch from master to 5.0.x December 5, 2019 23:11
@levzem left a comment

Two minor comments, but otherwise LGTM.

@gharris1727 requested a review from levzem December 5, 2019 23:39
@levzem left a comment

LGTM

@gharris1727 (Contributor Author)

Note that io.zonky.test:embedded-postgres is Apache 2.0 licensed.

@javabrett self-requested a review December 7, 2019 01:20
@wicknicks (Member) left a comment

nice work, @gharris1727. some comments/questions.


  task = new JdbcSourceTask();
  task.start(props);
}
Member

do we need to test what happens when the table monitor thread is running in the same JVM as these tasks that are setting auto commit to false?

Contributor Author

The table monitor thread has essentially the same relationship to this task as the other tasks from the JDBC connector do, and we don't see any issues from this in production.
This is already effectively tested by running an existing source and sink connector in the same JVM, since sink connectors disable autocommit.

 * @param stmt the prepared statement; never null
 * @throws SQLException the error that might result from initialization
 */
@Override
protected void initializePreparedStatement(PreparedStatement stmt) throws SQLException {
  log.trace("Initializing PreparedStatement fetch direction to FETCH_FORWARD for '{}'", stmt);
  stmt.setFetchDirection(ResultSet.FETCH_FORWARD);
  stmt.setFetchSize(config.getInt(JdbcSourceConnectorConfig.BATCH_MAX_ROWS_CONFIG));
Member

For Postgres specifically, the original issue found that there is a Postgres JDBC URL parameter, defaultRowFetchSize, which could be set but was ineffective, since it relies on autoCommit being disabled, which cannot be done via the URL and is only settable in code.

This change sets fetchSize in code (unconditionally), based on the existing BATCH_MAX_ROWS_CONFIG, which will default to 100 if absent.
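
For reference, the Postgres driver only switches from buffering the whole result set to cursor-based streaming when autoCommit is off, the statement is forward-only, and a non-zero fetch size is set. Roughly, in plain JDBC (the URL and table below are placeholders, not code from this PR):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class PostgresStreamingSketch {
  public static void main(String[] args) throws Exception {
    // Placeholder URL; defaultRowFetchSize could be appended here, but it only
    // takes effect once the other conditions below are also met.
    try (Connection conn = DriverManager.getConnection(
        "jdbc:postgresql://localhost:5432/postgres")) {
      conn.setAutoCommit(false); // required for cursor-based fetching

      try (PreparedStatement stmt = conn.prepareStatement(
               "SELECT * FROM big_table",
               ResultSet.TYPE_FORWARD_ONLY,
               ResultSet.CONCUR_READ_ONLY)) {
        stmt.setFetchSize(100); // rows buffered per round trip, not per query

        try (ResultSet rs = stmt.executeQuery()) {
          while (rs.next()) {
            // Rows arrive 100 at a time instead of the whole result set at
            // once, which bounds the driver-side memory footprint.
          }
        }
      }
      conn.commit(); // end the transaction so no locks are left open
    }
  }
}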

A couple of questions:

  • This is only being added for Postgres, where there is the URL option defaultRowFetchSize - do other databases, which may or may not have a JDBC URL fetch-size option, possibly also suffer from this issue? Should we check for any other common databases that could have this issue and no URL option?
  • Does it matter that this will now always silently ignore any defaultRowFetchSize set in the URL? That might be confusing.

Contributor Author

  1. As far as I understand, MySQL has similar behavior, except that its URL workaround is effective. I'm sure there's at least one other database we don't know about that has this same issue and would require the same fix.

     However, without the ability to integration-test this functionality, I don't feel comfortable making changes to those dialects. Even for Postgres, I needed to find a way to integration-test this change before I felt comfortable putting it up as a PR. The cost of adding integration tests for the other dialects is also higher than we can afford right now, and beyond the scope of this fix.

  2. If someone included this URL option without knowing what it does, then I don't think they would know or care that it is being ignored. So I'll assume that anyone using this URL option put it there deliberately.

     If the URL option is higher than the batch size (perhaps set by someone trying to optimize throughput), then it is effectively the same as if it were equal to the batch size, except with higher memory usage. This is because the connector will still read at most the configured maximum batch size of rows from the result set, regardless of the fetch size.

     If the URL option is lower than the batch size (perhaps set by someone trying to optimize memory), then they would notice that it didn't reduce memory usage. This is non-obvious, but if they're optimizing memory usage, why would they change this URL parameter before reducing the batch size? Tuning the batch size gives effectively the same memory behavior they want, but it is supported by the connector itself and is documented behavior, rather than depending on this pass-through driver tuning parameter.
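
To make that interaction concrete, here is a rough sketch of the connector-side read loop (the names and types below are illustrative, not the connector's actual code): the fetch size only bounds how many rows the driver buffers per round trip to the server, while the batch size bounds how many rows are read from the ResultSet per poll.

import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

public class BatchVsFetchSizeSketch {

  // Hypothetical record type standing in for Kafka Connect's SourceRecord.
  static class Row { }

  // Hypothetical converter standing in for the connector's row-to-record mapping.
  static Row convert(ResultSet rs) {
    return new Row();
  }

  // The driver buffers at most fetchSize rows per server round trip, but this
  // loop stops after batchMaxRows rows per poll regardless of the fetch size,
  // so a fetch size larger than the batch size only adds memory usage.
  static List<Row> poll(ResultSet rs, int batchMaxRows) throws SQLException {
    List<Row> records = new ArrayList<>();
    while (records.size() < batchMaxRows && rs.next()) {
      records.add(convert(rs));
    }
    return records;
  }
}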

@javabrett (Member) left a comment

I remain uneasy about the setFetchSize logic being applied only for Postgres, or it only being possible to apply it for Postgres. If there's a concern that this might not be safe to apply to all databases/drivers, then we could find a way to not auto-apply it from the batch size (although that logic is attractive). I guess I'm curious: what are the effective batch size, fetch size, and memory consumption of other databases? Maybe they are automatically using forward-scrolling cursors by default?

It feels like the autoCommit change could be safely applied by making this a new option; then any behavioural change would be isolated to situations where it is required, e.g. Postgres, where it would be documented.

I think the change is good and the above thoughts are take-or-leave only.
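
For illustration only, such an opt-in override could look something like the following sketch using Kafka's ConfigDef (the option name, default, and wiring below are hypothetical; the PR as written does not add such a config):

import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.common.config.ConfigDef.Importance;
import org.apache.kafka.common.config.ConfigDef.Type;

public class FetchSizeConfigSketch {

  // Hypothetical option name; not part of this PR.
  public static final String FETCH_SIZE_CONFIG = "fetch.size";

  public static ConfigDef baseConfigDef() {
    return new ConfigDef()
        .define(
            FETCH_SIZE_CONFIG,
            Type.INT,
            0, // 0 = leave the driver default / URL parameter in effect
            Importance.LOW,
            "Number of rows to fetch per driver round trip; 0 disables the override.");
  }
}

A default of 0 would keep today's behaviour unless a user explicitly opted in.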

@gharris1727 (Contributor Author)

There are a few earlier bug fixes/features which added configuration options. Those are just the bug fixes that resulted in a configuration option; many more were fixed without adding one.

Overall, I think that the only justification for a configuration added as a bug fix is preservation of backwards compatibility, and I don't think that justification works here. Nobody is relying on the OOM exception as part of their workflow, and it has no impact on the actual data that's being transferred.

This is just the first dialect to have this fix applied, and applying it to each of the others is simple. The only reason we're not adding them now is the cost of testing.
The last thing I'd want is to add a configuration that is essentially a "throw OOM exception" enable/disable option, only to have it become obsolete after we make that investment and figure out which dialects can support the fix.

Let me know your thoughts @javabrett, and thanks for your review!

@gharris1727 merged commit 6b80b69 into confluentinc:5.0.x Jan 15, 2020
@gharris1727 deleted the postgres-oom branch January 15, 2020 20:43
@yohei1126

Hi, when will new Docker images that include this fix be published to Docker Hub?
https://hub.docker.com/r/confluentinc/cp-kafka-connect/tags
It looks like the latest and 5.4.0 tags were updated a day ago, but I do not know whether these tags include this fix.

@gharris1727 (Contributor Author)

This will not be released until the next patch release, so 5.3.3, 5.4.1, 5.5.0, or 6.0.0.
It will appear in docker images once those versions are released.
Until then you will need to build the connector from source in order to use this fix.

@kgrimsby commented Feb 5, 2020

Is this included in the latest confluent 5.4 deb packages?

@gharris1727 (Contributor Author)

@kgrimsby No, this will not be included in the 5.4 deb packages. It will be included in the deb packages of the versions I called out in my previous message.

@kgrimsby commented Feb 6, 2020

My bad, @gharris1727! Sorry for wasting your time.
