
Fix Postgres OOM by limiting fetch size #758


Merged: 8 commits merged into confluentinc:5.0.x on Jan 15, 2020

Conversation

gharris1727 (Contributor)

This PR adds an integration test with an embedded postgres instance.
There are three tests:

  1. Verify that the test setup produces OOM exceptions with large batch sizes.
  2. Verify that the test setup does not produce OOM exceptions with small batch sizes.
  3. Verify that tables are not left locked when autoCommit is disabled.

Test 3 addresses a side-effect of autoCommit that was called out in #466, where read-only queries with autoCommit disabled would leave open locks on the database. This test verifies that the JdbcSourceTask does not leave open locks from any operations.
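
Roughly, the lock check in that third test can be sketched like this (the JDBC URL, table name, and setup details below are illustrative placeholders, not the exact test code): after the task polls, a separate connection queries pg_locks for any locks still held on the table.

import static org.junit.Assert.assertEquals;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import org.junit.Test;

public class NoLingeringLocksSketch {

  // Placeholders for the embedded-postgres URL and the table under test.
  private static final String JDBC_URL = "jdbc:postgresql://localhost:5432/postgres";
  private static final String TABLE = "test_table";

  @Test
  public void taskDoesNotLeaveTableLocked() throws Exception {
    // ... start the JdbcSourceTask and call task.poll() here ...

    // Then verify from a separate connection that no locks remain on the table.
    try (Connection conn = DriverManager.getConnection(JDBC_URL);
         PreparedStatement stmt = conn.prepareStatement(
             "SELECT count(*) FROM pg_locks l"
                 + " JOIN pg_class c ON l.relation = c.oid"
                 + " WHERE c.relname = ?")) {
      stmt.setString(1, TABLE);
      try (ResultSet rs = stmt.executeQuery()) {
        rs.next();
        assertEquals(0, rs.getInt(1)); // no locks left behind by the source task
      }
    }
  }
}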

#466 specifically observed that the TableMonitorThread would leave open locks if auto-commit was disabled on that connection. This PR circumvents that issue by leaving the TableMonitorThread connection as-is, with autoCommit enabled. There are no tests covering the Connector or TableMonitorThread, because the behavior of those classes is unaffected by these changes. A connector test would also have some extra complication as the TableMonitorThread is asynchronous.

Fixes #34.

@gharris1727 requested a review from a team December 5, 2019 23:10
@gharris1727 changed the base branch from master to 5.0.x December 5, 2019 23:11
@levzem left a comment

Two minor comments, but otherwise LGTM.

@gharris1727 requested a review from levzem December 5, 2019 23:39
@levzem left a comment

LGTM

@gharris1727 (Contributor Author)

Note that io.zonky.test:embedded-postgres is Apache 2.0 licensed.

@javabrett self-requested a review December 7, 2019 01:20
@wicknicks (Member) left a comment

nice work, @gharris1727. some comments/questions.


  task = new JdbcSourceTask();
  task.start(props);
}
Member

do we need to test what happens when the table monitor thread is running in the same JVM as these tasks that are setting auto commit to false?

Contributor Author

The table monitor thread has essentially the same relationship to this task as the other tasks from the JDBC connector do, and we don't see any issues from this in production.
This is already effectively tested by running an existing source and sink connector in the same JVM, since sink connectors disable autocommit.

 * @param stmt the prepared statement; never null
 * @throws SQLException the error that might result from initialization
 */
@Override
protected void initializePreparedStatement(PreparedStatement stmt) throws SQLException {
  log.trace("Initializing PreparedStatement fetch direction to FETCH_FORWARD for '{}'", stmt);
  stmt.setFetchDirection(ResultSet.FETCH_FORWARD);
  stmt.setFetchSize(config.getInt(JdbcSourceConnectorConfig.BATCH_MAX_ROWS_CONFIG));
Member

For Postgres specifically, the original issue found that there is a Postgres JDBC URL parameter, defaultRowFetchSize, which could be set but was ineffective, since it relies on autoCommit being disabled, which cannot be done via the URL and is only settable in code.

This change sets fetchSize in code (unconditionally), based on the existing BATCH_MAX_ROWS_CONFIG, which will default to 100 if absent.
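
For reference, the Postgres driver only switches from buffering the whole result set to cursor-based streaming when autoCommit is off, the statement is forward-only, and a non-zero fetch size is set. Roughly, in plain JDBC (the URL and table below are placeholders, not code from this PR):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class PostgresStreamingSketch {
  public static void main(String[] args) throws Exception {
    // Placeholder URL; defaultRowFetchSize could be appended here, but it only
    // takes effect once the other conditions below are also met.
    try (Connection conn = DriverManager.getConnection(
        "jdbc:postgresql://localhost:5432/postgres")) {
      conn.setAutoCommit(false); // required for cursor-based fetching

      try (PreparedStatement stmt = conn.prepareStatement(
               "SELECT * FROM big_table",
               ResultSet.TYPE_FORWARD_ONLY,
               ResultSet.CONCUR_READ_ONLY)) {
        stmt.setFetchSize(100); // rows buffered per round trip, not per query

        try (ResultSet rs = stmt.executeQuery()) {
          while (rs.next()) {
            // Rows arrive 100 at a time instead of the whole result set at
            // once, which bounds the driver-side memory footprint.
          }
        }
      }
      conn.commit(); // end the transaction so no locks are left open
    }
  }
}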

A couple of questions:

  • This is only being added for Postgres, where there is the URL option defaultRowFetchSize - do other databases, which may or may not have a JDBC URL fetch-size option, possibly also suffer from this issue? Should we check for any other common databases that could have this issue and no URL option?
  • Does it matter that this will now always silently ignore any defaultRowFetchSize set in the URL? That might be confusing.

Contributor Author

  1. As far as I understand, MySQL has similar behavior, except that its URL workaround is effective. I'm sure there's at least one other database we don't know about that has this same issue and would require the same fix.

     However, without the ability to integration-test this functionality, I don't feel comfortable making changes to those dialects. Even for Postgres, I needed to find a way to integration-test this change before I felt comfortable putting it up as a PR. The cost of adding integration tests for the other dialects is also higher than we can afford right now, and beyond the scope of this fix.

  2. If someone included this URL option without knowing what it does, then I don't think they would know or care that it is being ignored. So I'll assume that anyone using this URL option put it there deliberately.

     If the URL option is higher than the batch size (perhaps set by someone trying to optimize throughput), then it is effectively the same as if it were equal to the batch size, except with higher memory usage. This is because the connector will still read at most the configured maximum batch size of rows from the result set, regardless of the fetch size.

     If the URL option is lower than the batch size (perhaps set by someone trying to optimize memory), then they would notice that it didn't reduce memory usage. This is non-obvious, but if they're optimizing memory usage, why would they change this URL parameter before reducing the batch size? Tuning the batch size gives effectively the same memory behavior they want, but it is supported by the connector itself and is documented behavior, rather than depending on this pass-through driver tuning parameter.
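
To make that interaction concrete, here is a rough sketch of the connector-side read loop (the names and types below are illustrative, not the connector's actual code): the fetch size only bounds how many rows the driver buffers per round trip to the server, while the batch size bounds how many rows are read from the ResultSet per poll.

import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

public class BatchVsFetchSizeSketch {

  // Hypothetical record type standing in for Kafka Connect's SourceRecord.
  static class Row { }

  // Hypothetical converter standing in for the connector's row-to-record mapping.
  static Row convert(ResultSet rs) {
    return new Row();
  }

  // The driver buffers at most fetchSize rows per server round trip, but this
  // loop stops after batchMaxRows rows per poll regardless of the fetch size,
  // so a fetch size larger than the batch size only adds memory usage.
  static List<Row> poll(ResultSet rs, int batchMaxRows) throws SQLException {
    List<Row> records = new ArrayList<>();
    while (records.size() < batchMaxRows && rs.next()) {
      records.add(convert(rs));
    }
    return records;
  }
}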

@javabrett (Member) left a comment

I remain uneasy about the setFetchSize logic being applied only for Postgres, or it only being possible to apply it for Postgres. If there's a concern that this might not be safe to apply to all databases/drivers, then we could find a way to not auto-apply it from the batch size (although that logic is attractive). I guess I'm curious: what are the effective batch size, fetch size, and memory consumption of other databases? Maybe they are automatically using forward-scrolling cursors by default?

It feels like the autoCommit change could be safely applied by making this a new option; then any behavioural change would be isolated to situations where it is required, e.g. Postgres, where it would be documented.

I think the change is good and the above thoughts are take-or-leave only.
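
For illustration only, such an opt-in override could look something like the following sketch using Kafka's ConfigDef (the option name, default, and wiring below are hypothetical; the PR as written does not add such a config):

import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.common.config.ConfigDef.Importance;
import org.apache.kafka.common.config.ConfigDef.Type;

public class FetchSizeConfigSketch {

  // Hypothetical option name; not part of this PR.
  public static final String FETCH_SIZE_CONFIG = "fetch.size";

  public static ConfigDef baseConfigDef() {
    return new ConfigDef()
        .define(
            FETCH_SIZE_CONFIG,
            Type.INT,
            0, // 0 = leave the driver default / URL parameter in effect
            Importance.LOW,
            "Number of rows to fetch per driver round trip; 0 disables the override.");
  }
}

A default of 0 would keep today's behaviour unless a user explicitly opted in.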

@gharris1727 (Contributor Author)

There are a few earlier bug fixes/features which added configuration options. Those are just the bug fixes that resulted in a configuration option; many more were fixed without adding one.

Overall, I think that the only justification for a configuration added as a bug fix is preservation of backwards compatibility, and I don't think that justification works here. Nobody is relying on the OOM exception as part of their workflow, and it has no impact on the actual data that's being transferred.

This is just the first dialect to have this fix applied, and applying it to each of the others is simple. The only reason we're not adding them now is the cost of testing.
The last thing I'd want is to add a configuration that is essentially a "throw OOM exception" enable/disable option, only to have it become obsolete after we make that investment and figure out which dialects can support the fix.

Let me know your thoughts @javabrett, and thanks for your review!

@gharris1727 merged commit 6b80b69 into confluentinc:5.0.x Jan 15, 2020
@gharris1727 deleted the postgres-oom branch January 15, 2020 20:43
@yohei1126

Hi, when will new Docker images that include this fix be published to Docker Hub?
https://hub.docker.com/r/confluentinc/cp-kafka-connect/tags
It looks like the latest and 5.4.0 tags were updated a day ago, but I do not know whether these tags include this fix.

@gharris1727 (Contributor Author)

This will not be released until the next patch release, so 5.3.3, 5.4.1, 5.5.0, or 6.0.0.
It will appear in docker images once those versions are released.
Until then you will need to build the connector from source in order to use this fix.

@kgrimsby commented Feb 5, 2020

Is this included in the latest confluent 5.4 deb packages?

@gharris1727 (Contributor Author)

@kgrimsby No, this will not be included in the 5.4 deb packages. It will be included in the deb packages of the versions I called out in my previous message.

@kgrimsby commented Feb 6, 2020

My bad, @gharris1727! Sorry for wasting your time.
