Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Core: Fix retry behavior for Jdbc Client #7561

Closed

Conversation

amogh-jahagirdar
Copy link
Contributor

@amogh-jahagirdar amogh-jahagirdar commented May 8, 2023

Fixes #7173

There are certain SqlExceptions which are currently not being retried in JdbcClient but they should be retried based on their SqlState.

Although this implementation is really only an incremental improvement, because currently the retry behavior in ClientPoolImpl is hardcoded to retry only once after the initial request and without any backoff. Also currently, only a single exception can be passed in for retries which is a bit limiting.

I think there's a good amount of room for refactoring ClientPoolImpl to enable custom amount of retries and multiple exceptions similar to our Tasks implementation.

Right now, the way retry can be tuned is by overriding isConnectionException (same as what is done in HiveClientPoolImpl). I think we can take refactoring separately since it'll be a bit more intrusive and in the interim at least we can incrementally improve the behavior.

Furthermore, the exception that is currently being retried SqlNonTransientConnectionException should not be retried. The client should be retrying SqlTransientConnectionException, so this behavior has been fixed.

cc: @jackye1995 @singhpk234 @nastra @szehon-ho

@github-actions github-actions bot added the core label May 8, 2023
Comment on lines +37 to +38
private static final Set<String> RETRYABLE_CONNECTION_SQL_STATES =
ImmutableSet.of("08000", "08003", "08006", "08004", "08007");
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should probably link what these SqlStates are in a code comment.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason for including

  • 08004 As per doc look like it's due to sql rejecting the connection. refering this doc of pg-jdbc driver mentioned in description above doc :
   * The server rejected our connection attempt. Usually an authentication failure, but could be a
   * configuration error like asking for a SSL connection with a server that wasn't built with SSL
   * support.
   */
  CONNECTION_REJECTED("08004"), 

Copy link
Contributor

@jackye1995 jackye1995 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good, would be great to add comments about the meaning of error codes, not sure if there are any library to directly import these values. Also can we add some tests?

Copy link
Contributor

@singhpk234 singhpk234 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for the change and making client pool more configurable, linking previous discussions on the same as well #3099

Comment on lines +37 to +38
private static final Set<String> RETRYABLE_CONNECTION_SQL_STATES =
ImmutableSet.of("08000", "08003", "08006", "08004", "08007");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason for including

  • 08004 As per doc look like it's due to sql rejecting the connection. refering this doc of pg-jdbc driver mentioned in description above doc :
   * The server rejected our connection attempt. Usually an authentication failure, but could be a
   * configuration error like asking for a SSL connection with a server that wasn't built with SSL
   * support.
   */
  CONNECTION_REJECTED("08004"), 

@cccs-eric
Copy link
Contributor

Me and my co-worker just raised in Iceberg General Slack the same issue that this PR is fixing. Any timeline on when this could be merged in? And also, will this fix work for when a Postgresql database has closed an idle connection (so a reconnect will be done)?

cccs-eric added a commit to CybercentreCanada/iceberg that referenced this pull request Jun 12, 2023
cccs-eric added a commit to CybercentreCanada/iceberg that referenced this pull request Jun 12, 2023
@cccs-eric
Copy link
Contributor

I tested this commit inside our fork to see if it would fix our idle connection timeout problem with Postgresql. Turns out the SQLState returned by Postgres (57000) is not part of the SQL standard. Here is the list of possible error codes in Postgresql.

I think the current PR is a step in the right direction, but limiting the support to a handful of standard SQL states will not address every use-case (read database vendor). Since the JDBCCatalog is written to work with any database supported by JDBC, we need to provide a configuration option to extend the list of SQLState related to retryable connection exceptions.

The configurable list would be vendor specific since for example 57000 is a retryable exception for Postgresql, but maybe not for Oracle or MySQL. SQLStates are not vendor assigned, so they can be reused.

@amogh-jahagirdar , do you think it would make sense to add a configurable option for a list of retryable connection exception codes?

cccs-eric added a commit to CybercentreCanada/iceberg that referenced this pull request Jun 22, 2023
cccs-eric added a commit to CybercentreCanada/iceberg that referenced this pull request Jun 23, 2023
@cccs-jc
Copy link
Contributor

cccs-jc commented Nov 2, 2023

@amogh-jahagirdar, Is there any progress on making the error codes configurable. This would avoid forks just to add code specific to for example posgresql.

@cccs-br
Copy link

cccs-br commented Feb 5, 2024

Since the JdbcCatalog provides the means to specify your own JdbcClientPool by providing a client pool builder, one could achieve the desired behavior by extending the JdbcCatalog and overriding the constructor without requiring any direct changes to the current implementation of the ClientPoolImpl.

For example, something along those line:

package org.abc.catalogs;

import java.sql.SQLException;
import java.util.List;
import java.util.Map;

import org.apache.iceberg.jdbc.JdbcCatalog;
import org.apache.iceberg.jdbc.JdbcClientPool;

public class CustomJdbcCatalog extends JdbcCatalog {

    private static final List<String> RETRY_EXCEPTION_STATE = List.of("08006", "57000");

    public CustomJdbcCatalog() {
        super(null, (Map<String, String> props) -> {
            String uri = props.get("uri");
            return new JdbcClientPool(uri, props) {
                @Override
                protected boolean isConnectionException(Exception exc) {
                    boolean isRetry = super.isConnectionException(exc);
                    if (!isRetry && exc instanceof SQLException sqle) {
                        isRetry = RETRY_EXCEPTION_STATE.contains(sqle.getSQLState());
                    }
                    return isRetry;
                }
            };
        }, false);
    }
}

This gives you full control over the logic of the retry.

Then, if in the context of spark, you could simply specify org.abc.catalogs.CustomJdbcCatalog for the spark.sql.catalog.<catalog_name>.catalog-impl parameter.

Of course this is more complex than having the means to "configure" the retry logic, but it gets you where you are trying to go.

@jean-humann
Copy link

@nastra @szehon-ho Is there anything we can do to push this, please ?
Currently, we are forking Iceberg internally just to add this part to use Iceberg Rest Catalog with JDBC to workout properly.

@amogh-jahagirdar
Copy link
Contributor Author

Hey all @jean-humann @cccs-br @cccs-eric @cccs-jc sorry I forgot to follow up on this PR! I'm taking a look now to see how to carry this forward. It does seem like since there are codes which will not be part of the standard spec, but that still need to be retried , that we need a configurable list (since we wouldn't want to maintain such a list in Iceberg)

@amogh-jahagirdar
Copy link
Contributor Author

@jean-humann @cccs-br @cccs-eric @cccs-jc #10140 this is a draft PR for having a configurable set of retry codes. I'll need to write some tests for it but just a heads up since I don't think I'll be updating this branch, and would be closing this PR if there's consensus from others that having a user provided error code list is a good idea to solve the problems you all are facing! I think it is the most practical option at the moment until we separate catalogs from the reference implementation in the future https://www.youtube.com/watch?v=uAQVGd5zV4I&t=1323s

Copy link

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Aug 30, 2024
@jfuerth
Copy link

jfuerth commented Aug 30, 2024

My team hit this issue specifically with PostgreSQL today. Would love it if we could at least get something in that works for Postgres in particular, since that seems to be the most common thing that brings people here.

Please let me know how we can help move this forward!

Edit: I see #10140 is an updated version of this PR and it has been merged. Thanks @amogh-jahagirdar for seeing this through! Much appreciated.

@github-actions github-actions bot removed the stale label Aug 31, 2024
Copy link

github-actions bot commented Oct 4, 2024

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Oct 4, 2024
Copy link

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions github-actions bot closed this Oct 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Does JDBC connector uses any retry mechanism?
8 participants