Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TDL 7402 Add retry logic for 500 errors #90

Conversation

prijendev
Copy link
Contributor

@prijendev prijendev commented Oct 12, 2021

Description of change

  • Added retry logic for following below errors,
    • ConnectionResetError
    • socket.timeout
    • HTTP Error 408: Request Timeout
    • HTTP Error 500: Internal Server Error
    • URL Error
    • ssl.SSLEOFError
    • 408 Exception
  • Retry above all errors until 60 seconds.
  • If after passing 60 seconds the same error comes, then it will be raised.

Manual QA steps

  • Raise the following error manually and check that it retries until 60 seconds.

Risks

Rollback steps

  • revert this branch

@singer-io singer-io deleted a comment from prijendev Nov 10, 2021
"""
Retry following errors for 60 seconds,
socket.timeout, ConnectionError, internal server error(500-range), SSLError, URLError, HTTPError(408), Transport errors.
Raise the error direclty for all errors except mentioned above errors.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Raise the error direclty for all errors except mentioned above errors.
Raise the error directly for all errors except mentioned above errors.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

"""
Retry following errors for 60 seconds,
socket.timeout, ConnectionError, internal server error(500-range), SSLError, HTTPError(408), Transport error.
Raise the error direclty for all errors except mentioned above errors.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Raise the error direclty for all errors except mentioned above errors.
Raise the error directly for all errors except mentioned above errors.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

@@ -650,7 +704,8 @@ async def poll_report(client, account_id, report_name, start_date, end_date, req
report_name,
start_date,
end_date))
response = client.PollGenerateReport(request_id)
# As in async method backoff does not work directly we created seprate method to handle it.
Copy link

@savan-chovatiya savan-chovatiya Dec 10, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# As in async method backoff does not work directly we created seprate method to handle it.
# As in the async method backoff does not work directly we created a separate method to handle it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

after_time = datetime.datetime.now()
time_difference = (after_time - before_time).total_seconds()
# verify the code backed off for 60 seconds
# time_difference should be greater or equall 60 as some time elapsed while calculating `after_time`

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# time_difference should be greater or equall 60 as some time elapsed while calculating `after_time`
# time_difference should be greater or equal 60 as some time elapsed while calculating `after_time`

This typo is present at multiple places so replace it with equal.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

'''
Test that tap retry for 60 seconds on the Transport error.
'''
with open('tests/base.py') as f:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file handler is unused in this test and also some other tests too. Please remove those.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

return True
elif isinstance(exception, suds.transport.TransportError) or isinstance(exception, socket.timeout):
return True
elif type(exception) == URLError:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to retry for all URLErrors?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Because here suds SDK is used and in some methods, it internally catches many errors including connection reset error, timeout error, etc, and raises URLError of it.

return True
elif type(exception) == URLError:
return True
elif type(exception) == Exception and exception.args[0][0] == 408:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add what is 408 and why we are retrying in the comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A 408 Request Timeout is an HTTP response status code that indicates the server didn't receive a complete request message within the server's allotted timeout period. A suds SDK catches HTTPError with status code 408 and raises Exception: (408, 'Request Timeout'). That's why we retrying this error also. Added comment in the code, also.

return True
elif type(exception) == Exception and exception.args[0][0] == 408:
return True
elif exception.code == 408:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please merge this one and above condition with or

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Merged it.

self.error = error
self.count = 0

def GetCampaignsByAccountId(self, AccountId, CampaignType):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please provide a small comment for each function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added comments.

try:
if isinstance(exception, ConnectionError) or isinstance(exception, ssl.SSLError):
return True
elif isinstance(exception, suds.transport.TransportError) or isinstance(exception, socket.timeout):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why can't we merge the elif statements in the if statement?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Merged elif statements in the if statement condition except condition for error code of 408 because of better readability.

(Exception),
giveup=lambda e: not should_retry_httperror(e),
max_time=60, # 60 seconds
factor=2)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@prijendev How many times will the exception retry because of this backoff?
Can you please add comment above this exception regarding the behavior of backoff mechanism

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, we have implemented backoff based on max_time. It retries errors until 60 seconds(7 or 8 times retry perform). If after passing 60 seconds the same error comes, then it will be raised. Added comment in the code also.

Copy link
Contributor

@kspeer825 kspeer825 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a suggestion for the try catch pattern.

Comment on lines 74 to 77
try:
tap_bing_ads.sync_accounts_stream(['i1'], {})
except ConnectionResetError:
pass
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this pattern could produce a false negative, if the method were to take more than 60 seconds and succeed rather than raise the specified error. I think this can be fixed by raising an assertion error inside the try after the target method is called.

@prijendev prijendev merged commit 2dda2c2 into TDL-10048-Improve-BingAds-Reliability Dec 21, 2021
KrisPersonal pushed a commit that referenced this pull request Dec 29, 2021
* TDL-16888 Added code comments (#88)

* Added code comments

* Updated code comments

* UPdated comment

* TDL 7402 Add retry logic for 500 errors (#90)

* Initial commit for backoff error handling

* Added usage of error handling function

* Added detail comment for unittest cases

* Added coverage report to artifact

* added backoff for Expection(408)

* Updated backoff for async method

* updated backoof for poll_report

* Updated max_time

* Updated unittest cases

* removed Exception from backoff

* Updated backoff

* resolved unittest case errro

* Updated typo in comments

* Added test cases for Exception with error code 408

* Added comment for 408 error

* Added comments

* Updated unit test cases for 400 error

* Updated unit test case for create sdk client

* Merged elif condition

* Updated unit test cases

* Tdl 9721 missing top level breadcrumb for all streams (#91)

* TDL-9721: Updated metadata creation method and added top level breadcrum and un commented the breadcrum test case code.

* TDL-9721: Added top level breadcrumb.

* TDL-9721: Commented the code wich uncommented by mistake and failing tests

* TDL-9721: Removed unused changes

* TDL-9721: Updated automatic fields for report streams

* TDL-9721: Optimized the code of making replication key automatic

* TDL-9721: Added Unit tests with coverage

* TDL-9721: Added pip install nose

* TDL-9721: Updated file name

* TDL-9721: Updated config.yml

* TDL-9721: Added missing fieldExclusion

* TDL-9721: Added code comment

* TDL-9721: Updated the primary key of account stream as earlier.

* TDL-9721: Updated primary key for account stream in tests to make it same as catalog.

* TDL-9721: Updated primary key in catalog for account stream.

* TDL-9721: Updated the replication keys and test cases.

* TDL-9721: Updated replication key for discover mode only as other test using replication key.

* TDL-6737 added pylint and resolved the errors (#87)

* added pylint and resolved the errors

* added pylint in the setup.py

* corrected the pylint command

* added package name

* replaced dev with test in pip install

* resolved the comments

* removed f-string

* Added covergae report to artifact

* removed unittest case report

* removed config file changes

* revert back config.yml file changes

* Added back pylint in circleci

* resolved pylint error

* resolved comments

* removed str

* removed logger statements

* added consider-using-f-string in config.yml

* resolved comments

* resolved pylint errors

* pylint fixtures

Co-authored-by: prijendev <prijen.khokhani@crestdatasys.com>

* TDL-15251: Add required fields for age_gender_audience_report (#86)

* added required fields for age_gender_audience_report

* Added coverage report to artifact

* revert back coverage report changes

* revert back config.yml file changes

* resolved comments

* added missing assertions

* resolved cci failures

Co-authored-by: namrata270998 <namrata.brahmbhatt@crestdatasys.com>

* Tdl 7402 add retry logic for 500 errors (#96)

* Initial commit for backoff error handling

* Added usage of error handling function

* Added detail comment for unittest cases

* Added coverage report to artifact

* added backoff for Expection(408)

* Updated backoff for async method

* updated backoof for poll_report

* Updated max_time

* Updated unittest cases

* removed Exception from backoff

* Updated backoff

* resolved unittest case errro

* Updated typo in comments

* Added test cases for Exception with error code 408

* Added comment for 408 error

* Added comments

* Updated unit test cases for 400 error

* Updated unit test case for create sdk client

* Merged elif condition

* Updated unit test cases

* Tdl 15861 implement request timeout (#93)

* TDL-15861 implement request timeout

* added omment

* removed unused imports

* added comment

* resolved confilcts with the base

* resolved conflicts

* resolved conflicts

* resolved comments

* updated unittests

* added timeout for get calls

* corrected a typo in comment

* resolved comments

* resolved comments

Co-authored-by: prijendev <prijen.khokhani@crestdatasys.com>

Co-authored-by: prijendev <prijen.khokhani@crestdatasys.com>

* TDL-9902 removed the automatic fields from ad ext report (#85)

* TDL-9902 removed the automatic fields from ad ext report

* Added coverage report in artifact

* revert back config.yml file

* resolved comments

Co-authored-by: namrata270998 <namrata.brahmbhatt@crestdatasys.com>

Co-authored-by: prijendev <88327452+prijendev@users.noreply.github.com>
Co-authored-by: dbshah1212 <35164219+dbshah1212@users.noreply.github.com>
Co-authored-by: prijendev <prijen.khokhani@crestdatasys.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants