Skip to content

Bump aws-sdk-cpp to 1.18.105 #1226

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 3 commits into from
Closed

Conversation

vnghia
Copy link
Contributor

@vnghia vnghia commented Dec 14, 2020

This PR:

  • Bump aws-sdk-cpp to 1.18.105
  • Remove unused code for getting s3 region since it will be done automatically by aws-sdk-cpp
  • Enable ENABLE_CURL_LOGGING
  • Support more http code for TF_SetStatusFromAWSError
  • I am introducing a breaking change here. RecursivelyCreateDir now calls CreateDir instead of relying on TensorFlow fallback. ( Not really a breaking change though )

Part of #1183

Comment on lines 76 to 78
case Aws::Http::HttpResponseCode::REQUESTED_RANGE_NOT_SATISFIABLE:
TF_SetStatus(status, TF_OUT_OF_RANGE, "Read less bytes than requested");
break;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if this is correct or not since the old code already supported REQUESTED_RANGE_NOT_SATISFIABLE = 416. WDYT @yongtang ?

// SECTION 4. Implementation for `TF_Filesystem`, the actual filesystem
// ----------------------------------------------------------------------------
// TODO: use_multi_part_download=true will cause an issue when read request
// is asking for bytes larger than object size. In that case, a status code of
// 416 is returned. However, 416 is not handled correctly and s3 thought this
// is an error - should return OUT_OF_RANGE with less bytes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vnvo2409 The switch is controlled in:

use_multi_part_download(false), // TODO: change to true after fix

You can change the default value to true and see if the issue is resolved.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am investigating this issue tensorflow/tensorflow#44564. Let's leave it for another PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yongtang

tensorflow/tensorflow#44564 is resolved. We could enable this option

@vnghia
Copy link
Contributor Author

vnghia commented Dec 16, 2020

@yongtang
I could not understand why the test keeps falling with segfault and I could not reproduce it locally. Do you have some hints why it happens ? Thank you very much !

@yongtang
Copy link
Member

@vnvo2409 The crash could be caused by duplicate symbols between aws and arrow library? Though I am not quite sure. It seems to be happening on Linux only.

Could you try split the PR into two PRs with one PR for aws-sdk-cpp to 1.18.105 and another PR for other changes like Enable ENABLE_CURL_LOGGING, Support more http code for TF_SetStatusFromAWSError? This can help us locate the cause of the issue better.

In the meantime, I am trying to update the arrow library from 0.16.0 to 2.0.0 in #1231. We need to update arrow library anyway so I think it makes sense to see if updating arrow will resolve the aws+arrow issue.

@vnghia vnghia force-pushed the s3-improvements branch 2 times, most recently from 7ba27f5 to ad0194c Compare December 17, 2020 00:03
@vnghia
Copy link
Contributor Author

vnghia commented Dec 17, 2020

I already rebase to pickup the arrow update but it still failed. Do you have any idea what I should do next @yongtang ? Thank you very much !

@yongtang
Copy link
Member

@vnvo2409 The Linux build on GitHub Actions is in:
https://github.com/tensorflow/io/blob/master/.github/workflows/build.yml#L222-L227

There are two difference in between the GitHub Actions and the normal bazel build.

  1. GitHub Actions utilize the optimization mode with --copt=-msse4.2 --copt=-mavx --compilation_mode=opt passed to bazel
  2. GitHub Actions utilize the docker image from tensorflow's build image which is Ubuntu16.04+ customized gcc from developer toolset.

I tends to believe optimization (1) is the factor though I cannot exclude (2) yet. Will take a look.

@vnghia
Copy link
Contributor Author

vnghia commented Dec 19, 2020

0f3e1a4 fixes a ssl support problem on Mac because USE_DARWINSSL is renamed to USE_SECTRANSP https://github.com/curl/curl/blob/ccbdbe13c43f15a637afdd1ac4bf5df62c25a235/CMakeLists.txt#L365-L367

@vnghia
Copy link
Contributor Author

vnghia commented Dec 20, 2020

@yongtang

I build it separately and it seems fine without any issue. I doubt that it somehow conflicts with the s3 inside TensorFlow. On my machine, sometime it works and sometime it throws segfaul 😕. I am building a custom wheel without s3 to see if the conflict is the problem.

@yongtang
Copy link
Member

@vnvo2409 Ah that might explain the segment as there are actually two copies of AWS C++ sdk. One is in tensorflow and another is in tensorflow-io. And they are with different versions. If the AWS C++ SDK on both tensorflow-io and tensorflow are updated to the same version then it might be OK (as two copies with same version).. Or if we remove the one from tensorflow (so that there is only one version) then it might also be OK I guess.

@vnghia vnghia mentioned this pull request Dec 23, 2020
@kvignesh1420
Copy link
Member

@vnvo2409 the PR by @yongtang : (#1241) to exclude external libraries has been merged. Can you please check and proceed with this PR?

@vnghia
Copy link
Contributor Author

vnghia commented Feb 8, 2021

Sorry. I'm still stuck at the segfault. I think I will try this PR again when aws is completely removed out of Tensorflow.

@vnghia vnghia closed this Feb 8, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants