Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

In FlyteRemote.upload_file, pass the file object directly rather than the entire bytes buffer #2641

Merged
merged 19 commits into from
Aug 6, 2024

Conversation

redartera
Copy link
Contributor

@redartera redartera commented Aug 1, 2024

Why are the changes needed?

In the upload_file function, we read the entire file's content and hold onto it in memory - this is problematic when dealing with large files. Moreover, there is an issue where requests.put can raise an OverflowError exception when dealing with large byte buffers.

For a demo replicating the issue and demonstrating the fix, please refer to this comment.

Setup: Python 3.9, flytekit 1.12.3 and Ubuntu 22.04

What changes were proposed in this pull request?

The proposed changes follow the recommendation from the official requests documentation

Additionally, we add an integration test for FlyteRemote.upload_file that is run separately from the rest of our integration tests, using a new lftransfers pytest marker. Changes were made to our Makefile, our pythonbuild.yml as well as the markers section of pyproject.toml accordingly.

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Related PRs


Signed-off-by: Reda Oulbacha <reda@artera.ai>
@redartera
Copy link
Contributor Author

any recommendations on how this change might be tested in a more systematic way?

@pingsutw
Copy link
Member

pingsutw commented Aug 1, 2024

@redartera Thank you so much for fixing it.

any recommendations on how this change might be tested in a more systematic way?

We can register a task with a large file in the integration test.

def register():
out = subprocess.run(
[
"pyflyte",
"--verbose",
"-c",
CONFIG,
"register",
"--image",
IMAGE,
"--project",
PROJECT,
"--domain",
DOMAIN,
"--version",
VERSION,
MODULE_PATH,
]
)

btw, this is awesome. I didn't know request supports streaming uploads. It will reduce lots of memory overhead at compile time.

Copy link

codecov bot commented Aug 2, 2024

Codecov Report

Attention: Patch coverage is 0% with 3 lines in your changes missing coverage. Please review.

Project coverage is 78.47%. Comparing base (72da0d0) to head (c5075ae).
Report is 4 commits behind head on master.

Files Patch % Lines
flytekit/remote/remote.py 0.00% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2641      +/-   ##
==========================================
- Coverage   78.77%   78.47%   -0.31%     
==========================================
  Files         187      187              
  Lines       19149    19151       +2     
  Branches     3993     3994       +1     
==========================================
- Hits        15085    15029      -56     
- Misses       3374     3436      +62     
+ Partials      690      686       -4     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

redartera and others added 3 commits August 2, 2024 01:08
Signed-off-by: Reda Oulbacha <reda@artera.ai>
Signed-off-by: Reda Oulbacha <reda@artera.ai>
Signed-off-by: redartera <120470035+redartera@users.noreply.github.com>
@redartera redartera marked this pull request as draft August 2, 2024 03:15
Signed-off-by: redartera <120470035+redartera@users.noreply.github.com>
@redartera redartera marked this pull request as ready for review August 2, 2024 03:24
@redartera
Copy link
Contributor Author

redartera commented Aug 2, 2024

@redartera Thank you so much for fixing it.

any recommendations on how this change might be tested in a more systematic way?

We can register a task with a large file in the integration test.

def register():
out = subprocess.run(
[
"pyflyte",
"--verbose",
"-c",
CONFIG,
"register",
"--image",
IMAGE,
"--project",
PROJECT,
"--domain",
DOMAIN,
"--version",
VERSION,
MODULE_PATH,
]
)

btw, this is awesome. I didn't know request supports streaming uploads. It will reduce lots of memory overhead at compile time.

sounds good - I added a test which writes a large file to a module then runs pyflyte register ... on that module.

Signed-off-by: redartera <120470035+redartera@users.noreply.github.com>
Signed-off-by: redartera <120470035+redartera@users.noreply.github.com>
@redartera redartera force-pushed the stream-upload-file-flyteremote branch from 5f68f36 to 618c067 Compare August 2, 2024 13:21
@redartera
Copy link
Contributor Author

@redartera Thank you so much for fixing it.

any recommendations on how this change might be tested in a more systematic way?

We can register a task with a large file in the integration test.

def register():
out = subprocess.run(
[
"pyflyte",
"--verbose",
"-c",
CONFIG,
"register",
"--image",
IMAGE,
"--project",
PROJECT,
"--domain",
DOMAIN,
"--version",
VERSION,
MODULE_PATH,
]
)

btw, this is awesome. I didn't know request supports streaming uploads. It will reduce lots of memory overhead at compile time.

sounds good - I added a test which writes a large file to a module then runs pyflyte register ... on that module.

seems like the GH runners are running out of disk space, I reduced the file size to 2 and 3gb while adding the @pytest.mark.serial marker to avoid running these tests in parallel and overloading the runner's disk.

pingsutw
pingsutw previously approved these changes Aug 2, 2024
Signed-off-by: Kevin Su <pingsutw@apache.org>
@eapolinario
Copy link
Collaborator

seems like the GH runners are running out of disk space, I reduced the file size to 2 and 3gb while adding the @pytest.mark.serial marker to avoid running these tests in parallel and overloading the runner's disk.

What if we added a separate pytest marker to run just this test?

Signed-off-by: redartera <120470035+redartera@users.noreply.github.com>
@redartera
Copy link
Contributor Author

seems like the GH runners are running out of disk space, I reduced the file size to 2 and 3gb while adding the @pytest.mark.serial marker to avoid running these tests in parallel and overloading the runner's disk.

What if we added a separate pytest marker to run just this test?

I like this approach - will find a cleaner way to adopt it

Signed-off-by: redartera <120470035+redartera@users.noreply.github.com>
Signed-off-by: redartera <120470035+redartera@users.noreply.github.com>
Signed-off-by: redartera <120470035+redartera@users.noreply.github.com>
Signed-off-by: redartera <120470035+redartera@users.noreply.github.com>
Signed-off-by: redartera <120470035+redartera@users.noreply.github.com>
Signed-off-by: redartera <120470035+redartera@users.noreply.github.com>
Signed-off-by: redartera <120470035+redartera@users.noreply.github.com>
Signed-off-by: redartera <120470035+redartera@users.noreply.github.com>
@redartera
Copy link
Contributor Author

redartera commented Aug 5, 2024

@eapolinario @pingsutw I made a few major changes to both the implementation and the integration test:

  • I now use stat following the recommendation of @eapolinario - works as expected.
  • I now use a separate lftransfers pytest marker also following the recommendation of @eapolinario. Changes were made to the Makefile, pythonbuild.yml and pyproject.toml to allow those tests to be ran separately from the rest of the integration tests
  • The test itself has changed, where I now simply test the upload_file method, thus making it easier to clean all objects which were created by the test. If i were to register a module and upload 2/3gb files each time, it would be difficult to cleanly delete all artifacts which were generated by each test, and the sandbox volume of users or GH runners would quickly become overloaded if our large files are persisted across test runs.

Signed-off-by: redartera <120470035+redartera@users.noreply.github.com>
Copy link
Collaborator

@eapolinario eapolinario left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

@eapolinario eapolinario merged commit 7b463da into flyteorg:master Aug 6, 2024
99 of 101 checks passed
mao3267 pushed a commit to mao3267/flytekit that referenced this pull request Aug 9, 2024
…an the entire bytes buffer (flyteorg#2641)

* pass the local file directly for streaming in FlyteRemote.upload_file

Signed-off-by: Reda Oulbacha <reda@artera.ai>

* ruff format

Signed-off-by: Reda Oulbacha <reda@artera.ai>

* add an integration test

Signed-off-by: Reda Oulbacha <reda@artera.ai>

* remove unnecessary len

Signed-off-by: redartera <120470035+redartera@users.noreply.github.com>

* redo registration in the integration test

Signed-off-by: redartera <120470035+redartera@users.noreply.github.com>

* fix misspel

Signed-off-by: redartera <120470035+redartera@users.noreply.github.com>

* run the integration test serially

Signed-off-by: redartera <120470035+redartera@users.noreply.github.com>

* disable agent

Signed-off-by: Kevin Su <pingsutw@apache.org>

* use os.stat instead of os.seek to determine content_length

Signed-off-by: redartera <120470035+redartera@users.noreply.github.com>

* rewrite tests only uploda a file, use a separate marker

Signed-off-by: redartera <120470035+redartera@users.noreply.github.com>

* parametrize integration test makefile cmd

Signed-off-by: redartera <120470035+redartera@users.noreply.github.com>

* add workflow_dispatch for debugging

Signed-off-by: redartera <120470035+redartera@users.noreply.github.com>

* replace trigger with push

Signed-off-by: redartera <120470035+redartera@users.noreply.github.com>

* remove trailing whitespaces

Signed-off-by: redartera <120470035+redartera@users.noreply.github.com>

* remove agent disabling

Signed-off-by: redartera <120470035+redartera@users.noreply.github.com>

* remove trailing debug CI trigger

Signed-off-by: redartera <120470035+redartera@users.noreply.github.com>

* clean up botocore imports

Signed-off-by: redartera <120470035+redartera@users.noreply.github.com>

---------

Signed-off-by: Reda Oulbacha <reda@artera.ai>
Signed-off-by: redartera <120470035+redartera@users.noreply.github.com>
Signed-off-by: Kevin Su <pingsutw@apache.org>
Co-authored-by: Kevin Su <pingsutw@apache.org>
Signed-off-by: mao3267 <chenvincent610@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants