Parallel download #85
base: main
Conversation
* reduces download time from 30s to 14s in simple testing
src/entitysdk/client.py (outdated)
```diff
-        for path in contents.files:
-            paths.append(
-                self.download_file(
+        with concurrent.futures.ThreadPoolExecutor(max_workers=max_concurrent) as executor:
```
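For context, a minimal sketch of what the parallel path inside download_directory might look like, assuming only the names visible in this diff (contents.files, self.download_file, max_concurrent); the remaining arguments to download_file are not shown here and are left out:

```python
import concurrent.futures

# Sketch only: submit one download per file and collect the results as they finish.
paths = []
with concurrent.futures.ThreadPoolExecutor(max_workers=max_concurrent) as executor:
    futures = [
        # any further download_file arguments from the original call are omitted here
        executor.submit(self.download_file, asset_path=path)
        for path in contents.files
    ]
    for future in concurrent.futures.as_completed(futures):
        paths.append(future.result())  # re-raises any exception from a failed download
```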
Could you move the parallel logic into entitysdk/utils/parallel.py? Then you could have something like map_threads(it, max_workers=n), where:
it = (self.download_file(..., asset_path=path) for path in content_files)
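A minimal sketch of what such a helper could look like, assuming (slightly differently from the pseudocode above) that it receives zero-argument callables rather than a lazy generator of download_file calls, since draining such a generator from worker threads would effectively serialize the downloads on the generator itself:

```python
# entitysdk/utils/parallel.py (hypothetical module, as suggested above)
import concurrent.futures
from collections.abc import Callable, Iterable
from typing import TypeVar

T = TypeVar("T")


def map_threads(callables: Iterable[Callable[[], T]], max_workers: int = 4) -> list[T]:
    """Run zero-argument callables in a thread pool, returning results in input order."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(lambda call: call(), callables))
```

The call site would then be roughly it = (functools.partial(self.download_file, asset_path=path) for path in content_files) followed by map_threads(it, max_workers=max_concurrent).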
I think it's better to leave it where it is; this isn't a map function, btw, as the returns aren't in a particular order.
Bad choice of a name, then. But I still wouldn't want to cram things that can be separated into one place, if possible.
map_unordered_threads maybe?
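A sketch of the unordered variant under that name, yielding results as they complete (again assuming the input is an iterable of zero-argument callables; this helper does not exist in entitysdk today):

```python
import concurrent.futures
from collections.abc import Callable, Iterable, Iterator
from typing import TypeVar

T = TypeVar("T")


def map_unordered_threads(callables: Iterable[Callable[[], T]], max_workers: int = 4) -> Iterator[T]:
    """Run zero-argument callables in a thread pool, yielding results in completion order."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(call) for call in callables]
        for future in concurrent.futures.as_completed(futures):
            yield future.result()
```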
```diff
@@ -349,8 +350,9 @@ def download_directory(
     output_path: os.PathLike,
     project_context: ProjectContext | None = None,
     ignore_directory_name: bool = False,
+    max_concurrent: int = 4,
```
Should we make it serial by default to avoid surprises?
I was thinking of having max_concurrent not use threads at all, in case it becomes a problem, but I don't foresee it being one (famous last words).
I think doing it in parallel by default is worthwhile, personally.
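A minimal sketch of that escape hatch, assuming a hypothetical guard inside download_directory (not part of this PR):

```python
if max_concurrent <= 1:
    # Hypothetical serial path: no thread pool is created at all.
    paths = [
        self.download_file(asset_path=path)  # remaining arguments omitted
        for path in contents.files
    ]
else:
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_concurrent) as executor:
        ...  # parallel path as in the diff above
```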
Just to be aware, if we integrate this in a service using fastapi (obi-one), and we have N concurrent requests served by sync endpoints, then we'll create up to N*max_concurrent additional threads (besides the N needed by the fastapi pool, 40 by default according to https://www.starlette.io/threadpool/).
Do you think that's a problem? obi-one can limit max_concurrent if it wants.
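For example, a sync endpoint in such a service could opt out per call (sketch: client and tmp_dir are placeholders, and the arguments identifying what to download are elided since they are not shown in this diff):

```python
client.download_directory(
    ...,  # identifying arguments elided (not shown in this diff)
    output_path=tmp_dir,
    max_concurrent=1,  # serial download: no extra threads per request
)
```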
Yeah, but bringing a service to a halt/crash in order to realize that max_concurrent has to be lowered in download_directory is not great, is it?
I would be surprised if the download threads are what pushes it over the edge; assuming all 40 threads each spawn another 4, that's a limit of 160 threads, which is high but not unheard of.