Skip to content

Parallel download #85

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 4 commits into
base: main
Choose a base branch
from
Open

Parallel download #85

wants to merge 4 commits into from

Conversation

mgeplf
Copy link
Contributor

@mgeplf mgeplf commented Jun 25, 2025

No description provided.

Copy link

codecov bot commented Jun 25, 2025

Codecov Report

Attention: Patch coverage is 66.66667% with 2 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
src/entitysdk/client.py 66.66% 1 Missing and 1 partial ⚠️
Flag Coverage Δ
pytest 99.83% <66.66%> (-0.17%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
src/entitysdk/client.py 98.67% <66.66%> (-1.33%) ⬇️
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@mgeplf mgeplf requested a review from eleftherioszisis June 25, 2025 11:10
for path in contents.files:
paths.append(
self.download_file(
with concurrent.futures.ThreadPoolExecutor(max_workers=max_concurrent) as executor:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you move the parallel logic into entitysdk/utils/parallel.py?

then you could. have sth like map_threads(it, max_workers=n)

where:

it = (self.download_file(..., asset_path=path) for path in content_files)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's better to leave it where it is; this isn't a map function, btw, as the returns aren't in a particular order.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bad choice of a name then. But I still wouldn't want to cram things that can be separated in one place, if possible.

map_unordered_threads maybe?

@@ -349,8 +350,9 @@ def download_directory(
output_path: os.PathLike,
project_context: ProjectContext | None = None,
ignore_directory_name: bool = False,
max_concurrent: int = 4,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we make it serial by default to avoid surprises?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking of having max_concurrent not use threads at all, in case it becomes a problem, but I don't forsee it being one (famous last words.)
I think doing it in parallel by default is worthwhile, personally.

Copy link
Collaborator

@GianlucaFicarelli GianlucaFicarelli Jun 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to be aware, if we integrate this in a service using fastapi (obi-one), and we have N concurrent requests served by sync endpoints, then we'll create up to N*max_concurrent additional threads (besides the N needed by the fastapi pool, 40 by default according to https://www.starlette.io/threadpool/).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you think that's a problem?
obi-one can limit the max_concurrent, if it wants

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, but bringing a service to a halt/crash in order to realize that max_concurrent has to be lowered in download_directory is not great, is it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would be surprised if threads doing downloads is what pushes it over the edge; assuming all 40 threads are spawning another 4, that's a limit of 160 - which is high, but not unheard of.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants