fix: performance of _ls_tree #2103
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Hi @awgr, thanks a lot for your PR and clear profiling explanations! And thanks for finding out the deepcopy was the culprit here 😬 I'd like to take advantage of your PR to slightly refactor this part of the code here and in particular:
Would you like to do it? Otherwise, I am more than happy to take care of it myself, if it's fine with you that I push changes to your branch. Thanks in advance!
Ah, feel free to push code to my branches haha, but if you don't have time I'll make an effort to take care of it later tonight.
After reviewing this again I've realized that a slightly larger change is needed, beyond both the original change (which would've caused problems later) and what you've asked for above.
Thanks for rechecking and letting me know @awgr! Let me know when the PR is suitable for a second review then :) If you have any question, feel free to ask.
Sorry, not yet. I hadn't had time, and then came back tonight to take another look. What I'd realized is that I modified the code locally, leading to the improved profile, but the modification did not ensure the dircache was safe from the caller modifying the returned values. After thinking more, I think there are two typical ways to prevent modification: (1) make the returned API objects immutable, or (2) keep the defensive copy.
(1) is ideal, but I think the API components (LastCommitInfo, BlobLfsInfo, BlobSecurityInfo, ...) would need to be effectively immutable on create. I'm not sure what patterns in HF hub that would break... While I respect the intent of the defensive copy, it seems to account for a large proportion of the compute; I wonder whether a caller might simply opt to disable it, under the reasoning that they assume responsibility for problems that arise from mutating the values returned by the API. What do you think?
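Option (1) could look something like the following, a minimal sketch using a frozen dataclass. The class name and fields here are illustrative stand-ins, not the actual huggingface_hub definitions:

```python
from dataclasses import dataclass

# Sketch of option (1): a frozen dataclass rejects attribute assignment after
# construction. FrozenBlobLfsInfo is a hypothetical stand-in, not the real
# huggingface_hub BlobLfsInfo class.
@dataclass(frozen=True)
class FrozenBlobLfsInfo:
    size: int
    sha256: str

info = FrozenBlobLfsInfo(size=1024, sha256="abc123")
try:
    info.size = 0  # type: ignore[misc]
except AttributeError:
    # dataclasses.FrozenInstanceError is a subclass of AttributeError
    print("mutation blocked")
```

As noted below, actually freezing the real classes would be a backward-compatibility question of its own.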
Thanks @awgr for taking a look back at it. I agree that (1) would be ideal but I'd leave this to a separate issue. Making LastCommitInfo, BlobLfsInfo, BlobSecurityInfo, ... immutable will require some internal changes that are not trivial, especially to make sure things stay backward compatible. However, given the numbers, I agree we should find a solution to replace the deepcopy. What do you think of implementing a weak copy specific to our needs? Something like

```python
def _weak_copy_ls_tree(items: List) -> List:
    return [
        {
            "name": item["name"],
            "size": item["size"],
            "type": item["type"],
            "blob_id": item.get("blob_id"),
            "tree_id": item.get("tree_id"),
            "lfs": BlobLfsInfo(**item["lfs"]) if item.get("lfs") is not None else None,
            "last_commit": LastCommitInfo(**item["last_commit"]) if item.get("last_commit") is not None else None,
            "security": BlobSecurityInfo(**item["security"]) if item.get("security") is not None else None,
        }
        for item in items
    ]
```

would work and should be faster than a generic deepcopy. Would you like to give it a try?
Thanks @Wauplin! I had the same thought, so either there are no nested mutable objects, or we are not concerned with mutations of those polluting the dircache? This approach will certainly improve the performance of ls tree, but it may introduce the same dircache-polluting problems, e.g. BlobSecurityInfo's field av_scan is also a mutable dictionary.
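The concern above can be shown in a few lines. This is a minimal sketch; the dict shape loosely mimics the dircache entries discussed in this thread:

```python
import copy

# Hypothetical dircache entry shaped like the ones discussed in this thread.
cached = {
    "name": "data/train.csv",
    "size": 10,
    "security": {"av_scan": {"status": "clean"}},  # nested mutable dict
}

shallow = copy.copy(cached)
shallow["name"] = "renamed.csv"  # top-level write: new outer dict, cache untouched
shallow["security"]["av_scan"]["status"] = "infected"  # nested write leaks into the cache

print(cached["name"])                           # still "data/train.csv"
print(cached["security"]["av_scan"]["status"])  # "infected" -- cache polluted
```

A shallow copy rebuilds only the outer container, so any nested mutable value stays shared with the cached original.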
Hey, sorry I'm a bit late here; I've been a little busy. I've updated the PR to shallow copy in the manner described; I used copy.copy for the shallow copy, and it performs identically to @Wauplin's code. So, this PR now moves the logic for building the cache_path_info into a single function, and that function can return either a shallow copy or not. I almost revised it to be called as _shallow_copy(out, depth=1), but it felt a little too generic.
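For reference, a depth-limited helper like the one mentioned could be sketched as follows. This is entirely hypothetical, not what the PR ships:

```python
import copy

def _shallow_copy(obj, depth=1):
    """Hypothetical depth-limited copy: rebuild containers down to `depth`
    levels, leaving anything deeper shared with the original."""
    if depth <= 0:
        return obj
    if isinstance(obj, dict):
        return {key: _shallow_copy(value, depth - 1) for key, value in obj.items()}
    if isinstance(obj, list):
        return [_shallow_copy(value, depth - 1) for value in obj]
    return copy.copy(obj)

out = [{"name": "a", "lfs": {"size": 1}}]
copied = _shallow_copy(out, depth=1)   # new list, same inner dicts
deeper = _shallow_copy(out, depth=2)   # new list and new inner dicts
```

The generic dispatch on container type is exactly why it felt too generic for a single call site.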
I'll profile this again on Monday or so... |
Thanks for getting back to this @awgr! Actually I think that if we go for a "non-safe-but-ok" solution, we should have a flag to enable/disable deepcopy that would be set to False by default (i.e. to not deepcopy). And when we don't deepcopy, there is no need to make copies at all (i.e. no intermediate state where we partially copy stuff).
I'm not a fan of a flag that would be hidden/rarely used (it does not follow the spec). I couldn't find a similar flag in the other fsspec implementations.
OK for not implementing the flag then. Thanks for letting me know about the other fsspec implementations.
Performing a shallow copy is also quite expensive (on large repos) based on this simple benchmark:

```python
import timeit

setup1 = """\
from huggingface_hub import HfFileSystem
fs = HfFileSystem(skip_instance_cache=True)
"""
t1 = timeit.timeit("fs.find('datasets/bigcode/the-stack-dedup')", setup=setup1, number=20)
print("Avg time taken:", t1)

setup2 = """\
from huggingface_hub import HfFileSystem
fs = HfFileSystem(skip_instance_cache=True)
fs.find('datasets/bigcode/the-stack-dedup')
"""
t2 = timeit.timeit("fs.find('datasets/bigcode/the-stack-dedup')", setup=setup2, number=20)
print("Avg time taken (cached):", t2)
```

Results:
PS: This PR does not properly copy the cached
Yes I am! Thanks for benchmarking these. If a downstream script (or the tests) really wants to modify the output, then they can make the copy themselves.
It seems we've come full circle 😁. Dropping the copy is ideal for me as it achieves the ~3x speed up we recorded in the linked issue. |
Finally approving! 🎉
Thanks both for the insights and discussion! At least we'll be able to come back here in the future if we wonder why we took this decision :)