add _readdirx for returning more object info gathered during dir scan #53377

Merged: 2 commits, Feb 29, 2024

@IanButterworth IanButterworth commented Feb 18, 2024

Note: This has been converted into an internal-only function for 1.11, as a backport for internal performance gains. In 1.12 it should be exported with a bikeshedded name.


Based on #53153 (comment) @vtjnash

The uv_fs_scandir that readdir uses gathers the object types, where known, but we currently do not use them.

This introduces an internal _readdirx and the DirEntry object (originally FileKind; names tbd) to give the opportunity for faster isfile, isdir, etc. by using that info. The objects also retain the dir, so both their name and their full path are available.

Hoping to help WSL, but it's not clear how common it is to have a filesystem that just returns UV_DIRENT_UNKNOWN.
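A minimal sketch of the idea described above, not the code in this PR: an entry keeps its directory, its name, and the kind recorded during the scan, and only queries the filesystem when that kind is unknown. `SketchEntry`, the `KIND_*` constants, `entrypath`, `sketch_isfile`, and `sketch_isdir` are hypothetical names, the numeric values are illustrative, and symlink handling is left out.

```julia
# Hypothetical stand-ins for the entry kinds reported by the directory scan.
# The names and numeric values are illustrative assumptions.
const KIND_UNKNOWN = 0
const KIND_FILE    = 1
const KIND_DIR     = 2

# Hypothetical entry type: keeps the directory, the name, and the scanned kind.
struct SketchEntry
    dir::String
    name::String
    kind::Int
end

# Full path is rebuilt from the retained directory and name.
entrypath(e::SketchEntry) = joinpath(e.dir, e.name)

# Use the kind recorded during the scan; only touch the filesystem when it is unknown.
sketch_isfile(e::SketchEntry) =
    e.kind == KIND_UNKNOWN ? isfile(entrypath(e)) : e.kind == KIND_FILE
sketch_isdir(e::SketchEntry) =
    e.kind == KIND_UNKNOWN ? isdir(entrypath(e)) : e.kind == KIND_DIR
```

When the scan already reported the kind, `sketch_isfile` and `sketch_isdir` are plain integer comparisons, which is where the savings would come from.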

On macOS:

julia> @time count(isfile, readdir("/Users/ian/Downloads", join=true))
  0.001515 seconds (1.20 k allocations: 98.406 KiB)
219

julia> @time count(isfile, readdirx("/Users/ian/Downloads"))
  0.000237 seconds (245 allocations: 33.344 KiB)
219

julia> readdirx()
45-element Vector{Base.Filesystem.DirEntry}:
 Base.Filesystem.DirEntry(".", ".DS_Store", 1)
 Base.Filesystem.DirEntry(".", ".buildkite-external-version", 1)
 Base.Filesystem.DirEntry(".", ".clang-format", 1)
 Base.Filesystem.DirEntry(".", ".clangd", 1)
 Base.Filesystem.DirEntry(".", ".codecov.yml", 1)
 Base.Filesystem.DirEntry(".", ".devcontainer", 2)
...

julia> for obj in Base.Filesystem._readdirx(pwd())
           if isfile(obj) && obj == "README.md"
               @info "readme found: $(joinpath(obj))"
           end
       end
[ Info: readme found: /Users/ian/Documents/GitHub/julia/README.md

@IanButterworth IanButterworth added the filesystem Underlying file system and functions that use it label Feb 18, 2024
@IanButterworth IanButterworth force-pushed the ib/readdir_objects branch 2 times, most recently from ce462ac to 59164eb Compare February 18, 2024 02:47
@fatteneder

fatteneder commented Feb 18, 2024

> Hoping to help WSL, but it's not clear how common it is to have a filesystem that just returns UV_DIRENT_UNKNOWN.

Just a side comment:
I think it would be possible to detect the file system for which rawtype gives useful information using uv_fs_statfs (https://docs.libuv.org/en/v1.x/fs.html#c.uv_fs_statfs).
The relevant fs type constants which are supported according to https://docs.libuv.org/en/v1.x/fs.html#c.uv_fs_scandir_next would be

BTRFS_SUPER_MAGIC     0x9123683e
BTRFS_TEST_MAGIC      0x73727279
EXT2_OLD_SUPER_MAGIC  0xef51
EXT2_SUPER_MAGIC      0xef53
EXT3_SUPER_MAGIC      0xef53
EXT4_SUPER_MAGIC      0xef53

But I think this is not of much help; using cached stat/lstat fields instead should provide cheap enough isfile, isdir, ... calls in case rawtype is unknown.
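As a rough sketch of the detection idea above, assuming the filesystem type id has already been obtained from a statfs-style call: the check reduces to set membership over the magic numbers quoted in the list. `DIRENT_TYPE_MAGICS` and `reports_entry_types` are hypothetical names, and the set contains only the constants from that list.

```julia
# Filesystem magic numbers copied from the list above.
const DIRENT_TYPE_MAGICS = Set{UInt64}([
    0x9123683e,  # BTRFS_SUPER_MAGIC
    0x73727279,  # BTRFS_TEST_MAGIC
    0xef51,      # EXT2_OLD_SUPER_MAGIC
    0xef53,      # EXT2_SUPER_MAGIC, EXT3_SUPER_MAGIC, EXT4_SUPER_MAGIC
])

# Hypothetical helper: `ftype` is the filesystem type id reported by statfs.
# Returns true when the id is one of the magic numbers listed above.
reports_entry_types(ftype::Integer) = UInt64(ftype) in DIRENT_TYPE_MAGICS
```

A caller could use such a check to decide whether an unknown rawtype is likely, but as noted above it may not be worth the extra call.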

@IanButterworth

I think readdirx shouldn't ever call stat, because it's too expensive given the user may not even need to stat a file that, for instance, doesn't have a name that matches a pattern.

This is more about not throwing information away, rather than always being complete
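To illustrate the ordering argument above, here is a small sketch that uses only the existing `readdir`, `endswith`, and `isfile`: the cheap name test runs first, so entries whose names do not match are never stat'ed. `count_matching_files` is a hypothetical helper, not part of this PR.

```julia
# Hypothetical helper: count regular files in `dir` whose name ends with `suffix`.
function count_matching_files(dir::AbstractString, suffix::AbstractString)
    n = 0
    for name in readdir(dir)
        # Name check only: no filesystem access for non-matching entries.
        endswith(name, suffix) || continue
        # Only the matching names pay for a type check.
        isfile(joinpath(dir, name)) && (n += 1)
    end
    return n
end
```

With entries that already carry their type from the scan, the second check could also avoid the stat in the common case.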

@topolarity

topolarity commented Feb 20, 2024

On WSL2, this gives me a ~300x speedup (as expected) 🎉

julia> @time count(isfile, readdir("/mnt/c/Windows/system32", join=true))
 11.344859 seconds (25.32 k allocations: 1.850 MiB)
4913

julia> @time count(isfile, readdirx("/mnt/c/Windows/system32"))
  0.034974 seconds (5.08 k allocations: 545.961 KiB)
4913

@topolarity

Also provides a nice 50x speed-up (and avoids an EACCES error) on native Windows:

julia> @time count(f->try isfile(f) catch _ false end, readdir("/Windows/system32", join=true))
  0.230555 seconds (144.26 k allocations: 6.848 MiB, 3.29% gc time, 11.45% compilation time)
4913

julia> @time count(isfile, readdirx("/Windows/system32"))
  0.004718 seconds (5.08 k allocations: 664.711 KiB)
4913

@IanButterworth IanButterworth marked this pull request as ready for review February 22, 2024 03:17
@IanButterworth

I haven't come up with clearly better names so far, but some thoughts:

  • readdirx: readdirinfo makes sense but might be a bit awkward to read?

  • FileKind: Entry might make more sense as a generic filesystem entry, but maybe too generic without the filesystem context. It could be non-exported, so always show as Filesystem.Entry I think?

@topolarity

My vote would be to rename FileKind to DirEntry (esp. since this is the only place we expect such a struct to appear)

I actually like readdirx but maybe Unix has corrupted me. readdirinfo makes me feel like I'm doing a stat on a directory or similar.

@IanButterworth

If we backport this to 1.11, it can be used to speed up tab completion hint path scans (estimated at ~300x on WSL, ~50x on Windows).

I'll mark it for triage to discuss whether that's reasonable.

@IanButterworth IanButterworth added the triage This should be discussed on a triage call label Feb 26, 2024
@LilithHafner

Triage thinks it would be good to name this _readdirx and keep it internal, then merge and backport, and add it as a public feature in 1.12, possibly as readdirx but possibly under a different name. Oscar and I personally don't love the name readdirx, and we don't want that bikeshed to block these perf improvements from reaching tab completion.

@LilithHafner LilithHafner removed the triage This should be discussed on a triage call label Feb 29, 2024
@IanButterworth IanButterworth changed the title add readdirx for returning more object info gathered during dir scan add _readdirx for returning more object info gathered during dir scan Feb 29, 2024
Co-Authored-By: Cody Tapscott <84105208+topolarity@users.noreply.github.com>
@IanButterworth

@topolarity, in view of the triage decision that this should go into 1.11 as an internal function so we can figure out the proper name, API, etc. for 1.12, could you give a final review? Thanks.

@topolarity topolarity left a comment

Thanks for putting this together!

Greatly looking forward to the performance improvements this will enable.

@IanButterworth IanButterworth added the merge me PR is reviewed. Merge when all tests are passing label Feb 29, 2024
@LilithHafner LilithHafner added the backport 1.11 Change should be backported to release-1.11 label Feb 29, 2024
@IanButterworth

Just an idea for the bikeshedded names: readdirentries and DirEntry. That ties the two together.

Co-Authored-By: Cody Tapscott <84105208+topolarity@users.noreply.github.com>
@IanButterworth IanButterworth merged commit 989c4db into JuliaLang:master Feb 29, 2024
7 checks passed
@inkydragon inkydragon removed the merge me PR is reviewed. Merge when all tests are passing label Mar 1, 2024
KristofferC pushed a commit that referenced this pull request Mar 1, 2024
@KristofferC KristofferC mentioned this pull request Mar 1, 2024
60 tasks
KristofferC added a commit that referenced this pull request Mar 17, 2024
Backported PRs:
- [x] #39071 <!-- Add a lazy `logrange` function and `LogRange` type -->
- [x] #51802 <!-- Allow AnnotatedStrings in log messages -->
- [x] #53369 <!-- Orthogonalize re-indexing for FastSubArrays -->
- [x] #48050 <!-- improve `--heap-size-hint` arg handling -->
- [x] #53482 <!-- add IR encoding for EnterNode -->
- [x] #53499 <!-- Avoid compiler warning about redefining jl_globalref_t
-->
- [x] #53507 <!-- update staled `Core.Compiler.Effects` documentation
-->
- [x] #53408 <!-- task splitting: change additive accumulation to
multiplicative -->
- [x] #53523 <!-- add back an alias for `check_top_bit` -->
- [x] #53377 <!-- add _readdirx for returning more object info gathered
during dir scan -->
- [x] #53525 <!-- fix InteractiveUtils call in Base.runtests on failure
-->
- [x] #53540 <!-- use more efficient `_readdirx` for tab completion -->
- [x] #53545 <!-- use `_readdirx` for `walkdir` -->
- [x] #53551 <!-- revert "Add @create_log_macro for making custom styled
logging macros (#52196)" -->
- [x] #53554 <!-- Always return a value in 1-d circshift! of
abstractarray.jl -->
- [x] #53424 <!-- yet more atomics & cache-line fixes on work-stealing
queue -->
- [x] #53571 <!-- Update Documenter to v1.3 for inventory writing -->
- [x] #53403 <!-- Move parallel precompilation to Base -->
- [x] #53589 <!-- add back `unsafe_convert` to pointer for arrays -->
- [x] #53596 <!-- build: remove extra .a file -->
- [x] #53606 <!-- fix error path in `precompilepkgs` -->
- [x] #53004 <!-- Unexport with, at_with, and ScopedValue from Base -->
- [x] #53629 <!-- typo fix in scoped values docs -->
- [x] #53630 <!-- sroa: Fix incorrect scope counting -->
- [x] #53598 <!-- Use Base parallel precompilation to build stdlibs -->
- [x] #53649 <!-- precompilepkgs: package in boths deps and weakdeps are
in fact only weak -->
- [x] #53671 <!-- Fix bootstrap Base precompile in cross compile
configuration -->
- [x] #52125 <!-- Load Pkg if not already to reinstate missing package
add prompt -->
- [x] #53602 <!-- Handle zero on arrays of unions of number types and
missings -->
- [x] #53516 <!-- permit NamedTuple{<:Any, Union{}} to be created -->
- [x] #53643 <!-- Bump CSL to 1.1.1 to fix libgomp bug -->
- [x] #53679 <!-- move precompile workload back from Base -->
- [x] #53663 <!-- add isassigned methods for reinterpretarray -->
- [x] #53662 <!-- [REPL] fix incorrectly cleared line after completions
accepted -->
- [x] #53611 <!-- Linalg: matprod_dest for Diagonal and adjvec -->
- [x] #53659 <!-- fix #52025, re-allow all implicit pointer casts in
cconvert for Array -->
- [x] #53631 <!-- LAPACK: validate input parameters to throw informative
errors -->
- [x] #53628 <!-- Make some improvements to the Scoped Values
documentation. -->
- [x] #53655 <!-- Change tbaa of ptr_phi to tbaa_value  -->
- [x] #53391 <!-- Default to the medium code model in x86 linux -->
- [x] #53699 <!-- Move `isexecutable, isreadable, iswritable` to
`filesystem.jl` -->
- [x] #41232 <!-- Fix linear indexing for ReshapedArray if the parent
has offset axes -->
- [x] #53527 <!-- Enable analyzegc checks for try catch and fix found
issues -->
- [x] #52092 
- [x] #53682 <!-- Increase build precompilation -->
- [x] #53720 
- [x] #53553 <!-- typeintersect: fix `UnionAll` unaliasing bug caused by
innervars. -->

Contains multiple commits, manual intervention needed:
- [ ] #53305 <!-- Propagate inbounds in isassigned with CartesianIndex
indices -->

Non-merged PRs with backport label:
- [ ] #53736 <!-- fix literal-pow to return the right type when the base
is -1 -->
- [ ] #53707 <!-- Make ScopedValue public -->
- [ ] #53696 <!-- add invokelatest to on_done callback in bracketed
paste -->
- [ ] #53660 <!-- put Logging back in default sysimage -->
- [ ] #53509 <!-- revert moving "creating packages" from Pkg.jl -->
- [ ] #53452 <!-- RFC: allow Tuple{Union{}}, returning Union{} -->
- [ ] #53402 <!-- Add `jl_getaffinity` and `jl_setaffinity` -->
- [ ] #52694 <!-- Reinstate similar for AbstractQ for backward
compatibility -->
- [ ] #51928 <!-- Styled markdown, with a few tweaks -->
- [ ] #51816 <!-- User-themable stacktraces -->
- [ ] #51811 <!-- Make banner size depend on terminal size -->
- [ ] #51479 <!-- prevent code loading from lookin in the versioned
environment when building Julia -->
@KristofferC KristofferC removed the backport 1.11 Change should be backported to release-1.11 label Mar 18, 2024
@IanButterworth IanButterworth deleted the ib/readdir_objects branch August 3, 2024 02:17