Skip to content

iterable_subprocess based annexworktree() #539

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 44 commits into from
Dec 6, 2023

Conversation

mih
Copy link
Member

@mih mih commented Nov 21, 2023

This is moving PR mih#3 here, to get it tested properly. Below is the original description by @christian-monch

It includes (sits on top of) #538 (merged now)

TODO


This PR adds an implementation of iter_annexworktree that is based on iterable_subprocesses and the ideas laid out in issue #537.

This PR included a collection of data processors that are basically generator-wrapper.

The PR also modifies iter_gitworktree to use iterable_subprocesses instead of the datalad-core runner.

The current implementation of iter_annexworktree iterates over a dataset with 33k annex files in less than 5 seconds on my machine.

Copy link

codecov bot commented Nov 21, 2023

Codecov Report

Attention: 1 lines in your changes are missing coverage. Please review.

Comparison is base (8f404a1) 92.73% compared to head (8f27b7d) 92.87%.

Files Patch % Lines
datalad_next/iter_collections/annexworktree.py 98.48% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #539      +/-   ##
==========================================
+ Coverage   92.73%   92.87%   +0.14%     
==========================================
  Files         137      143       +6     
  Lines       10157    10365     +208     
  Branches     1103     1141      +38     
==========================================
+ Hits         9419     9627     +208     
  Misses        714      714              
  Partials       24       24              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

mih

This comment was marked as outdated.

@mih

This comment was marked as outdated.

@mih mih mentioned this pull request Nov 22, 2023
christian-monch added a commit to christian-monch/datalad-next that referenced this pull request Nov 22, 2023
This commit renames the module
`datalad_next.processors` to
`datalad_next.itertools`. This makes more
sense since the functions that are defined
in the module operate on iterables and their
results are themselves iterable.

Renaming was suggested in the review comment:
datalad#539 (comment)
@mih
Copy link
Member Author

mih commented Nov 22, 2023

After normalizing the benchmark conditions further, I have the following stats:

# this code
In [6]: %timeit list(iter_annexworktree('.'))
3.59 s ± 104 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# original sketch
In [7]: %timeit run_itersubproc()
4.17 s ± 317 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# baseline with iterable_subprocess
In [9]: %timeit list(iter_gitworktree('.'))
302 ms ± 8.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

This compared to a shell pipe (using jq as consumer that added the JSON decoding load):

❯ multitime -n 5 git ls-files | git annex find --anything --format="\${key}\n" --batch | grep -v '^[[:space:]]*$' | git annex examinekey --json --batch | jq > /dev/null
===> multitime results
1: git ls-files
            Mean        Std.Dev.    Min         Median      Max
real        3.399       0.169       3.236       3.369       3.703       
user        0.006       0.005       0.000       0.003       0.012       
sys         0.006       0.006       0.000       0.008       0.015       

~8% difference, and nearly within the margin of error.

This works for me!

@mih

This comment was marked as outdated.

christian-monch added a commit to christian-monch/datalad-next that referenced this pull request Nov 23, 2023
This commit renames the module
`datalad_next.processors` to
`datalad_next.itertools`. This makes more
sense since the functions that are defined
in the module operate on iterables and their
results are themselves iterable.

Renaming was suggested in the review comment:
datalad#539 (comment)
@christian-monch

This comment was marked as outdated.

@christian-monch christian-monch force-pushed the iter-annexworktree-subproc branch 4 times, most recently from 49765e5 to e0e21dd Compare November 23, 2023 15:06
mih added a commit to mih/datalad-next that referenced this pull request Nov 23, 2023
This addresses an issues brought up in
datalad#539 (comment)

This changeset removes the default argument to avoid the impression that
"line-processing" is the main target. The code does not imply that,
and the existing usage also not.

The possibility to do line-splitting is not touched (or removed). The
documentation needs no adjustment.
christian-monch and others added 12 commits December 6, 2023 09:55
Thie commit uses a `b'\n'` separator when itemizing
the output of `git annex find` and does not keep
line endings. This simplifies the call to
`itemize` and the test for a non-empty key in
the splitter-function of the enclosing
`route_out`.

It also requires to add `b'\n'` to drive the
consuming `git annex examinekey`-subprocess.
This is done with `intersperse`.
This commit reduces the number of concepts in
the implementation of `route_in` and `route_out`.
It removes the additional use of booleans in
favor of solely using `StoreOnly`.

It also replaces tuple indices with semantically
named variables.
This commit does:

- add a docstring for `iter_annexworktree`,
- include `iter_annexworktree`-documentation
  in the module documentation.
It is modeled after that of `iter_gitworktree()`, but aims to avoid
duplication with it.

The change also fixes various issues in the source documentation,
discovered in this process.
@mih mih force-pushed the iter-annexworktree-subproc branch from 2d34199 to 0fd61d6 Compare December 6, 2023 08:55
mih added 2 commits December 6, 2023 10:32
Previously, it would only open annex objects.

Now regular files (tracked or untracked) and symlink targets
(via the symlink) are also opened, if they actually exist.

The corresponding test is extended appropriately.
Minimal change, because we just pass it on to `iter_gitworktree()`.
Still added a smoke test.

This is now ready for use in Gooey.

Ping datalad#323
@mih mih force-pushed the iter-annexworktree-subproc branch from aa1b4b3 to 7a6deed Compare December 6, 2023 09:32
@mih

This comment was marked as resolved.

mih added a commit to mih/datalad-next that referenced this pull request Dec 6, 2023
This generalizes an approach from
datalad#539. It is implemented in a
way that enables reuse of the helpers in that PR too.

With this change regular files (tracked or untracked) and symlink targets
(via the symlink) are also opened, if they actually exist.

Closes datalad#553
@mih
Copy link
Member Author

mih commented Dec 6, 2023

#555 brings helpers that can be used to remove duplication of fp=True handling.

mih added a commit to mih/datalad-next that referenced this pull request Dec 6, 2023
This generalizes an approach from
datalad#539. It is implemented in a
way that enables reuse of the helpers in that PR too.

With this change regular files (tracked or untracked) and symlink targets
(via the symlink) are also opened, if they actually exist.

Closes datalad#553
christian-monch and others added 4 commits December 6, 2023 12:30
This commit fixes two issues with tests under Windows:

1. Test files were windows-1252 encoded
2. Line ending in saved test files do not match the line
   endings that were provided in `Path.write_text`, if
   executed under Windows

The issues are fixed by:

1. Specifying encoding='utf-8' in `Path.write_text()`
2. Not using line-endings in test-file content
This commit fixes a format string in a git annex find
command. The documentation states that a ``\n´´, i.e.
two characters: a backslash and an ``n´´, instructs
git annex to write a newline-character.
We have used the python string '\n', i.e. a single
character: newline. Although git annex seems to
accept newline and emits a newline, that is
undocumented behavior and should therefore not be
used.
@mih mih merged commit 26e6cb4 into datalad:main Dec 6, 2023
@mih mih deleted the iter-annexworktree-subproc branch December 6, 2023 13:18
@mih
Copy link
Member Author

mih commented Dec 6, 2023

This is done! A monumental effort. Thanks @christian-monch !

@mih mih added this to the 1.1 milestone Dec 6, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants