`iterable_subprocess` based `annexworktree()` #539

mih · 2023-11-21T16:21:13Z

This is moving PR mih#3 here, to get it tested properly. Below is the original description by @christian-monch

It includes (sits on top of) #538 (merged now)

TODO

Rebase on Add a mode and test for non-recursive iter_gitworktree(), plus fixes #552
Add documentation
Make keep_ends=True str.strip() combo obsolete (and ease debugging why @mih cannot do it, ideally)
fp=True currently only does meaningful things for annexed files, but it should also act properly for any other file. This aspect of the functionality is still undertested.

This PR adds an implementation of iter_annexworktree that is based on iterable_subprocess and the ideas laid out in issue #537.

This PR includes a collection of data processors that are basically generator wrappers.
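To illustrate what "generator wrapper" means here, the following is a minimal sketch and not the code from this PR. The names `decode_chunks` and `load_json_items` are hypothetical, and the sketch assumes that every incoming chunk is already one complete item.

```python
import json
from typing import Any, Iterable, Iterator


def decode_chunks(chunks: Iterable[bytes], encoding: str = "utf-8") -> Iterator[str]:
    """Wrap an iterable of bytes and yield decoded strings.

    Assumption: each chunk is a complete item, so no multi-byte
    character is split across two chunks.
    """
    for chunk in chunks:
        yield chunk.decode(encoding)


def load_json_items(items: Iterable[str]) -> Iterator[Any]:
    """Wrap an iterable of JSON strings and yield the parsed objects."""
    for item in items:
        yield json.loads(item)


# Wrappers compose like a pipeline; nothing is computed until consumed.
raw = [b'{"key": "a"}', b'{"key": "b"}']
records = list(load_json_items(decode_chunks(raw)))
```

Because each wrapper takes an iterable and returns an iterator, stages can be chained freely and items flow through one at a time.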

The PR also modifies iter_gitworktree to use iterable_subprocess instead of the datalad-core runner.

The current implementation of iter_annexworktree iterates over a dataset with 33k annex files in less than 5 seconds on my machine.

datalad_next/processors/__init__.py

datalad_next/processors/decode_processor.py

datalad_next/processors/json_processor.py

datalad_next/processors/lines_processor.py

codecov · 2023-11-21T16:54:38Z

Codecov Report

Attention: 1 lines in your changes are missing coverage. Please review.

Comparison is base (8f404a1) 92.73% compared to head (8f27b7d) 92.87%.

Files	Patch %	Lines
datalad_next/iter_collections/annexworktree.py	98.48%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #539      +/-   ##
==========================================
+ Coverage   92.73%   92.87%   +0.14%     
==========================================
  Files         137      143       +6     
  Lines       10157    10365     +208     
  Branches     1103     1141      +38     
==========================================
+ Hits         9419     9627     +208     
  Misses        714      714              
  Partials       24       24

datalad_next/processors/pattern_processor.py

datalad_next/processors/reroute_processor.py

This commit renames the module `datalad_next.processors` to `datalad_next.itertools`. This makes more sense since the functions that are defined in the module operate on iterables and their results are themselves iterable. Renaming was suggested in the review comment: datalad#539 (comment)

mih · 2023-11-22T10:27:04Z

After normalizing the benchmark conditions further, I have the following stats:

# this code
In [6]: %timeit list(iter_annexworktree('.'))
3.59 s ± 104 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# original sketch
In [7]: %timeit run_itersubproc()
4.17 s ± 317 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# baseline with iterable_subprocess
In [9]: %timeit list(iter_gitworktree('.'))
302 ms ± 8.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

This compares to a shell pipe (using jq as a consumer to add the JSON decoding load):

❯ multitime -n 5 git ls-files | git annex find --anything --format="\${key}\n" --batch | grep -v '^[[:space:]]*$' | git annex examinekey --json --batch | jq > /dev/null
===> multitime results
1: git ls-files
            Mean        Std.Dev.    Min         Median      Max
real        3.399       0.169       3.236       3.369       3.703       
user        0.006       0.005       0.000       0.003       0.012       
sys         0.006       0.006       0.000       0.008       0.015

~8% difference, and nearly within the margin of error.

This works for me!

This addresses an issue brought up in datalad#539 (comment). This changeset removes the default argument to avoid the impression that "line-processing" is the main target. Neither the code nor the existing usage implies that. The ability to do line-splitting is not affected. The documentation needs no adjustment.

This commit uses a `b'\n'` separator when itemizing the output of `git annex find` and does not keep line endings. This simplifies the call to `itemize` and the test for a non-empty key in the splitter function of the enclosing `route_out`. It also requires adding `b'\n'` to drive the consuming `git annex examinekey` subprocess. This is done with `intersperse`.
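The idea can be sketched as follows. This is a self-contained illustration under stated assumptions, not the implementation in `datalad_next`: `itemize_bytes` and `intersperse_value` are hypothetical stand-ins for `itemize` and `intersperse`, and their signatures are made up for this example.

```python
from typing import Iterable, Iterator


def itemize_bytes(chunks: Iterable[bytes], separator: bytes = b"\n") -> Iterator[bytes]:
    """Yield separator-delimited items from arbitrarily sized chunks.

    The separator is dropped, so an empty line arrives as b''.
    """
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        # Everything before the last separator is complete; the rest waits.
        *complete, buffer = buffer.split(separator)
        yield from complete
    if buffer:
        yield buffer


def intersperse_value(filler: bytes, items: Iterable[bytes]) -> Iterator[bytes]:
    """Yield the items with `filler` placed between consecutive items."""
    iterator = iter(items)
    try:
        yield next(iterator)
    except StopIteration:
        return
    for item in iterator:
        yield filler
        yield item


# Chunk boundaries do not need to coincide with line boundaries.
chunks = [b"KEY1\nKE", b"Y2\n\nKEY3"]
items = list(itemize_bytes(chunks))
# With line endings dropped, the non-empty test is a plain truthiness check.
keys = [item for item in items if item]
batch_input = list(intersperse_value(b"\n", keys))
```

With the separator stripped, an empty item is simply falsy, and the newline that a line-based consumer expects is added back explicitly instead of being carried along inside each item.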

This commit reduces the number of concepts in the implementation of `route_in` and `route_out`. It removes the additional use of booleans in favor of solely using `StoreOnly`. It also replaces tuple indices with semantically named variables.
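A rough sketch of the routing pattern described here, using a single sentinel instead of extra booleans. This is an illustration only: the function signatures, the `deque` store, and the sample data are assumptions for this example and may differ from the actual `route_out` and `route_in` in the PR.

```python
from collections import deque
from typing import Any, Callable, Deque, Iterable, Iterator, Tuple


class StoreOnly:
    """Sentinel: the item is only stored and bypasses processing."""


def route_out(items: Iterable[Any], store: Deque[Tuple[Any, Any]],
              splitter: Callable[[Any], Tuple[Any, Any]]) -> Iterator[Any]:
    """Yield the parts that need processing; remember everything in `store`."""
    for item in items:
        to_process, to_store = splitter(item)
        store.append((to_process, to_store))
        if to_process is not StoreOnly:
            yield to_process


def route_in(processed_items: Iterable[Any], store: Deque[Tuple[Any, Any]],
             joiner: Callable[[Any, Any], Any]) -> Iterator[Any]:
    """Rejoin processed results with stored data, preserving input order."""
    for processed in processed_items:
        to_process, stored = store.popleft()
        # Emit items that bypassed processing before the current result.
        while to_process is StoreOnly:
            yield joiner(StoreOnly, stored)
            to_process, stored = store.popleft()
        yield joiner(processed, stored)
    # Whatever is left after the last result was never processed.
    while store:
        _, stored = store.popleft()
        yield joiner(StoreOnly, stored)


# Hypothetical input: (path, key) pairs; an empty key means "not annexed".
entries = [("a.txt", b""), ("b.dat", b"KEY-b"), ("c.txt", b""),
           ("d.dat", b"KEY-d"), ("e.txt", b"")]
store: Deque[Tuple[Any, Any]] = deque()
to_process = route_out(
    entries, store,
    lambda entry: (entry[1], entry[0]) if entry[1] else (StoreOnly, entry[0]),
)
processed = (key.decode().lower() for key in to_process)  # stand-in processing
results = list(route_in(
    processed, store,
    lambda result, path: (path, None if result is StoreOnly else result),
))
```

Using one sentinel for "do not process" keeps the splitter and joiner contracts to a single concept, and the named tuple elements make the bookkeeping readable.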

This commit:
- adds a docstring for `iter_annexworktree`,
- includes the `iter_annexworktree` documentation in the module documentation.

It is modeled after that of `iter_gitworktree()`, but aims to avoid duplication with it. The change also fixes various issues in the source documentation, discovered in this process.

Documents what is TODO.

Previously, it would only open annex objects. Now regular files (tracked or untracked) and symlink targets (via the symlink) are also opened, if they actually exist. The corresponding test is extended appropriately.

Minimal change, because we just pass it on to `iter_gitworktree()`. Still added a smoke test. This is now ready for use in Gooey. Ping datalad#323

This generalizes an approach from datalad#539. It is implemented in a way that enables reuse of the helpers in that PR too. With this change regular files (tracked or untracked) and symlink targets (via the symlink) are also opened, if they actually exist. Closes datalad#553

mih · 2023-12-06T10:35:41Z

#555 brings helpers that can be used to remove duplication of fp=True handling.

This commit fixes two issues with tests under Windows:
1. Test files were windows-1252 encoded
2. Line endings in saved test files do not match the line endings that were provided in `Path.write_text`, if executed under Windows

The issues are fixed by:
1. Specifying encoding='utf-8' in `Path.write_text()`
2. Not using line-endings in test-file content
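A small sketch of the fix, assuming a throwaway temporary directory and made-up file content (neither is taken from the actual tests):

```python
import tempfile
from pathlib import Path

# Non-ASCII content, deliberately without any line endings.
content = "äöü ß € no line endings here"

with tempfile.TemporaryDirectory() as tmp:
    test_file = Path(tmp) / "sample.txt"
    # An explicit encoding avoids the platform default
    # (which can be windows-1252 on Windows).
    test_file.write_text(content, encoding="utf-8")
    stored = test_file.read_bytes()
```

Since the content has no `\n`, text-mode newline translation on Windows has nothing to rewrite, so the bytes on disk are the same on every platform.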

This commit fixes a format string in a `git annex find` command. The documentation states that `\n`, i.e. two characters, a backslash and an `n`, instructs git annex to write a newline character. We had used the Python string '\n', i.e. a single character: newline. Although git annex seems to accept a newline and emits a newline, that is undocumented behavior and should therefore not be used.
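The difference between the two spellings is easy to check in Python. The argument list below is only illustrative; it reuses the flags from the shell pipeline quoted earlier in this thread and is not executed here.

```python
# A Python newline escape: ONE character.
newline_char = '\n'
# An escaped backslash followed by 'n': TWO characters, as documented.
backslash_n = '\\n'

# Illustrative argument list; the format string ends in backslash + 'n'.
cmd = ['git', 'annex', 'find', '--anything', '--format=${key}\\n', '--batch']
format_arg = cmd[4]
```

A raw string literal, `r'\n'`, is an equivalent way to write the two-character form.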

mih · 2023-12-06T13:18:58Z

This is done! A monumental effort. Thanks @christian-monch !

mih mentioned this pull request Nov 21, 2023

Iter annexworktree subproc mih/datalad-next#3

Closed