Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Histogram diff #289

Merged
merged 27 commits into from
Oct 11, 2024
Merged

Histogram diff #289

merged 27 commits into from
Oct 11, 2024

Conversation

gcanat
Copy link
Contributor

@gcanat gcanat commented Sep 30, 2024

#225

Implements histogram diff. I had to change quite a bit how the diff is printed because I believe having to annotate each line of the files with Change::{Unchanged, Deleted, Inserted} etc was not optimal. We just need to know where the hunks start and end.

Todo:

  • fix print_context()
  • fix print_edit_script()
  • test more thoroughly the output of unified print and default print.
  • refactor create_hunks_from_lcs()
  • fix unit tests

Nice to have

  • more code cleaning: for example, we probably dont need the ChangeData struct at all in the end.
  • eventually add a fallback method for cases where histogram diff is known to under-perform (big repetitive files), eg Myers diff

@jgarzik
Copy link
Contributor

jgarzik commented Oct 1, 2024

Are any of the community crates sufficiently low-dep and worth using? https://crates.io/crates/imara-diff or https://crates.io/crates/diff-match-patch-rs or https://crates.io/crates/diff ?

@gcanat
Copy link
Contributor Author

gcanat commented Oct 1, 2024

Ah yes, I didn't think that was an option. But indeed imara-diff has just two dependencies: ahash and hashbrown. It is probably the fastest diff library out there.

@jgarzik
Copy link
Contributor

jgarzik commented Oct 1, 2024

Ah yes, I didn't think that was an option. But indeed imara-diff has just two dependencies: ahash and hashbrown. It is probably the fastest diff library out there.

std-only is certainly the preference.

However, low-dep and no-dep algorithm crates that avoid our reinventing the wheel are OK as an exception.

We want to avoid "heavy" deps such as serde, nix, rustix, tokio etc. (alas, still working on that...). And avoid low-quality, incomplete crates.

@gcanat
Copy link
Contributor Author

gcanat commented Oct 7, 2024

Ok so the default diff, unified diff and edit_script are working fine. I think the implementation is quite performant. Just tested by cloning helix in two different dirs, checking out for one of them to 22.08.1 and doing target/release/diff -ru dir1 dir2 > helix.diff. It produces a 67k lines diff file in 140ms.

I just need to finish updating print_context() and then we should have a decent mvp.

@gcanat gcanat marked this pull request as ready for review October 10, 2024 13:01
@gcanat
Copy link
Contributor Author

gcanat commented Oct 10, 2024

Ready for review I believe.

@gcanat gcanat changed the title [WIP] Histogram diff Histogram diff Oct 10, 2024
Copy link
Contributor

@jgarzik jgarzik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jgarzik jgarzik merged commit 2132d42 into rustcoreutils:main Oct 11, 2024
2 checks passed
@jgarzik
Copy link
Contributor

jgarzik commented Oct 11, 2024

post-merge warning added to build:

   Compiling posixutils-text v0.2.1 (/Users/jgarzik/repo/posixutils-rs/text)
warning: variant `Unchanged` is never constructed
  --> text/diff_util/change.rs:17:5
   |
14 | pub enum Change {
   |          ------ variant in this enum
...
17 |     Unchanged(ChangeData),
   |     ^^^^^^^^^
   |
   = note: `Change` has derived impls for the traits `Debug` and `Clone`, but these are intentionally ignored during dead code analysis
   = note: `#[warn(dead_code)]` on by default

warning: `posixutils-text` (bin "diff" test) generated 1 warning
warning: `posixutils-text` (bin "diff") generated 1 warning (1 duplicate)

also, increasing the speed of diff tests would be nice; they are the slowest tests in posixutils.

@gcanat gcanat deleted the naive_histo_diff branch October 11, 2024 06:03
@gcanat
Copy link
Contributor Author

gcanat commented Oct 11, 2024

Opened a PR to fix the issue with Change enum.
I dont know why the test speed is slow. It seems to lag on startup, maybe some I/O stuff ? When they start running they are fast: finished in 0.01s.
tree-tests is slower: finished in 3.08s

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants