Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

digest refactoring for performance #212

Open
1 of 4 tasks
pearsonca opened this issue Aug 21, 2024 · 5 comments
Open
1 of 4 tasks

digest refactoring for performance #212

pearsonca opened this issue Aug 21, 2024 · 5 comments

Comments

@pearsonca
Copy link
Contributor

pearsonca commented Aug 21, 2024

While working #210, I noticed opportunities for opportunites for consolidating / refactoring code in digest.c.

In some sense, digest.c is providing a Facade around interacting with the various quirks of the imported hashing libraries. It does some setup / reinterpretation of the input from R, uses a series of cases to handle each of the different algorithms, and then returns a consistent output.

It seems over time those cases have been internally implemented in slightly different ways (and I imagine, the hasher interfaces have changed over time after initial implementation). There is now a fair bit of repetitious code and/or inconsistent naming/etc that could be consolidated. This would make the assorted "Consider X" issues smaller lifts to accomplish, present opportunities for a bit of unit testing on the C side, benchmarking, etc - all the normal benefits of DRY.

I propose the following refactoring objectives:

  • refactor macros to functions (per discussion in expand hashing algorithms that support raw = TRUE option #210)
  • eliminate the duplicate code for streaming vs not approaches (e.g. by figuring out how to share the setup / return code, while switching between stream vs blob inputs)
  • more clearly distinguish "jobs" of various functions (e.g. I imagine digest would be as a more of traffic manager, with the individual hasher interface logic being modularized out; then adding a new hasher means adding a new consistently-surfaced-function + another case to digest and presto)
  • additional linting (e.g. spaces around binary operators)

Thoughts? Additional objectives? I think some of these (and possibly additional targets added via discussion) are entangled, but there's some potential to isolate at least some of these as their own PRs - preferences one way or another?

@eddelbuettel
Copy link
Owner

I am in favour of 1).

I am a little scared / see more downside than upside for 2) and 3) --- things work as they are, we manage to add new hasher so why rock the boat

I have so far refrained from larger-scale "style" only overhauls. I like when git blame refers to original contributions. [ I left your two-spaces indented lines alone too. ] I guess on the margin in favour but let's not get overboard.

Too little to keep you excited?

@pearsonca
Copy link
Contributor Author

pearsonca commented Aug 21, 2024

Can do a PR for just converting macros => functions, sure.

Then maybe draft something for one of the others, and see how it looks w/o (initially) super polishing it? If it seems promising, great, proceed to polish; if not, can always toss it.

@pearsonca
Copy link
Contributor Author

pearsonca commented Aug 28, 2024

Porting discussion from closed PRs #215 / #218 here:

  • any overhead reductions in digest need maintain code readability
  • better approach may be to introduce no-frills functionality

What could potentially work for that second item:

  • offering a direct (R) interface for each of the algorithms, with minimal parsing / error checking
  • refactor digest to do the error handling etc, then dispatch to the individual mode algorithm functions (this would invert the current relationship with e.g. sha1 R exports, so non-trivial work to do)
  • longer term that would suggest wanting to have more individualized wrappers in the C layer, but that wouldn't need to happen immediately
  • if pursing that longer term approach: also potentially makes more direct-to-C options available for dependent packages, if more done there to export the C layer

aside: the longer term approach would be advantageous for enabling people to implement their class-specific hashing extensions ala #23 and for potentially having a digestext or similar package, that offers custom class-specific extensions without bloating the core library.

@pearsonca pearsonca changed the title digest.c refactoring digest refactoring for performance Aug 28, 2024
@pearsonca
Copy link
Contributor Author

Draft approach for a refactor that would allow users to skip some of the overhead optionally: #220 - would appreciate some feedback there about those outlines before pursuing further.

@pearsonca
Copy link
Contributor Author

In the broken up version of #220 - next step is #222. Pending feedback, this is essentially encapsulating all the algorithms. The next would be to register all of them.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants