digest.R optimization 1: hash algo lookup #218

Closed
wants to merge 4 commits

Conversation

pearsonca (Contributor)

Extracted from #215

The alternative approach here would be to change the default argument, but from a user standpoint I think this version of the default is a good way to communicate the options.
R/digest.R (outdated, resolved)
@pearsonca (Contributor Author)

profvis::profvis(replicate(1000, digest("8675309", algo = "md5", serialize = FALSE, raw = TRUE)))

indicated four hot spots: (1) converting algo, (2) converting errormode, (3) converting length, and (4) running the algorithm itself. This PR addresses (1).
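
For context, a minimal sketch of the kind of lookup change (1) is about (not necessarily the exact code in this PR): resolve the algorithm name to an integer code with a single match() against the known names, instead of paying for match.arg() against the full default vector on every call. The function name and the (abbreviated) algorithm list below are illustrative.

# Sketch only; names and the algorithm list are illustrative.
algo_code <- function(algo) {
    algos <- c("md5", "sha1", "crc32", "sha256", "sha512",
               "xxhash32", "xxhash64", "murmur32", "spookyhash")
    idx <- match(algo[1L], algos)
    if (is.na(idx)) stop("Unsupported algorithm: ", algo[1L])
    idx                                # integer code handed on to .Call()
}
algo_code("md5")      # 1L
algo_code("sha256")   # 4L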

@pearsonca (Contributor Author)

(Also: should the ChangeLog update be deferred to (3)?)

@eddelbuettel (Owner)

Let's please get back to the convention that every PR has a ChangeLog entry. Thank you.

@@ -107,7 +106,7 @@ digest <- function(object, algo=c("md5", "sha1", "crc32", "sha256", "sha512",
     if (!is_streaming_algo) {
         val <- .Call(digest_impl,
                      object,
-                     as.integer(algoint),
+                     algoint,
                      as.integer(length),
                      as.integer(skip),
                      as.integer(raw),

Owner

The entire block is actually "wrong" in the sense that when I wrote it I had likely just learned about .Call() as the successor to .C(); the latter requires these casts, but the former does not.

So we could extend this PR (if you're game) to avoid or minimise the casting of the other arguments length, skip and raw. (I will have to check later what variable types are used at the C level.)

@pearsonca (Contributor Author), Aug 26, 2024

My thinking is that as each optimization comes in, the corresponding cast here gets removed. (All of these casts can definitely go away, since the coercion can be handled correctly on the C side by the SEXP accessors.)

Contributor Author

Which is to say: yes, I'll definitely address these. Question: do you want them addressed in this PR or as part of this series?

@eddelbuettel (Owner), Aug 26, 2024

We can take it one at a time too.

Now, out of curiosity (and I buy into the PR just for the cleaner expression of intent, i.e. the algo choice as an int): what performance gain do you claim this gets, and is it not likely only relevant for very frequent calls with really small payloads? Those exist, but also consider the fortune below ...

> fortunes::fortune("save")       # it got lucky here, 'save' can also retrieve / match other ones...

Paul Gilbert: [code comparing speed of apply(z,2,sum) vs. rep(1,10000)%*%z)] which seemed completely contrary to all my childhood teachings.
Douglas Bates: Must have had an interesting childhood if you spent it learning about the speeds of various matrix multiplication techniques.
Paul Gilbert: [...] why is apply so slow?
Brian Ripley: 'so slow' sic: what are you going to do in the 7ms you saved?
   -- Paul Gilbert, Douglas Bates, and Brian D. Ripley (discussing 'the incredible lightness of crossprod')
      R-devel (January 2005)

> 

@pearsonca (Contributor Author), Aug 26, 2024

I have yet to look systematically at any "large" payload cases. For "small" payloads (on the order of 10 characters), profvis indicates that the match.arg lines take as much time as, or more than, the actual .Call itself.

I have checked 1e6-character objects, and there these R-level parts drop to ~10% of the total (instead of >= 2/3).

R/digest.R (outdated)
@@ -54,10 +53,10 @@ digest <- function(object, algo=c("md5", "sha1", "crc32", "sha256", "sha512",
         file <- TRUE # nocov
     }

-    is_streaming_algo <- algo == "spookyhash"
+    is_streaming_algo <- algoint == 9L

@eddelbuettel (Owner), Aug 26, 2024

That's trickier. There is time in code execution, and there is clarity in code. This no longer wins on the second aspect.

(We could be tricky and make it a factor. Explicit, easy to read, and still passing an integer down.)
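
A sketch of the readable-but-still-integer idea (using a named constant rather than an actual factor; the constant name and the stand-in value for algoint are made up here, while 9L is the spookyhash code visible in the diff above):

# Sketch only: keep the intent readable while still comparing integers.
SPOOKYHASH <- 9L                 # spookyhash's integer code, per the diff
algoint    <- 9L                 # stand-in for the value from the algo lookup
is_streaming_algo <- algoint == SPOOKYHASH
is_streaming_algo                # TRUE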

Contributor Author

Can do it as an algo[1] comparison instead?

Contributor Author

(I think the reason the algo <- algo[1] reassignment was slowing things down is that it touches on when R treats function arguments as read-only versus not, and inadvertently triggers a copy operation?)
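
One way to probe that speculation (a sketch; tracemem() needs an R build with memory profiling enabled, which the CRAN binaries have):

# Does `algo <- algo[1]` duplicate the caller's vector, or merely
# allocate a new length-1 vector from the subset?
f <- function(algo = c("md5", "sha1", "crc32")) {
    algo <- algo[1]       # subsetting allocates a fresh length-1 vector
    algo
}
x <- c("md5", "sha1", "crc32")
tracemem(x)               # prints a message if x is ever duplicated
invisible(f(x))           # typically silent: x itself is not copied, but
                          # the subset still allocates on every call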

@pearsonca (Contributor Author)

Alright, having sorted out some "what's installed where, which session are we in" issues:

randomwords <- sample(c(LETTERS, letters, 0:9), 1e6, replace = TRUE)
randomword <- paste0(randomwords, collapse = "")
profvis::profvis(lapply(randomwords, \(w) digest(w, algo = "md5", serialize = FALSE, raw = TRUE)))
profvis::profvis(replicate(1e3, digest(randomword, algo = "md5", serialize = FALSE, raw = TRUE)))

The second one (a lot of really small hashes) is interesting (and this is with this revision; prior to it, the algo-related lines totalled up to roughly what is shown here for errormode).
[Screenshot: profvis output, 2024-08-26 09:47]

The first one (a few big hashes) is not, really: the line translating errormode does show up, but corresponds to <1% of the sampled run time.

@eddelbuettel (Owner), Aug 27, 2024

Sorry, I've been tied up with other things. Now, I may need to mull this over a little more, but I have the feeling that we are making 'mostly light and no heat' here. I am not yet convinced we need a change for performance: I can live with time spent on match.arg() in a perfectly valid and common code idiom. Also, as noted, at least one of the suggested changes reduces code clarity in my eyes.

@pearsonca (Contributor Author), Aug 28, 2024

I'll stand by for further discussion then. From my end, for {digest} to work for my case it needs to be much faster on small hashes (and that seems to require mostly work on the R-level overhead), but there are other approaches I can use (e.g. just wrapping xxh in that package).

BREAK

I do think the errormode version of solving the performance problem could just be pushing the match.arg into the error-checking function, without loss of code clarity. I can do that as a PR, if you like.
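
A sketch of that idea (the helper and its arguments are illustrative, not digest's actual internals): leave errormode alone on the fast path and run match.arg() only inside the helper that actually raises the condition, so the matching cost is paid only when something goes wrong.

# Illustrative only; names need not match digest's internals.
raise_error <- function(msg, errormode = c("stop", "warn", "silent")) {
    errormode <- match.arg(errormode)   # matched only on the error path
    switch(errormode,
           stop   = stop(msg, call. = FALSE),
           warn   = warning(msg, call. = FALSE),
           silent = invisible(NULL))
}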

@eddelbuettel (Owner)

> and that seems to require mostly work on the R-level overhead

I said that earlier in passing: maybe it is worth trying an alternate, no-frills accessor.

The way the vectorised access was added was quite nice; maybe we could think of something similar here without requiring baby-and-bathwater approaches.

> I do think the errormode version of solving the performance problem could just be pushing the match.arg into the error-checking function, without loss of code clarity. I can do that as a PR, if you like.

That has merit too. We would then not spend those microseconds checking an argument we do not use.

In short, let's try to refocus on a few, clear, simple, unambiguous improvements.

Oh and another approach we could consider (and something I now know better than when this started) is to export C-level accessors directly to other packages. Even less overhead.

@pearsonca mentioned this pull request Aug 28, 2024
@pearsonca (Contributor Author)

Alright, I've initiated a trim to the errormode checking, since that's just better.

Re no frills: how about an xxh(...) function along the lines of the sha1() functionality? The potential problem there is that all the sha1(...) functions ultimately pass to digest(), and this would be a different approach.
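
For concreteness, the rough shape such a wrapper might take (purely hypothetical; digest does not export an xxh() today, and the function and argument names here are made up):

# Hypothetical sketch of a thin xxh() convenience wrapper.
xxh <- function(object, bits = 64, ...) {
    algo <- if (bits == 32) "xxhash32" else "xxhash64"
    digest::digest(object, algo = algo, serialize = FALSE, ...)
}
xxh("8675309")

As written it would still inherit digest()'s R-level overhead, which is the same tension noted above for the sha1() family; a genuinely no-frills version would have to sit closer to the C level.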
