digest.R optimization 1: hash algo lookup #218

Closed
wants to merge 4 commits

Conversation

pearsonca (Contributor)

Extracted from #215

The alternative approach here would be to change the default argument, but from a user standpoint I think this version of the default is a good way to communicate the options.
R/digest.R (outdated, resolved)
@pearsonca (Contributor Author)

profvis::profvis(replicate(1000, digest("8675309", algo = "md5", serialize = FALSE, raw = TRUE)))

indicated four hot spots: (1) converting algo, (2) converting errormode, (3) converting length, and (4) running the algorithm itself. This PR addresses (1).
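
For context, a minimal sketch of the kind of lookup change (1) is about (not necessarily the exact code in this PR): resolve the algorithm name to an integer code with a single match() against the known names, instead of paying for match.arg() against the full default vector on every call. The function name and the (abbreviated) algorithm list below are illustrative.

# Sketch only; names and the algorithm list are illustrative.
algo_code <- function(algo) {
    algos <- c("md5", "sha1", "crc32", "sha256", "sha512",
               "xxhash32", "xxhash64", "murmur32", "spookyhash")
    idx <- match(algo[1L], algos)
    if (is.na(idx)) stop("Unsupported algorithm: ", algo[1L])
    idx                                # integer code handed on to .Call()
}
algo_code("md5")      # 1L
algo_code("sha256")   # 4L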

@pearsonca (Contributor Author)

(Also: should the ChangeLog update be deferred to (3)?)

@eddelbuettel (Owner)

Let's please get back to the convention that every PR has a ChangeLog entry. Thank you.

@@ -107,7 +106,7 @@ digest <- function(object, algo=c("md5", "sha1", "crc32", "sha256", "sha512",
     if (!is_streaming_algo) {
         val <- .Call(digest_impl,
                      object,
-                     as.integer(algoint),
+                     algoint,
                      as.integer(length),
                      as.integer(skip),
                      as.integer(raw),

Owner

The entire block is actually "wrong" in the sense that when I wrote it I had likely just learned about .Call() as the successor to .C(); the latter requires these casts, but the former does not.

So we could extend this PR (if you're game) to avoid or minimise the casting of the other arguments length, skip and raw. (I will have to check later what variable types are used at the C level.)

@pearsonca (Contributor Author), Aug 26, 2024

My thinking is that as each optimization comes in, the corresponding cast here gets removed. (All of these casts can definitely go away, since the coercion can be handled correctly on the C side by the SEXP accessors.)

Contributor Author

Which is to say: yes, I'll definitely address these. Question: do you want them addressed in this PR or as part of this series?

@eddelbuettel (Owner), Aug 26, 2024

We can take it one at a time too.

Now, out of curiosity (and I buy into the PR just for the cleaner expression of intent, i.e. the algo choice as an int): what performance gain do you claim this gets, and is it not likely only relevant for very frequent calls with really small payloads? Those exist, but also consider the fortune below ...

> fortunes::fortune("save")       # it got lucky here, 'save' can also retrieve / match other ones...

Paul Gilbert: [code comparing speed of apply(z,2,sum) vs. rep(1,10000)%*%z)] which seemed completely contrary to all my childhood teachings.
Douglas Bates: Must have had an interesting childhood if you spent it learning about the speeds of various matrix multiplication techniques.
Paul Gilbert: [...] why is apply so slow?
Brian Ripley: 'so slow' sic: what are you going to do in the 7ms you saved?
   -- Paul Gilbert, Douglas Bates, and Brian D. Ripley (discussing 'the incredible lightness of crossprod')
      R-devel (January 2005)

> 

@pearsonca (Contributor Author), Aug 26, 2024

I have yet to look systematically at any "large" payload cases. For "small" payloads (on the order of 10 characters), profvis indicates that the match.arg lines take as much time as, or more than, the actual .Call itself.

I have checked 1e6-character objects, and there these R-level parts drop to ~10% of the total (instead of >= 2/3).

R/digest.R (outdated)
@@ -54,10 +53,10 @@ digest <- function(object, algo=c("md5", "sha1", "crc32", "sha256", "sha512",
         file <- TRUE # nocov
     }

-    is_streaming_algo <- algo == "spookyhash"
+    is_streaming_algo <- algoint == 9L

@eddelbuettel (Owner), Aug 26, 2024

That's trickier. There is time in code execution, and there is clarity in code. This no longer wins on the second aspect.

(We could be tricky and make it a factor. Explicit, easy to read, and still passing an integer down.)
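
A sketch of the readable-but-still-integer idea (using a named constant rather than an actual factor; the constant name and the stand-in value for algoint are made up here, while 9L is the spookyhash code visible in the diff above):

# Sketch only: keep the intent readable while still comparing integers.
SPOOKYHASH <- 9L                 # spookyhash's integer code, per the diff
algoint    <- 9L                 # stand-in for the value from the algo lookup
is_streaming_algo <- algoint == SPOOKYHASH
is_streaming_algo                # TRUE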

Contributor Author

Can do it as an algo[1] comparison instead?

Contributor Author

(I think the reason the algo <- algo[1] reassignment was slowing things down is that it touches on when R treats function arguments as read-only versus not, and inadvertently triggers a copy operation?)
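
One way to probe that speculation (a sketch; tracemem() needs an R build with memory profiling enabled, which the CRAN binaries have):

# Does `algo <- algo[1]` duplicate the caller's vector, or merely
# allocate a new length-1 vector from the subset?
f <- function(algo = c("md5", "sha1", "crc32")) {
    algo <- algo[1]       # subsetting allocates a fresh length-1 vector
    algo
}
x <- c("md5", "sha1", "crc32")
tracemem(x)               # prints a message if x is ever duplicated
invisible(f(x))           # typically silent: x itself is not copied, but
                          # the subset still allocates on every call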

@pearsonca (Contributor Author)

Alright, having sorted out some "what's installed where, which session are we in" issues:

randomwords <- sample(c(LETTERS, letters, 0:9), 1e6, replace = TRUE)
randomword <- paste0(randomwords, collapse = "")
profvis::profvis(lapply(randomwords, \(w) digest(w, algo = "md5", serialize = FALSE, raw = TRUE)))
profvis::profvis(replicate(1e3, digest(randomword, algo = "md5", serialize = FALSE, raw = TRUE)))

The second one (a lot of really small hashes) is interesting (and this is with this revision; prior to it, the algo-related lines totalled up to roughly what is shown here for errormode).
[Screenshot: profvis output, 2024-08-26 09:47]

The first one (a few big hashes) is not, really: the line translating errormode does show up, but corresponds to <1% of the sampled run time.

@eddelbuettel (Owner), Aug 27, 2024

Sorry, I've been tied up with other things. Now, I may need to mull this over a little more, but I have the feeling that we are making 'mostly light and no heat' here. I am not yet convinced we need a change for performance: I can live with time spent on match.arg() in a perfectly valid and common code idiom. Also, as noted, at least one of the suggested changes reduces code clarity in my eyes.

@pearsonca (Contributor Author), Aug 28, 2024

I'll stand by for further discussion then. From my end, for {digest} to work for my case it needs to be much faster on small hashes (and that seems to require mostly work on the R-level overhead), but there are other approaches I can use (e.g. just wrapping xxh in that package).

BREAK

I do think the errormode version of solving the performance problem could just be pushing the match.arg into the error-checking function, without loss of code clarity. I can do that as a PR, if you like.
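
A sketch of that idea (the helper and its arguments are illustrative, not digest's actual internals): leave errormode alone on the fast path and run match.arg() only inside the helper that actually raises the condition, so the matching cost is paid only when something goes wrong.

# Illustrative only; names need not match digest's internals.
raise_error <- function(msg, errormode = c("stop", "warn", "silent")) {
    errormode <- match.arg(errormode)   # matched only on the error path
    switch(errormode,
           stop   = stop(msg, call. = FALSE),
           warn   = warning(msg, call. = FALSE),
           silent = invisible(NULL))
}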

@eddelbuettel (Owner)

> and that seems to require mostly work on the R-level overhead

I said that earlier in passing: maybe it is worth trying an alternate, no-frills accessor.

The way the vectorised access was added was quite nice; maybe we could think of something similar here without requiring baby-and-bathwater approaches.

> I do think the errormode version of solving the performance problem could just be pushing the match.arg into the error-checking function, without loss of code clarity. I can do that as a PR, if you like.

That has merit too. We would then not spend those microseconds checking an argument we do not use.

In short, let's try to refocus on a few, clear, simple, unambiguous improvements.

Oh and another approach we could consider (and something I now know better than when this started) is to export C-level accessors directly to other packages. Even less overhead.

@pearsonca mentioned this pull request Aug 28, 2024
@pearsonca (Contributor Author)

Alright, I've initiated a trim to the errormode checking, since that's just better.

Re no frills: how about an xxh(...) function along the lines of the sha1() functionality? The potential problem there is that all the sha1(...) functions ultimately pass to digest(), and this would be a different approach.
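
For concreteness, the rough shape such a wrapper might take (purely hypothetical; digest does not export an xxh() today, and the function and argument names here are made up):

# Hypothetical sketch of a thin xxh() convenience wrapper.
xxh <- function(object, bits = 64, ...) {
    algo <- if (bits == 32) "xxhash32" else "xxhash64"
    digest::digest(object, algo = algo, serialize = FALSE, ...)
}
xxh("8675309")

As written it would still inherit digest()'s R-level overhead, which is the same tension noted above for the sha1() family; a genuinely no-frills version would have to sit closer to the C level.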
