Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fcase / case_when function for data.table #4021

Merged
merged 20 commits into from
Dec 19, 2019
Merged

fcase / case_when function for data.table #4021

merged 20 commits into from
Dec 19, 2019

Conversation

2005m
Copy link
Contributor

@2005m 2005m commented Nov 1, 2019

Closes #3823
I just wanted to share my code.
If you think it is worth it, I can move the code to a branch so you can work on it?

@codecov
Copy link

codecov bot commented Nov 1, 2019

Codecov Report

Merging #4021 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #4021      +/-   ##
==========================================
+ Coverage   99.41%   99.41%   +<.01%     
==========================================
  Files          72       72              
  Lines       13769    13904     +135     
==========================================
+ Hits        13688    13823     +135     
  Misses         81       81
Impacted Files Coverage Δ
src/init.c 100% <ø> (ø) ⬆️
R/wrappers.R 100% <100%> (ø) ⬆️
src/fifelse.c 100% <100%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 51805cd...3c3ea40. Read the comment docs.

NEWS.md Outdated Show resolved Hide resolved
NEWS.md Show resolved Hide resolved
man/fcase.Rd Outdated Show resolved Hide resolved
src/init.c Outdated Show resolved Hide resolved
@jangorecki
Copy link
Member

to not leave it for too late I would encourage to think how to make evaluation of when lazy, so fcase will actually bring something new to the table. In base R there is switch to do that but it is not vectorized.

switch(1, "a", stop("a"))
#[1] "a"

@jangorecki jangorecki added the WIP label Nov 3, 2019
@2005m
Copy link
Contributor Author

2005m commented Nov 3, 2019

I also wanted to explain how the algorithm works because maybe there is a better way to do it?

  • The i for loop goes through the when and value arguments. The j for loop goes through all the values of when that were not TRUE in the previous i loops. The j loops must be smaller at each iteration of i. To do so, I use a pointer p which is nothing else than a vector of position (see vector tracker). The problem with this approach is that I am not sure we can "parallelize" the j loop....

  • Another approach might be to create a matrix containing all the logical condition and a matrix containing all the output values. Then we can loop through the logical matrix and when a TRUE value is found we assign the corresponding value from the "value" matrix. The problem here is memory usage. The logical and value matrix can become huge if we have large vectors and a lot of conditions.

  • We can simply use the nested if else approach, which is equivalent to adding a loop on top of fifelse.

I hope it makes sense. If not, excuse my French! :-)

@2005m
Copy link
Contributor Author

2005m commented Nov 3, 2019

Thinking out loud. Another way could be to split the when vector in n equal part, (n being the number of core) and process each part in parallel using the current fcaseR c function. That should speed up the process in case of large vectors. I think I will try this approach but first I will need to write a c function to split a vector in equal parts or maybe I will use the split R function as a first step.

@2005m
Copy link
Contributor Author

2005m commented Nov 5, 2019

@jangorecki , I think I will probably have to change the code to make fcase lazy. The only way I see to do that would be to use list(...) and .Call. I am not sure to understand how switch manage to achieve lazy evaluation in C. Do you know? Is it because it is a primitive? Do you see any other option?

@jangorecki
Copy link
Member

If there will be no other option, there is always possibility to substitute dots and pass that together with the evaluation frame.

@2005m
Copy link
Contributor Author

2005m commented Dec 12, 2019

@jangorecki , please see if you are happy with the lazy evaluation. See test 2124.72. If you need anything else please let me know. Thank you.

Copy link
Member

@jangorecki jangorecki left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Formatting comments for now. Lazy evaluation approach you tried looks very promissing. Any idea how it compares in speed to the previous one?

man/fcase.Rd Outdated Show resolved Hide resolved
R/wrappers.R Outdated Show resolved Hide resolved
@2005m
Copy link
Contributor Author

2005m commented Dec 13, 2019

To answer your question about speed, lazy evaluation produces the same speed. You will note that I have added a break at the end of the first loop which prevents the evaluation of further expressions in case all when conditions have already been matched with a value. This part might speed up a little bit the computation in case the user has passed too many unnecessary when conditions. Finally I have no idea how to apply openmp to this function. I think it is possible to go below the second for 1GB vector in very specific cases which are equivalent to fswitch(x, when, value).

NEWS.md Outdated Show resolved Hide resolved
NEWS.md Outdated Show resolved Hide resolved
src/fifelse.c Outdated Show resolved Hide resolved
src/fifelse.c Outdated Show resolved Hide resolved
src/fifelse.c Outdated Show resolved Hide resolved
src/fifelse.c Outdated Show resolved Hide resolved
NEWS.md Outdated Show resolved Hide resolved
man/fcase.Rd Outdated Show resolved Hide resolved
src/fifelse.c Outdated Show resolved Hide resolved
@jangorecki
Copy link
Member

jangorecki commented Dec 13, 2019

Thanks for all adjustments. The more difficult part of the review left, to investigate if tests cover all necessary cases (and eventually confirm that all their output is expected).
Tagging PR for coming milestone, PR is mature enough to resolve it soon.

@jangorecki jangorecki removed the WIP label Dec 13, 2019
@jangorecki jangorecki added this to the 1.12.9 milestone Dec 13, 2019
@jangorecki
Copy link
Member

Any idea if it is do-able to make default argument also lazy? probably not something easy because we early rely on its value in C code. I can imagine user may want to raise error message when none of the conditions have been satisfied (as requested before in #940). If you don't have any good idea on doing default lazy then probably best to skip this now.

@2005m
Copy link
Contributor Author

2005m commented Dec 13, 2019

We can do it but as you said we use default quite early in the C function, so making it lazy would not add much value in my view as opposed to ... for which we skip unnecessary when/value cases. Please let me know if you have other questions and need me to change something.

@2005m
Copy link
Contributor Author

2005m commented Dec 17, 2019

Some further benchmark:

x = sample(1:60,3e8, replace= TRUE) # 1.1GB

# Unit: seconds
#           expr    min     lq     mean   median      uq     max    neval
# case_when       53.05  55.03    57.93    55.72   59.71   66.11        5
# nested_fifelse   9.60  10.77    13.26    10.99   11.12   23.79        5 # setDTthread(1L)
# nested_fifelse   6.76   7.61     7.63     7.69    7.83    8.26        5 # setDTthread(2L)
# nested_fifelse   6.10   6.20     6.46     6.48    6.72    6.81        5 # setDTthread(4L)
# fcase            6.94   7.06     7.43     7.55    7.59    8.02        5

man/fcase.Rd Outdated
}
\arguments{
\item{...}{ A sequence consisting of logical condition (\code{when})-resulting value (\code{value}) \emph{pairs} in the following order \code{when1, value1, when2, value2, ..., whenN, valueN}. Logical conditions \code{when1, when2, ..., whenN} must all have the same length, type and attributes. Each \code{value} may either share length with \code{when} or be length 1. Please see Examples section for further details.}
\item{default}{ Default return value, \code{NA} by default, for when all of the logical conditions \code{when1, when2, ..., whenN} are \code{FALSE} for some entries. }
Copy link
Member

@MichaelChirico MichaelChirico Dec 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when all of the logical conditions when1, when2, ..., whenN are FALSE

FALSE or missing, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right

@jangorecki
Copy link
Member

jangorecki commented Dec 18, 2019

when putting benchmark you can also put a optimistic scenario benchmark (here on github, not in the code) where lazyness feature can be observed (in timings rather than just error).

\seealso{
\code{\link{fifelse}}
}
\examples{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would be helpful to have an example where the order of conditions matters

Copy link
Contributor Author

@2005m 2005m Dec 18, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure to understand. The order of conditions always matters (if I am not mistaken)

@MichaelChirico
Copy link
Member

Looks great.

I guess a "cheating" way to show utility of laziness is to have one of the conditions/values set to function() {Sys.sleep(5); rnorm(n)}. Sleep should not happen under laziness

@2005m
Copy link
Contributor Author

2005m commented Dec 18, 2019

@jangorecki, @MichaelChirico , please see below benchmark with lazyness in action:

setDTthreads(4L)
x = sample(1:30, 3e7, replace = TRUE) # 114 MB

microbenchmark::microbenchmark(
  nested_fifelse = fifelse(x < 10L, 0L,
  fifelse(x < 20L, 10L,
  fifelse(x < 30L, 20L,
  fifelse(x < 40L, 30L,
  fifelse(x < 50L, 40L,
  fifelse(x < 60L, 50L,
  fifelse(x > 60L, 60L, NA_integer_
  ))))))),
  fcase = fcase(
    x < 10L, 0L,
    x < 20L, 10L,
    x < 30L, 20L,
    x < 40L, 30L,
    x < 50L, 40L,
    x < 60L, 50L,
    x > 60L, 60L
  ),
  fcase2 = fcase(
    x < 10L, 0L,
    x < 20L, 10L,
    x < 30L, 20L,
    x < 40L, 30L
  ),
  times = 5L,
  unit = "s")
# Unit: seconds
#           expr       min        lq      mean   median       uq      max neval
# nested_fifelse 1.4431070 1.5588002 1.5712342 1.581573 1.595763 1.676928     5
# fcase          0.9279330 0.9678166 0.9846871 0.978591 1.006174 1.042921     5
# fcase2         0.9164689 1.1123252 1.1671696 1.263177 1.265469 1.278408     5

Let me know if you need anything else. tks.

@mattdowle
Copy link
Member

Awesome work everyone!

@mattdowle mattdowle merged commit 5742d79 into Rdatatable:master Dec 19, 2019
@mattdowle mattdowle deleted the myfcase branch December 19, 2019 03:15
@2005m
Copy link
Contributor Author

2005m commented Dec 19, 2019

Thank you. Please let me know your view with regards to issue 4114. I also sent a message to the team with regards to issue 1336. Not sure if you received it. I am interested to look into this file backed features but I Ieed someone to tell me exactly what is required. Happy to work with anyone here! In the meantime I wish you all a happy festive end of year.

@mattdowle
Copy link
Member

mattdowle commented Dec 20, 2019

Seasonal greetings to you too! I just replied on 4114. On 1336, yes that's probably the highest priority after >2bn rows. I think we're all interested in working on it and somebody just needs to make a start. My approach was to do >2bn first as a way to prepare for 1336. Since there's not much point in having 1336 if it can't do >2bn rows. We haven't tried to work together from the start on new features before; it probably needs one person to take the lead by proposing specific code and a plan.

@jangorecki jangorecki modified the milestones: 1.12.11, 1.12.9 May 26, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

fcase / case_when function for data.table
4 participants