Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deprecate window functions in favour of {slide} #143

Closed
earowang opened this issue Aug 15, 2019 · 9 comments
Closed

Deprecate window functions in favour of {slide} #143

earowang opened this issue Aug 15, 2019 · 9 comments
Milestone

Comments

@earowang
Copy link
Member

earowang commented Aug 15, 2019

cc @DavisVaughan

I'd be happy to deprecate window functions in favour of slurrr::slide(). I've had a quick look, and seems that slurrr::slide() replace slide(), tile(), stretch() all together by using different parameters?

You'll probably have future_slide() in mind too? living in slurrr or furrr?

Looks like https://github.com/r-lib/funs also plans to implement this set of functions. But Hadley said no plan in the near future. Does it mean slurrr will not be replaced by funs?

What timeline do you think for submitting to CRAN?

@DavisVaughan
Copy link

Awesome! Thanks for being so understanding here.

slurrr::slide() replace slide(), tile(), stretch() all together by using different parameters?

That is correct! It's all unified under the same function.

You'll probably have future_slide() in mind too?

Eventually yea, after the API settles. Probably not in the first release, and yea itll probably have to live somewhere else. Still thinking through how that would work.

Regarding funs, there will probably be very efficient specific roll_mean() / roll_sum().... etc, but it won't be as generic as slide(). If anything, slide() might eventually get absorbed into purrr after it graduates to a stable API.

For CRAN submission, slurrr now relies on the new dev version of vctrs, so it'll be awhile.

Something that isn't in the package (I just figured out the rough version of it yesterday) is the ability to specify an .index, which I think would solve a lot of real life business problems. The index can be anything, and .before and .after would be anything that can be applied like index - before and index + after (so they could be lubridate periods / durations), or you could supply a lambda for more complex things .before = ~.x %m-% months(1)

# Bucket into 3 month chunks: the current month, 1 month after, 1 month before
set.seed(123)
index <- sort(yearmonth(paste0("2019-", sample(month.abb, 10, TRUE))))
index
#>  [1] "2019 Feb" "2019 Mar" "2019 Mar" "2019 Apr" "2019 May" "2019 Jun"
#>  [7] "2019 Jun" "2019 Sep" "2019 Oct" "2019 Nov"

slide_along_impl(index, ~.x, .index = index, .before = 1, .after = 1)
#> [[1]]
#> [1] "2019 Feb" "2019 Mar" "2019 Mar"
#> 
#> [[2]]
#> [1] "2019 Feb" "2019 Mar" "2019 Mar" "2019 Apr"
#> 
#> [[3]]
#> [1] "2019 Feb" "2019 Mar" "2019 Mar" "2019 Apr"
#> 
#> [[4]]
#> [1] "2019 Mar" "2019 Mar" "2019 Apr" "2019 May"
#> 
#> [[5]]
#> [1] "2019 Apr" "2019 May" "2019 Jun" "2019 Jun"
#> 
#> [[6]]
#> [1] "2019 May" "2019 Jun" "2019 Jun"
#> 
#> [[7]]
#> [1] "2019 May" "2019 Jun" "2019 Jun"
#> 
#> [[8]]
#> [1] "2019 Sep" "2019 Oct"
#> 
#> [[9]]
#> [1] "2019 Sep" "2019 Oct" "2019 Nov"
#> 
#> [[10]]
#> [1] "2019 Oct" "2019 Nov"

# Assume these are like prices of cars, in $
# Bucket them into ranges of: current price, $5000 before, $5000 after
set.seed(123)
index <- sort(rnorm(10, mean = 50000, sd = 5000))
index
#>  [1] 43674.69 46565.74 47197.62 47771.69 48849.11 50352.54 50646.44
#>  [8] 52304.58 57793.54 58575.32

slide_along_impl(index, ~.x, .index = index, .before = 5000, .after = 5000)
#> [[1]]
#> [1] 43674.69 46565.74 47197.62 47771.69
#> 
#> [[2]]
#> [1] 43674.69 46565.74 47197.62 47771.69 48849.11 50352.54 50646.44
#> 
#> [[3]]
#> [1] 43674.69 46565.74 47197.62 47771.69 48849.11 50352.54 50646.44
#> 
#> [[4]]
#> [1] 43674.69 46565.74 47197.62 47771.69 48849.11 50352.54 50646.44 52304.58
#> 
#> [[5]]
#> [1] 46565.74 47197.62 47771.69 48849.11 50352.54 50646.44 52304.58
#> 
#> [[6]]
#> [1] 46565.74 47197.62 47771.69 48849.11 50352.54 50646.44 52304.58
#> 
#> [[7]]
#> [1] 46565.74 47197.62 47771.69 48849.11 50352.54 50646.44 52304.58
#> 
#> [[8]]
#> [1] 47771.69 48849.11 50352.54 50646.44 52304.58
#> 
#> [[9]]
#> [1] 57793.54 58575.32
#> 
#> [[10]]
#> [1] 57793.54 58575.32

@earowang
Copy link
Member Author

Thanks. When I try to install slurrr via remotes::install_github(), I encounter the installation error:

Error: package or namespace load failed for ‘slurrr’ in dyn.load(file, DLLpath = DLLpath, ...):
 unable to load shared object '/Library/Frameworks/R.framework/Versions/3.6/Resources/library/00LOCK-slurrr/00new/slurrr/libs/slurrr.so':
  dlopen(/Library/Frameworks/R.framework/Versions/3.6/Resources/library/00LOCK-slurrr/00new/slurrr/libs/slurrr.so, 6): Symbol not found: _compact_seq
  Referenced from: /Library/Frameworks/R.framework/Versions/3.6/Resources/library/00LOCK-slurrr/00new/slurrr/libs/slurrr.so
  Expected in: flat namespace
 in /Library/Frameworks/R.framework/Versions/3.6/Resources/library/00LOCK-slurrr/00new/slurrr/libs/slurrr.so
Error: loading failed
Execution halted

Do you have any idea how to solve this?

Based on the current slurrr documentation, I find the API a bit confusing, with the .before and .after. For example, how should I set .before and .after to reproduce the tsibble::slide() results below?

tsibble::slide_dbl(1:5, mean, .size = 2)
#> [1]  NA 1.5 2.5 3.5 4.5
tsibble::slide_dbl(1:5, mean, .size = 3, .align = "center")
#> [1] NA  2  3  4 NA

Created on 2019-08-19 by the reprex package (v0.3.0)

@DavisVaughan
Copy link

DavisVaughan commented Aug 19, 2019

Sorry, had something out of sync. Try again?

Also, docs will get better, but here is how I think about the API:

tsibble::slide_dbl(1:5, mean, .size = 2)
#> [1]  NA 1.5 2.5 3.5 4.5

# "1 row before the current row, to 0 rows after the current row"
slurrr::slide_dbl(1:5, mean, .before = 1)
#> [1] 1.0 1.5 2.5 3.5 4.5

# + only complete windows
slurrr::slide_dbl(1:5, mean, .before = 1, .complete = TRUE)
#> [1]  NA 1.5 2.5 3.5 4.5

tsibble::slide_dbl(1:5, mean, .size = 3, .align = "center")
#> [1] NA  2  3  4 NA

# "1 row before the current row, to 1 row after the current row"
slurrr::slide_dbl(1:5, mean, .before = 1, .after = 1, .complete = TRUE)
#> [1] NA  2  3  4 NA

Created on 2019-08-19 by the reprex package (v0.2.1)

It's a fairly direct translation from SQL window functions, so hopefully people familiar with that will instantly understand. After using it for awhile, I find it a bit easier to understand than the alignment argument (and obviously its a bit more flexible). This is a nice intro to the ideas https://dbplyr.tidyverse.org/articles/translation-function.html#window-functions

@earowang
Copy link
Member Author

Thanks. The installation is resolved now.

If "the current row" is rephrased to "the anchor point", .before and .after start making more sense to me. But I don't understand why .complete = TRUE isn't the default. Isn't complete sliding used more often than partial sliding?

Do you have plan for the .fill argument?

tsibble::slide_dbl(1:5, mean, .size = 2, .fill = NULL)
#> [1] 1.5 2.5 3.5 4.5

Perhaps unbounded() is used in the SQL expression, but Inf/-Inf looks more natural in R. Users don't need to remember a new function. Maybe unbounded() doesn't have the sign +/-, which is you're looking for?

I also keep thinking cumulative sliding is to use .after in slide(1:5, ~.x, .after = unbounded()), not .before.

The slurrr::slide() API looks compatible with SQL, you may also like to consider the possibilities for supporting this #109

@DavisVaughan
Copy link

But I don't understand why .complete = TRUE isn't the default. Isn't complete sliding used more often than partial sliding?

I'm still not sure what the right default is. I feel like when I was a practitioner I always was setting partial = TRUE with rollapply(). I can't justify it more than that though. I asked internally and everyone also voted for .complete = FALSE as the default. But one other person on twitter asked for .complete = TRUE.

Do you have plan for the .fill argument?

I think that for slide() and slide_index(), the size of the output should always match .x. So we'd need some alternate API for the effects of .fill = NULL. I just implemented slide_range() locally, which hopefully will be a lower level sliding function that I will also export. If done correctly, I can implement slide_index() on top of it, and it has enough flexibility that some other api could be built on top of it that works more like .fill = NULL. Essentially it lets you specify starts and stops manually, and then the output size is always equal to vec_size(starts) rather than .x
r-lib/slider#11 (comment)

Perhaps unbounded() is used in the SQL expression, but Inf/-Inf looks more natural in R

I thought about this a lot and it essentially came down to the fact that I don't think saying you want "infinite values before the current element" makes much sense (plus yea the +/- thing could be an issue). For example, with .complete = TRUE, you can technically never make a "complete" window if .before = Inf, so the values would always be NULL. With unbounded(), the mental model is slightly different. I know its a bit odd to use a sentinel value like this, but it felt more natural after a few uses.

library(slide)

x <- 1:5

# simulate infinity
# this is fine
slide::slide(x, ~.x, .before = 9999999)
#> [[1]]
#> [1] 1
#> 
#> [[2]]
#> [1] 1 2
#> 
#> [[3]]
#> [1] 1 2 3
#> 
#> [[4]]
#> [1] 1 2 3 4
#> 
#> [[5]]
#> [1] 1 2 3 4 5

# (simulate infinity, you can never create a "complete" window
# with infinite values before the current one)
slide::slide(x, ~.x, .before = 9999999, .complete = TRUE)
#> [[1]]
#> NULL
#> 
#> [[2]]
#> NULL
#> 
#> [[3]]
#> NULL
#> 
#> [[4]]
#> NULL
#> 
#> [[5]]
#> NULL

# unbounded() is more like saying "pin the starting value of the window to the 
# first value of `x`
slide::slide(x, ~.x, .before = unbounded(), .complete = TRUE)
#> [[1]]
#> [1] 1
#> 
#> [[2]]
#> [1] 1 2
#> 
#> [[3]]
#> [1] 1 2 3
#> 
#> [[4]]
#> [1] 1 2 3 4
#> 
#> [[5]]
#> [1] 1 2 3 4 5

Created on 2019-08-23 by the reprex package (v0.2.1)

The SQL bits might come into play with slide_by()

@earowang earowang changed the title Deprecate window functions in favour of slurrr Deprecate window functions in favour of {slide} Aug 24, 2019
@earowang
Copy link
Member Author

Pinging @robjhyndman, can you share your experience here? Which one do you use more often in practice, complete or partial sliding?

Regarding the deprecation process, it will not start until the future_slide() is available. Because it doesn't make sense that slide() is deprecated but not future_slide() in tsibble. As of tsibble v0.9.0, I will add a note in documentation to suggest to use {slide}. It will go through the soft-deprecated -> deprecated -> defunct lifecycle for about a year.

@robjhyndman
Copy link
Member

In tsibble terminology, I use slide and stretch much more than tile. slide for things like moving averages and stretch for time series cross-validation. Probably slide is more common. I find these terms more intuitive than complete and partial sliding.

@earowang
Copy link
Member Author

@robjhyndman The default in tsibble::slide() is complete sliding with .partial = FALSE, but slide::slide() defaults to partial sliding with .complete = FALSE. My question is if you use complete sliding more often than partial sliding.

@robjhyndman
Copy link
Member

robjhyndman commented Aug 25, 2019

I would normally use .partial=FALSE. That is also the safer option and the user can override it if they want to.

@earowang earowang added this to the v0.9.0 milestone Sep 13, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants