Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

when grouping, use of keyby results in sorted output, inconsistent with dplyr #178

Closed
myoung3 opened this issue Feb 2, 2021 · 2 comments

Comments

@myoung3
Copy link

myoung3 commented Feb 2, 2021

Your use of keyby instead of by is resulting an index vector which is sorted by x. But it's not what you want, since dplyr doesn't sort by it's grouping columns.

library(testthat)
library(dtplyr)
library(dplyr,warn.conflicts=FALSE)
library(data.table,warn.conflicts=FALSE)
test_that("can filter when grouped, x not sorted", {
  dt1 <- lazy_dt(data.table(x = c(3,3,1, 2,1, 2), y = c(3,0,1, 2, 2, 4)), "DT")
  dt2 <- dt1 %>% group_by(x) %>% filter(sum(y) == 3) %>% as_tibble()
  dt3 <- dt1 %>% as_tibble() %>% group_by(x) %>% filter(sum(y) == 3) %>% ungroup() 
  expect_equal(dt3, tibble(x = c(3, 3, 1 ,1), y = c(3, 0, 1, 2)))
  expect_equal(dt2, tibble(x = c(3, 3, 1, 1), y = c(3, 0, 1, 2)))
})
#> Error: Test failed: 'can filter when grouped'
#> * <text>:10: `dt2` not equal to tibble(x = c(3, 3, 1, 1), y = c(3, 0, 1, 2)).
#> Component "x": Mean relative difference: 1
#> Component "y": Mean relative difference: 1.333333

Consider:

library(dtplyr)
library(dplyr,warn.conflicts=FALSE)
library(data.table,warn.conflicts=FALSE)

dt1 <- lazy_dt(data.table(x = c(3,3,1, 2,1, 2), y = c(3,0,1, 2, 2, 4)), "DT")
dt2 <- dt1 %>% group_by(x) %>% filter(sum(y) == 3) 
show_query(dt2)
#> DT[DT[, .I[sum(y) == 3], keyby = .(x)]$V1]

dt1 <- as.data.table(dt1)
dt1[, .I[sum(y)==3], keyby="x"]$V1
#> [1] 3 5 1 2
dt1[, .I[sum(y)==3], by="x"]$V1
#> [1] 1 2 3 5

Originally posted by @myoung3 in #177 (comment)

@myoung3

This comment has been minimized.

@hadley
Copy link
Member

hadley commented Feb 2, 2021

Ah yes, I need to use keyby only when summarising.

(BTW I rarely find session info useful so there's no need to include it unless specifically requested)

@hadley hadley closed this as completed in 0090f84 Feb 2, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants