Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Join using merge #45

Merged
merged 11 commits into from
May 11, 2017
Merged

Join using merge #45

merged 11 commits into from
May 11, 2017

Conversation

christophsax
Copy link
Contributor

@christophsax christophsax commented May 10, 2017

In data.table v1.9.6, merge.data.table has gained a by.x and by.y argument. Also it does not copy anymore, and does not need a key, so using merge to perform joins should be more efficient and more convenient.

From ?merge.data.table:

In versions <= v1.9.4, if the specified columns in by was not the key (or head of the key) of x or y, then a copy is first rekeyed prior to performing the merge. This was less performant and memory inefficient. The concept of secondary keys (implemented in v1.9.4) was used to overcome this limitation from v1.9.6+. No deep copies are made anymore and therefore very performant and memory efficient. Also there is better control for providing the columns to merge on with the help of newly implemented by.x and by.y arguments.

Using column renaming and the on = argument for non-merge joins, semi_join.data.table and anti_join.data.table, saves the need for setkey and copying.

This should fix #20, #21.

@krlmlr, we checked this out together, you may have another look. I had to remove some with = FALSE, which only work for the j, not for the i argument. But also without with = FALSE, the current solution works if a column is named y_labels, so we should be fine. Also nomatch = 0L cannot be used in an anti join.

Copy link
Member

@krlmlr krlmlr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

I have seen you opened two PRs that are based on this PR. This will be difficult to handle. You could add these commits to this PR's branch and push, or wait for this to be merged.

R/joins.R Outdated
if (!identical(by$x, by$y)) {
stop("Data table joins must be on same key", call. = FALSE)
}
y <- dplyr::auto_copy(x, y, copy = copy)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might want to keep this auto_copy() call.

R/joins.R Outdated
w <- unique(x[y, which = TRUE, allow.cartesian = TRUE])
w <- w[!is.na(w)]
x[w]
y <- as.data.table(y)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

auto_copy() should take care of that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without copy = TRUE we still would get:

> library(dplyr)
library(dplyr)
dt <- data.table::data.table(x = 1:3, y = 3:1)
df <- data.frame(x = 3:1, z = 1:3)
anti_join(dt, df)
Joining, by = "x"
Error in `[.data.frame`(y, , by_y, with = FALSE) : 
  unused argument (with = FALSE)

This could be avoided with y <- as.data.table(y)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But wouldn't #13 also take care of that case? I think it's better to have automatic coercion from data frame to data table under our control, and perhaps give a message.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough, will remove it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That means #14 is not yet fixed but will only be if #13 is done.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's reasonable.

R/joins.R Outdated
w <- unique(x[y, which = TRUE, allow.cartesian = TRUE])
w <- w[!is.na(w)]
x[w]
y <- as.data.table(y)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we could refactor the common parts in a function?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps with separate templates for merge and non-merge joins? As here?

For the suffix argument, we need to treat them differently anyway.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I was thinking about a simple function that computes filter_y.

Templating: We could also do something like

join_dt <- function(op, suffix = TRUE) {
  if (suffix) suffix_op <- ...
  else suffix_op <- NULL
  template <- substitute({
    ...
    suffix_op
    ...
    op
    ...
  })
  f <- ...
  if (!suffix) formals(f) <- formals(f)[names(formals(f)) != "suffix"]
  ...
}

Both solutions are feasible, but really look too complicated for the task at hand. In the end it seems simpler to admin some duplication and get rid of the templating mechanism. But this doesn't have to be in that PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Scratch that, suffix_op affects op. The conclusion still holds -- let's expand the templates.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My last comment may have been ambiguous, sorry for that. I was thinking about expanding the templates to six simple functions that have a few lines of code in common.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, done, I also like it better that way. Also added a small test.

@christophsax
Copy link
Contributor Author

Closed the two other PR and will open again if this is merged.

Copy link
Member

@krlmlr krlmlr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, looks much simpler now without the templates.

R/joins.R Outdated

#' @rdname join.tbl_dt
anti_join.data.table <- join_dt({x[!y, allow.cartesian = TRUE]})
full_join.data.table <- function(x, y, by = NULL, copy = FALSE, ...){
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please move the full join above the semi join? I think this is the order we use elsewhere, too.

R/joins.R Outdated
y <- dplyr::auto_copy(x, y, copy = copy)
by_x <- by$x
by_y <- by$y
y_filter <- y[, by_y, with = FALSE]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the first comma still necessary with data.table?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without it, it would take it for the i, not the j argument. So I guess, yes.

R/joins.R Outdated
by_x <- by$x
by_y <- by$y
y_filter <- y[, by_y, with = FALSE]
names(y_filter) <- by_x
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps y_trimmed <- trim_y_for_semi_join(y, by), or even a better verb than "trim", instead of y_filter? This removes duplication and adds readability IMO.

Copy link
Member

@krlmlr krlmlr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. I wouldn't mind adding tests to this PR, so that we're sure it works as expected.

@lionel-: Could you please take a look, too?

R/joins.R Outdated
by <- dplyr::common_by(by, x, y)
y <- dplyr::auto_copy(x, y, copy = copy)
by_x <- by$x
by_x <- by$x
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please double-check, also if we could perhaps write by$x instead? The docs for the on argument suggest that we could also avoid renaming the columns in y altogether:

on <- rlang::set_names(by$y, by$x)
w <- x[!y, which = TRUE, on = on]
...

Would that work with data.table 1.9.6?

dplyr already imports rlang, so no extra dependency added, but we need to explicitly declare the import in DESCRIPTION to be able to use it here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can also use @import rlang and call rlang functions unqualified. I have started on a tidyevalish implementation (not fully tidy because we'll need to flatten quosures) a couple weeks ago, so we'll import it in the future anyway.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, indeed. I wasn't aware of that. And it was already introduced in data.table 1.9.6.

Since using by$y in trim_y_for_semi_join can be used as well, I removed the one line function altogether.

Also added @import rlang and called set_names unqualified.

@krlmlr krlmlr requested a review from lionel- May 11, 2017 06:33
@lionel-
Copy link
Member

lionel- commented May 11, 2017

Thanks, looks much simpler now without the templates.

Agreed. Note that in cases where it makes sense, you can now do:

expr <- quote(a <- b)
exprs <- list(quote(foo(a)), quote(foo(b)))

fn <- expr_interp(function(...) {
  !! expr
  !!! exprs
})

Copy link
Member

@krlmlr krlmlr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, thanks! Ready to merge if you could please add a bullet to NEWS.

@krlmlr krlmlr merged commit 6c81a9a into tidyverse:master May 11, 2017
@krlmlr
Copy link
Member

krlmlr commented May 11, 2017

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Join with different keys
3 participants