Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

by = .EACHI key fix 4603 #4917

Merged
merged 12 commits into from
Apr 13, 2021
2 changes: 2 additions & 0 deletions NEWS.md
Original file line number Diff line number Diff line change
Expand Up @@ -12,6 +12,8 @@

## BUG FIXES

1. `by=.EACHI` when `i` is keyed but `on=` different columns than `i`'s key could create an invalidly keyed result, [#4603](https://github.com/Rdatatable/data.table/issues/4603) [#4911](https://github.com/Rdatatable/data.table/issues/4911). Thanks to @myoung3 and @adamaltmejd for reporting, and @ColeMiller1 for the PR. An invalid key is where a `data.table` is marked as sorted by the key columns but the data is not sorted by those columns, leading to incorrect results from subsequent queries.

## NOTES

1. New feature 29 in v1.12.4 (Oct 2019) introduced zero-copy coercion. Our thinking is that requiring you to get the type right in the case of `0` (type double) vs `0L` (type integer) is too inconvenient for you the user. So such coercions happen in `data.table` automatically without warning. Thanks to zero-copy coercion there is no speed penalty, even when calling `set()` many times in a loop, so there's no speed penalty to warn you about either. However, we believe that assigning a character value such as `"2"` into an integer column is more likely to be a user mistake that you would like to be warned about. The type difference (character vs integer) may be the only clue that you have selected the wrong column, or typed the wrong variable to be assigned to that column. For this reason we view character to numeric-like coercion differently and will warn about it. If it is correct, then the warning is intended to nudge you to wrap the RHS with `as.<type>()` so that it is clear to readers of your code that a coercion from character to that type is intended. For example :
Expand Down
3 changes: 2 additions & 1 deletion R/data.table.R
Original file line number Diff line number Diff line change
Expand Up @@ -1385,7 +1385,8 @@ replace_dot_alias = function(e) {
byval = i
bynames = if (missing(on)) head(key(x),length(leftcols)) else names(on)
allbyvars = NULL
bysameorder = haskey(i) || (is.sorted(f__) && ((roll == FALSE) || length(f__) == 1L)) # Fix for #1010
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed the length(f__)==1L too. That was there, iiuc, from before #3443 when is.sorted() returned false for a single NA. Since is.sorted() is true for all length 1 input, the || length(f__)==1L is superfluous now.
Also moved the roll=FALSE first before && is.sorted(f__) to save the call to is.sorted() when roll==TRUE.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

chmatch not match too. i can fix if you're AFK now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good spot. done.

Copy link
Member

@mattdowle mattdowle Apr 13, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw, I thought we had a isLeadingSubsetOfKey() helper function, or similar name, as we do that op quite a bit. But when I did a few greps I couldn't find anything.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was exactly thinking that we should just have a helper for this by now...

bysameorder = (haskey(i) && identical(leftcols, chmatch(head(key(i),length(leftcols)), names(i)))) || # leftcols leading subset of key(i); see #4917
(roll==FALSE && is.sorted(f__)) # roll==FALSE is fix for #1010
## 'av' correct here ?? *** TO DO ***
xjisvars = intersect(av, names_x[rightcols]) # no "x." for xvars.
# if 'get' is in 'av' use all cols in 'i', fix for bug #34
Expand Down
22 changes: 22 additions & 0 deletions inst/tests/tests.Rraw
Original file line number Diff line number Diff line change
Expand Up @@ -17273,3 +17273,25 @@ if (test_bit64) {
test(2164.3, d[, mean(b, na.rm=TRUE), by=a], data.table(a=INT(1,2), V1=c(2.5, 4)))
}

# invalid key when by=.EACHI, haskey(i) but on= non-leading-subset of i's key, #4603 #4911
X = data.table(id = c(6456372L, 6456372L, 6456372L, 6456372L,6456372L, 6456372L, 6456372L, 6456372L, 6456372L, 6456372L, 6456372L, 6456372L, 6456372L, 6456372L),
id_round = c(197801L, 199405L, 199501L, 197901L, 197905L, 198001L, 198005L, 198101L, 198105L, 198201L, 198205L, 198301L, 198305L, 198401L),
field = c(NA, NA, NA, "medicine", "medicine", "medicine", "medicine", "medicine", "medicine", "medicine", "medicine", "medicine", "medicine", "medicine"),
key = "id")
Y = data.table(id = c(6456372L, 6456345L, 6456356L),
id_round = c(197705L, 197905L, 201705L),
field = c("medicine", "teaching", "health"),
prio = c(6L, 1L, 10L),
key = c("id_round", "id", "prio", "field" ))
test(2165.1, X[Y, on = .(id, id_round > id_round, field), .(x.id_round[1], i.id_round[1]), by=.EACHI][id==6456372L],
data.table(id=6456372L, id_round=197705L, field='medicine', V1=197901L, V2=197705L))
# Y$id_round happens to be sorted, so in 2165.2 we test Y$field which is not sorted
test(2165.2, X[Y, on="field", .(x.id_round[1]), by=.EACHI][field=="health"],
data.table(field="health", V1=NA_integer_))
# a minimal example too ...
X = data.table(A=c(4L,2L,3L), B=1:3, key="A")
Y = data.table(A=2:1, B=2:3, key=c("B","A"))
test(2165.3, X[Y], data.table(A=2:3, B=2:3, i.A=2:1, key="A")) # keyed
test(2165.4, X[Y, on=.(A)], data.table(A=2:1, B=c(2L,NA), i.B=2:3)) # no key
test(2165.5, X[Y, on=.(A), x.B, by=.EACHI], data.table(A=2:1, x.B=c(2L,NA))) # no key