"nomatch is ignored with :=" warning is misleading #2180

Open
rballentine opened this issue May 25, 2017 · 4 comments

Comments

@rballentine

Starting in v1.9.8 this warning appears when using := with nomatch=0:

nomatch isn't relevant together with :=, ignoring nomatch

However, when I removed all the nomatch=0 from := calls in my code it caused me some issues. So I thought it was worth pointing out that the warning is not strictly true (nomatch might still be relevant in := calls; nomatch is not ignored). An illustrative example run on v1.10.4:

DT<-data.table(x=2:5)
X<-data.table(x=1:4)

DT[X, on='x', y:={ print(length(x)) ; x }]
# [1] 4

DT[X, on='x', y:={ print(length(x)) ; x }, nomatch=0]
# [1] 3
# Warning message:
# In `[.data.table`(DT, X, on = "x", `:=`(y, { :
#   nomatch isn't relevant together with :=, ignoring nomatch

Above, the resulting data.table is the same, but nomatch controls whether the elimination of unmatched rows occurs in j or only on assignment. This can be relevant: it caused problems in my code when I used vectors in j that weren't in either data.table (x or i). For example:

DT[X, on='x', y:=1:3 + x, nomatch=0][]
# assume this is desired behavior
#    x  y
# 1: 2  3
# 2: 3  5
# 3: 4  7
# 4: 5 NA
# Warning message:
# In `[.data.table`(DT, X, on = "x", `:=`(y, 1:3 + x), nomatch = 0) :
#   nomatch isn't relevant together with :=, ignoring nomatch

DT[X, on='x', y:=1:3 + x][]
# results are incorrect without nomatch=0
#    x  y
# 1: 2  4
# 2: 3  6
# 3: 4  5
# 4: 5 NA
# Warning message:
# In 1:3 + x :
#   longer object length is not a multiple of shorter object length

Thus there remains a use case for nomatch=0 with :=, and I don't think users should be told that it is not relevant (and they definitely shouldn't be told that it is ignored). The warning could lead to hard-to-find problems if people assume, as I did, that removing nomatch=0 from := calls in their code has no repercussions.
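
The two results above are consistent with plain vector recycling in base R. The sketch below is only an illustration: it assumes, based on the lengths 4 and 3 printed earlier, that j sees all four values of x without nomatch=0 and only the three matched values with it. The variable names are made up for this example.

```r
# Values of x visible to j, as implied by the printed lengths above (assumption).
x_all     <- 1:4   # without nomatch=0: every row of X, including the unmatched x = 1
x_matched <- 2:4   # with nomatch=0: only the rows of X that match DT

# With nomatch=0 the lengths agree, so the addition is element-wise.
with_nomatch <- 1:3 + x_matched

# Without nomatch=0, 1:3 is recycled against four values (R warns about this).
without_nomatch <- suppressWarnings(1:3 + x_all)

# The first position belongs to the unmatched row, so only the rest is assigned.
print(with_nomatch)          # 3 5 7
print(without_nomatch[-1])   # 4 6 5
```

The second vector is shifted by one position relative to the intended values, which is why the column looks plausible but is wrong.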

@franknarf1
Contributor

Do you have a better example?

Using x as an on= column in an equi-join and as the critical part of j makes no sense, I guess. Even if it did make sense in some similar case (maybe in a non-equi join?), you would have to use i.x or x.x for me to believe that your computational steps are deliberate and not just coincidentally giving the desired results...

I'm just chiming in because I like seeing that warning and don't see any convincing reason to scrap it (by taking out the key "not relevant" and "ignoring" parts). Maybe something could be added to it, but it's not clear to me what that might be.

@rballentine
Author

Sure, that was just a minimal reproducible example of the difference between the two; I agree it doesn't make any sense. Here are three examples of what I imagine are real use-cases.

DT <- data.table(x=1:3, key='x')
X <- data.table(x=1:100, y=1:100, key='x')

# extract some dataset from the overlap between two tables
values <- DT[X, y, nomatch=0]
# do something (assume for some reason I can't or don't want to do this inside j)
values <- values + 1

# this is incorrect (:= expects 100 values, only got 3)
DT[X, z:=values]
# Warning message:
# In `[.data.table`(DT, X, `:=`(z, values)) :
#   Supplied 3 items to be assigned to 100 items of column 'z' (recycled leaving remainder of 1 items).

# this is correct (now := expects 3 values)
DT[X, z:=values, nomatch=0]
# Warning message:
# In `[.data.table`(DT, X, `:=`(z, values), nomatch = 0) :
#   nomatch isn't relevant together with :=, ignoring nomatch

Unnecessary cycles:

longRunningFunc<-function(i) { print(paste('long-running process on', length(i), 'values')) ; i+1 }

# unnecessary extra calculation since only interested in subset of result
DT[X, z:=longRunningFunc(y)]
# [1] "long-running process on 100 values"

# same result, less time
DT[X, z:=longRunningFunc(y), nomatch=0]
# [1] "long-running process on 3 values"
# Warning message:
# In `[.data.table`(DT, X, `:=`(z, longRunningFunc(y)), nomatch = 0) :
#   nomatch isn't relevant together with :=, ignoring nomatch

Dangerous hidden bug:

# different results even though the warning says there is no difference between the two
DT[X, z:=mean(y)]
DT[X, z2:=mean(y), nomatch=0][]
#    x    z z2
# 1: 1 50.5  2
# 2: 2 50.5  2
# 3: 3 50.5  2
# Warning message:
# In `[.data.table`(DT, X, `:=`(z2, mean(y)), nomatch = 0) :
#   nomatch isn't relevant together with :=, ignoring nomatch

@franknarf1
Contributor

franknarf1 commented May 31, 2017

@rballentine Ok, I get it now, thanks.

I would lean more towards saying nomatch should take effect before j is computed, so your "unnecessary cycles" and "hidden bug" problems would go away (rather than simply changing the "nomatch/:= irrelevant" message to alert the user). My reasoning is: a data.table call DT[i, j, by] should be readable as "select rows with i; group by by; then do j" (per the front-page stuff in the wiki) and I think nomatch is conceptually part of the i step.
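
Until something like that is implemented, one way to get the "restrict in i first, then do j" reading without passing nomatch to a := call is to filter the i table beforehand. This is a minimal sketch reusing the DT and X from the examples above; X_matched is a name introduced here for illustration, and the %in% filter is my own choice, not something data.table requires.

```r
library(data.table)

DT <- data.table(x = 1:3, key = "x")
X  <- data.table(x = 1:100, y = 1:100, key = "x")

# Keep only the rows of X whose key also appears in DT (a semi-join on x).
X_matched <- X[x %in% DT$x]

# j now sees only the three matched rows, so no nomatch argument is needed.
DT[X_matched, on = "x", z := mean(i.y)]

print(DT)   # z is 2 for every row, not 50.5
```

The cost is an extra pass over X, but the := call no longer depends on how nomatch interacts with it.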

Side note:

DT[X, z := values_from_X ]

is a risky operation, since when there are multiple matches of X to a single row of DT, only one match can be selected (apparently the final match is used right now). For example:

DT2 = data.table(id = 1L)
X2 = data.table(id = 1L, v = 1:2)
values2 = DT2[X2, on=.(id), i.v, nomatch=0]
newvalues2 = values2 + 1L
DT2[X2, on=.(id), nomatch = 0, v := newvalues2][]
#    id v
# 1:  1 3

or the same thing is seen when it's all done inside j: DT2[X2, on=.(id), nomatch = 0, v2 := i.v + 1L ][]
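
A hedged sketch of one way to make that choice explicit: check for keys with more than one row in the i table, then collapse them before the update join. Taking max(v) is purely an assumption for illustration; the right aggregate depends on the data. dup_keys and X2_agg are names introduced here.

```r
library(data.table)

DT2 <- data.table(id = 1L)
X2  <- data.table(id = 1L, v = 1:2)

# Keys that occur more than once in X2 would force := to pick a single match.
dup_keys <- X2[, .N, by = id][N > 1L]
print(dup_keys)

# Collapse to one row per key with an explicit rule (max is an arbitrary choice here).
X2_agg <- X2[, .(v = max(v)), by = id]

# With one row per key, the update join no longer depends on match order.
DT2[X2_agg, on = .(id), v := i.v + 1L]
print(DT2)
```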

@ProfFancyPants

I am not sure whether this relates to the topic, but I have been confused about why adding the nomatch argument to an anti-join produces errors instead of warnings. Examples:

DT <- data.table(x = -1:3, key = 'x')
X <- data.table(x = 1:100, y = 1:100, key = 'x')
DT[X]
DT[ X, y := y][]
DT[!X, y := y]
DT[!X, y := -1][]
DT[!X, nomatch = 0] ## 1) To warning
DT[!X, nomatch = NA] ## 2) To warning

In situation (1), isn't nomatch = 0 at worst merely redundant? Additionally, is there an argument to be made that in situation (2) nomatch = NA should actually return all the matched 'X' within 'DT', without any joined columns from 'X'? Really, my only issue is with (1), because it slows down interactive programming.

@jangorecki jangorecki added this to the 1.12.9 milestone Apr 6, 2020
@jangorecki jangorecki self-assigned this May 27, 2020
@mattdowle mattdowle modified the milestones: 1.13.1, 1.13.3 Oct 17, 2020
@jangorecki jangorecki modified the milestones: 1.14.3, 1.14.5 Jul 19, 2022
@jangorecki jangorecki modified the milestones: 1.14.11, 1.15.1 Oct 29, 2023
@jangorecki jangorecki removed their assignment Nov 6, 2023
@jangorecki jangorecki removed this from the 1.16.0 milestone Nov 6, 2023