Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Melt corrupts data by using column position instead of name #3487

Closed
jsams opened this issue Apr 2, 2019 · 6 comments
Closed

Melt corrupts data by using column position instead of name #3487

jsams opened this issue Apr 2, 2019 · 6 comments

Comments

@jsams
Copy link
Contributor

jsams commented Apr 2, 2019

I have a data.table with an occasionally inconsistent order to the columns of the data. Consider a dataset where users a purchasing one of three brands or the "0th" brand, and for simplicity, keep prices constant.

df = data.table(id=1:5,
                brand0=c(0,0,1,0,0),
                brand1=c(1,0,0,0,1),
                brand2=c(0,1,0,0,1),
                brand3=c(0,0,0,1,0),
                price1=rep(0.8, 5),
                price2=rep(0.7, 5),
                price3=rep(0.9, 5),
                price0=0)

melt should be using my variable name indexes for the variable.name, but it seems to use the position they appear in the column order. This is simply an error and results in subtle and silent corruption of data:

dl = melt(df, id.vars=c("id"),
          measure.vars=patterns(purchased="^brand", price="^price"),
          variable.name="brand")
setkey(dl, id, brand)

Notice the brands are now indexed from 1-4 instead of 0-3 as in the original data, annoying, but fixable. However, the columns are mixed and matched based on the order they appear and not based on the names of the columns. Thus, it indicates the first user is purchasing brand 2 at 0.7 instead of brand 1 at 0.8.

re-ordering the columns appears to fix things:

setcolorder(df, c("id", sprintf("brand%d", 0:3), sprintf("price%d", 0:3)))
dl = melt(df, id.vars=c("id"),
          measure.vars=patterns(purchased="^brand", price="^price"),
          variable.name="brand", variable.factor=F)
setkey(dl, id, brand)
dl[, brand := as.integer(brand) - 1]

I'm running R 3.5 and data.table 1.12:

R version 3.5.3 (2019-03-11)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.10

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.3.3.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] mlogit_0.4-1      lmtest_0.9-36     Formula_1.2-3     zoo_1.8-5         foreign_0.8-71    data.table_1.12.0

loaded via a namespace (and not attached):
 [1] lattice_0.20-38 MASS_7.3-51.1   grid_3.5.3      magrittr_1.5    bibtex_0.4.2    Rdpack_0.10-1   stringi_1.4.3   statmod_1.4.30  tools_3.5.3     stringr_1.4.0   compiler_3.5.3 
[12] gbRd_0.4-11    
@franknarf1
Copy link
Contributor

If you want the columns linked up in a particular order, the simplest option for now might be the list syntax:

melt(df, meas = list(purchased = paste0("brand", 0:3), price = paste0("price", 0:3)))

There's a comment here about making sure names match up, btw: #3396 (comment)

Regarding starting from 0 vs 1 in the variable column, there's an open issue about passing custom values #2551

@MichaelChirico
Copy link
Member

Thanks @franknarf1, I think that covers it. @jsams please follow up on the open issues (even reacting can help), or re-open this issue with some clarification if you disagree this is a duplicate. Thanks!

@jsams
Copy link
Contributor Author

jsams commented Apr 3, 2019

I don't understand why this was closed. This is unambiguously incorrect behavior. I appreciate that there is an alternative way to get the correct output, and I had already noted a different workaround in the original report. That's not the point. The point is that there is data corruption due to mis-implementation of melt/patterns

Thus, it indicates the first user is purchasing brand 2 at 0.7 instead of brand 1 at 0.8.

The referenced feature request is related to naming, but the central point here is that data is being corrupted. A relatively high priority data corruption bug versus a convenience feature seem sufficiently different to warrant keeping this open, or upgrading the other to "bug" in recognition that this function is not behaving appropriately.

@MichaelChirico
Copy link
Member

It was closed because it's a duplicate. From #3396 (comment) :

@HughParsonage patterns doesn't implicitly match on column names.. it just resolves columns into a list of column names/indices for every pattern provided. So I wouldn't classify it as a bug.

What you require could be achieved by extending the functionality of patterns (or even another function), and I think it's extremely useful too..

@jsams
Copy link
Contributor Author

jsams commented Apr 3, 2019

I was taking the other issue as the mapping of the variable names annoyance. If you want to treat it as a duplicate, that's fine, but the other should be raised to a bug. Having a dev say "oh, the function ignores variable names and just uses the indices, so it's not a bug" does not make it not a bug, it must bely that the dev must not have understood the implications of what they were saying.

Quoting Arun from the other report:

There's no reason why patterns should automatically match measure vars automatically.

This is corrupting data. That should absolutely be enough reason.

To be clear if it wasn't in the OP, if I had a columns: id, brand1, brand2, price2, price1, this is mapping brand1 -> price2 and brand2 -> price1. There's no user who wants that to happen. That is a logically incorrect thing to do.

I'm fine with, if sometimes annoyed by, missing features. However, corrupting my data is not ok. If patterns() doesn't work with melt and it's too much work to fix it, then that's completely understandable.* Pull it. But don't leave around functionality that corrupts data and hope people notice it. Then I stop trusting the software.

* First thought suggestion on making it feasible is to require a grouping expression in the arguments of patterns. e.g. above would be "brand(.*)" and "price(.*)", then the matching of the measure vars would be based on the matching of the groups, which is then the user responsibility to get right, and not particularly difficult to get right in most circumstances.

@UweBlock
Copy link
Contributor

IMHO, using the convience function pattern() together with disordered columns leads on the wrong track, here.

You can get the correct result if you specify the column names directly as a list to measure.vars (without using patterns() and setcolorder()):

dl = melt(df, id.vars = c("id"),
          measure.vars = list(purchased = sprintf("brand%d", 0:3), 
                              price = sprintf("price%d", 0:3)),
          variable.name = "brand")
dl[order(id, brand)]
#     id brand purchased price
#  1:  1     1         0   0.0
#  2:  1     2         1   0.8
#  3:  1     3         0   0.7
#  4:  1     4         0   0.9
#  5:  2     1         0   0.0
#  6:  2     2         0   0.8
#  7:  2     3         1   0.7
#  8:  2     4         0   0.9
#  9:  3     1         1   0.0
# 10:  3     2         0   0.8
# 11:  3     3         0   0.7
# 12:  3     4         0   0.9
# 13:  4     1         0   0.0
# 14:  4     2         0   0.8
# 15:  4     3         0   0.7
# 16:  4     4         1   0.9
# 17:  5     1         0   0.0
# 18:  5     2         1   0.8
# 19:  5     3         1   0.7
# 20:  5     4         0   0.9

This works perfectly also for column names where the nameing scheme does not suggest a particular order, e.g.,

w1 <-data.table(foo = 10L, bar = 3.14159, baz = 20L, qux = 2.71828)
w1
#    foo     bar baz     qux
# 1:  10 3.14159  20 2.71828

melt(w1, measure.vars = list(c("foo", "baz"), c("bar", "qux")))
#    variable value1  value2
# 1:        1     10 3.14159
# 2:        2     20 2.71828

or with disordered columns

w2 <-data.table(foo = 10L, baz = 20L, qux = 2.71828, bar = 3.14159)
w2
#    foo baz     qux     bar
# 1:  10  20 2.71828 3.14159

melt(w2, measure.vars = list(id = c("foo", "baz"), val = c("bar", "qux")))
#    variable id     val
# 1:        1 10 3.14159
# 2:        2 20 2.71828

The result is the same (except that I have renamed the molten value columns).

For properly ordered columns, the convenience function patterns() will work as expected:

melt(w1, measure.vars = patterns(id = "foo|baz", val = "bar|qux"))
#    variable id     val
# 1:        1 10 3.14159
# 2:        2 20 2.71828

but will fail for disordered columns of course because it has no information about the column order

melt(w2, measure.vars = patterns(id = "foo|baz", val = "bar|qux"))
#    variable id     val
# 1:        1 10 2.71828
# 2:        2 20 3.14159

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants