Melt corrupts data by using column position instead of name #3487

jsams · 2019-04-02T21:00:29Z

I have a data.table with an occasionally inconsistent order to the columns of the data. Consider a dataset where users a purchasing one of three brands or the "0th" brand, and for simplicity, keep prices constant.

df = data.table(id=1:5,
                brand0=c(0,0,1,0,0),
                brand1=c(1,0,0,0,1),
                brand2=c(0,1,0,0,1),
                brand3=c(0,0,0,1,0),
                price1=rep(0.8, 5),
                price2=rep(0.7, 5),
                price3=rep(0.9, 5),
                price0=0)

melt should be using my variable name indexes for the variable.name, but it seems to use the position they appear in the column order. This is simply an error and results in subtle and silent corruption of data:

dl = melt(df, id.vars=c("id"),
          measure.vars=patterns(purchased="^brand", price="^price"),
          variable.name="brand")
setkey(dl, id, brand)

Notice the brands are now indexed from 1-4 instead of 0-3 as in the original data, annoying, but fixable. However, the columns are mixed and matched based on the order they appear and not based on the names of the columns. Thus, it indicates the first user is purchasing brand 2 at 0.7 instead of brand 1 at 0.8.

re-ordering the columns appears to fix things:

setcolorder(df, c("id", sprintf("brand%d", 0:3), sprintf("price%d", 0:3)))
dl = melt(df, id.vars=c("id"),
          measure.vars=patterns(purchased="^brand", price="^price"),
          variable.name="brand", variable.factor=F)
setkey(dl, id, brand)
dl[, brand := as.integer(brand) - 1]

I'm running R 3.5 and data.table 1.12:

R version 3.5.3 (2019-03-11)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.10

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.3.3.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] mlogit_0.4-1      lmtest_0.9-36     Formula_1.2-3     zoo_1.8-5         foreign_0.8-71    data.table_1.12.0

loaded via a namespace (and not attached):
 [1] lattice_0.20-38 MASS_7.3-51.1   grid_3.5.3      magrittr_1.5    bibtex_0.4.2    Rdpack_0.10-1   stringi_1.4.3   statmod_1.4.30  tools_3.5.3     stringr_1.4.0   compiler_3.5.3 
[12] gbRd_0.4-11

The text was updated successfully, but these errors were encountered:

franknarf1 · 2019-04-02T21:34:27Z

If you want the columns linked up in a particular order, the simplest option for now might be the list syntax:

melt(df, meas = list(purchased = paste0("brand", 0:3), price = paste0("price", 0:3)))

There's a comment here about making sure names match up, btw: #3396 (comment)

Regarding starting from 0 vs 1 in the variable column, there's an open issue about passing custom values #2551

MichaelChirico · 2019-04-03T02:36:27Z

Thanks @franknarf1, I think that covers it. @jsams please follow up on the open issues (even reacting can help), or re-open this issue with some clarification if you disagree this is a duplicate. Thanks!

jsams · 2019-04-03T03:55:38Z

I don't understand why this was closed. This is unambiguously incorrect behavior. I appreciate that there is an alternative way to get the correct output, and I had already noted a different workaround in the original report. That's not the point. The point is that there is data corruption due to mis-implementation of melt/patterns

Thus, it indicates the first user is purchasing brand 2 at 0.7 instead of brand 1 at 0.8.

The referenced feature request is related to naming, but the central point here is that data is being corrupted. A relatively high priority data corruption bug versus a convenience feature seem sufficiently different to warrant keeping this open, or upgrading the other to "bug" in recognition that this function is not behaving appropriately.

MichaelChirico · 2019-04-03T04:00:36Z

It was closed because it's a duplicate. From #3396 (comment) :

@HughParsonage patterns doesn't implicitly match on column names.. it just resolves columns into a list of column names/indices for every pattern provided. So I wouldn't classify it as a bug.

What you require could be achieved by extending the functionality of patterns (or even another function), and I think it's extremely useful too..

jsams · 2019-04-03T04:33:35Z

I was taking the other issue as the mapping of the variable names annoyance. If you want to treat it as a duplicate, that's fine, but the other should be raised to a bug. Having a dev say "oh, the function ignores variable names and just uses the indices, so it's not a bug" does not make it not a bug, it must bely that the dev must not have understood the implications of what they were saying.

Quoting Arun from the other report:

There's no reason why patterns should automatically match measure vars automatically.

This is corrupting data. That should absolutely be enough reason.

To be clear if it wasn't in the OP, if I had a columns: id, brand1, brand2, price2, price1, this is mapping brand1 -> price2 and brand2 -> price1. There's no user who wants that to happen. That is a logically incorrect thing to do.

I'm fine with, if sometimes annoyed by, missing features. However, corrupting my data is not ok. If patterns() doesn't work with melt and it's too much work to fix it, then that's completely understandable.* Pull it. But don't leave around functionality that corrupts data and hope people notice it. Then I stop trusting the software.

* First thought suggestion on making it feasible is to require a grouping expression in the arguments of patterns. e.g. above would be "brand(.*)" and "price(.*)", then the matching of the measure vars would be based on the matching of the groups, which is then the user responsibility to get right, and not particularly difficult to get right in most circumstances.

UweBlock · 2019-06-19T09:19:00Z

IMHO, using the convience function pattern() together with disordered columns leads on the wrong track, here.

You can get the correct result if you specify the column names directly as a list to measure.vars (without using patterns() and setcolorder()):

dl = melt(df, id.vars = c("id"),
          measure.vars = list(purchased = sprintf("brand%d", 0:3), 
                              price = sprintf("price%d", 0:3)),
          variable.name = "brand")
dl[order(id, brand)]
#     id brand purchased price
#  1:  1     1         0   0.0
#  2:  1     2         1   0.8
#  3:  1     3         0   0.7
#  4:  1     4         0   0.9
#  5:  2     1         0   0.0
#  6:  2     2         0   0.8
#  7:  2     3         1   0.7
#  8:  2     4         0   0.9
#  9:  3     1         1   0.0
# 10:  3     2         0   0.8
# 11:  3     3         0   0.7
# 12:  3     4         0   0.9
# 13:  4     1         0   0.0
# 14:  4     2         0   0.8
# 15:  4     3         0   0.7
# 16:  4     4         1   0.9
# 17:  5     1         0   0.0
# 18:  5     2         1   0.8
# 19:  5     3         1   0.7
# 20:  5     4         0   0.9

This works perfectly also for column names where the nameing scheme does not suggest a particular order, e.g.,

w1 <-data.table(foo = 10L, bar = 3.14159, baz = 20L, qux = 2.71828)
w1
#    foo     bar baz     qux
# 1:  10 3.14159  20 2.71828

melt(w1, measure.vars = list(c("foo", "baz"), c("bar", "qux")))
#    variable value1  value2
# 1:        1     10 3.14159
# 2:        2     20 2.71828

or with disordered columns

w2 <-data.table(foo = 10L, baz = 20L, qux = 2.71828, bar = 3.14159)
w2
#    foo baz     qux     bar
# 1:  10  20 2.71828 3.14159

melt(w2, measure.vars = list(id = c("foo", "baz"), val = c("bar", "qux")))
#    variable id     val
# 1:        1 10 3.14159
# 2:        2 20 2.71828

The result is the same (except that I have renamed the molten value columns).

For properly ordered columns, the convenience function patterns() will work as expected:

melt(w1, measure.vars = patterns(id = "foo|baz", val = "bar|qux"))
#    variable id     val
# 1:        1 10 3.14159
# 2:        2 20 2.71828

but will fail for disordered columns of course because it has no information about the column order

melt(w2, measure.vars = patterns(id = "foo|baz", val = "bar|qux"))
#    variable id     val
# 1:        1 10 2.71828
# 2:        2 20 3.14159

MichaelChirico closed this as completed Apr 3, 2019

franknarf1 mentioned this issue Apr 3, 2019

melt.data.table should offer variable to match on the name, rather than the number #3396

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Melt corrupts data by using column position instead of name #3487

Melt corrupts data by using column position instead of name #3487

jsams commented Apr 2, 2019

franknarf1 commented Apr 2, 2019

MichaelChirico commented Apr 3, 2019

jsams commented Apr 3, 2019

MichaelChirico commented Apr 3, 2019

jsams commented Apr 3, 2019 •

edited

Loading

UweBlock commented Jun 19, 2019

Melt corrupts data by using column position instead of name #3487

Melt corrupts data by using column position instead of name #3487

Comments

jsams commented Apr 2, 2019

franknarf1 commented Apr 2, 2019

MichaelChirico commented Apr 3, 2019

jsams commented Apr 3, 2019

MichaelChirico commented Apr 3, 2019

jsams commented Apr 3, 2019 • edited Loading

UweBlock commented Jun 19, 2019

jsams commented Apr 3, 2019 •

edited

Loading