Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

melt needs a variable.value parameter. #2065

Closed
skanskan opened this issue Mar 19, 2017 · 9 comments
Closed

melt needs a variable.value parameter. #2065

skanskan opened this issue Mar 19, 2017 · 9 comments

Comments

@skanskan
Copy link

skanskan commented Mar 19, 2017

In order to reshape data.tables such as this one:

mydata <- data.table(ID=1:5, ZA_2001=1:5, ZA_2002=5:1, BB_2001=rep(3,5), BB_2002=rep(6,5), CC_2007=6:2)
ID ZA_2001 ZA_2002 BB_2001 BB_2002 CC_2007
1       1       5       3       6       6
2       2       4       3       6       5
3       3       3       3       6       4
4       4       2       3       6       3
5       5       1       3       6       2

to this one

ID  measure ZA BB CC
1    2001  1  3 NA
1    2002  5  6 NA
1    2007 NA NA  6
2    2001  2  3 NA
2    2002  4  6 NA
2    2007 NA NA  5
3    2001  3  3 NA
3    2002  3  6 NA
3    2007 NA NA  4
4    2001  4  3 NA
4    2002  2  6 NA
4    2007 NA NA  3
5    2001  5  3 NA
5    2002  1  6 NA
5    2007 NA NA  2

and keep the original variables' order I need to do:

idvars =  grep("_20[0-9][0-9]$",names(mydata) , invert = TRUE)
temp <- melt(mydata, id.vars = idvars)  
temp[, `:=`(var = sub("_20[0-9][0-9]$", '', variable), 
measure = sub('.*_', '', variable), variable = NULL)]  
temp[,var:=factor(var, levels=unique(var))]
dcast( temp,   ... ~ var, value.var='value' )

But it needs too many lines of code, a lot of memory, and it forces type conversions.

It would be great if we could just do it using melt:

melt(mydata, measure.vars=patterns("ZA","BB","CC"), value.name =c("ZA","BB","CC"), id.vars="ID", variable.name="year")   

(or using regex instead).

But it doesn't work, it only produces 10 rows, and it would need an option to specify where to get the year from.

something like variable.values=sub(".*_","",names(mydata) )[-1]

or even better a syntax such as the splitstackshape package:
Reshape(mydata,id.vars="ID",var.stubs =... ,sep="_")
it would be great.

@franknarf1
Copy link
Contributor

Complete your set of columns and it'll work:

cols = setDT(tstrsplit(names(mydata)[-1], "_"))[, 
  CJ(var = V1, year = V2, unique = TRUE)][, 
  paste(var, year, sep="_")]

mydata[, setdiff(cols, names(mydata)) := NA ]
setcolorder(mydata, c("ID", cols))
melt(mydata, measure.vars=patterns("ZA","BB","CC"), value.name =c("ZA","BB","CC"), id.vars="ID", variable.name="year")
# copied your code for the last line

Sure, you still need to fill in the year values, but that's another issue that's already been filed, I guess.

@skanskan
Copy link
Author

I've found that the melt step in my solution could be a big problem when working with medium-sized datatables.
If you have a data.table with just 100000 rows x 1000 columns and use half of the columns as id.vars the output is 50000000 x 500, just too much to continue.

@skanskan
Copy link
Author

skanskan commented Mar 20, 2017

I guess you can get the years' column with

temp <- melt(mydata, measure.vars=patterns("ZA","BB","CC"),
   value.name =c("ZA","BB","CC"), id.vars="ID", variable.name="year")

temp[,year := rep(unique(sub(".*_", "", names(mydata)[-1])),
   length.out=nrow(temp))]

or also using data from your CJ.
And adding the options sorted=F to CJ.

@skanskan
Copy link
Author

skanskan commented Mar 20, 2017

@franknarf1 , any direct way to get this?

ID measure ZA BB CC
1 2001 1 3 NA
1 2002 5 6 NA
1 2007 NA NA 6
2 2001 2 3 NA
2 2002 4 6 NA
2 2007 NA NA 5
3 2001 3 3 NA
3 2002 3 6 NA
3 2007 NA NA 4
4 2001 4 3 NA
4 2002 2 6 NA
4 2007 NA NA 3
5 2001 5 3 NA
5 2002 1 6 NA
5 2007 NA NA 2

instead of this: (your method)

ID year ZA BB CC
1 2001 1 3 NA
2 2002 2 3 NA
3 2007 3 3 NA
4 2001 4 3 NA
5 2002 5 3 NA
1 2007 5 6 NA
2 2001 4 6 NA
3 2002 3 6 NA
4 2007 2 6 NA
5 2001 1 6 NA
1 2002 NA NA 6
2 2007 NA NA 5
3 2001 NA NA 4
4 2002 NA NA 3
5 2007 NA NA 2

without reordeing the result explicitily?

@franknarf1
Copy link
Contributor

any direct way to get this?

Well, there's

mydata[, melt(.SD, measure.vars=patterns("ZA","BB","CC"),
 value.name =c("ZA","BB","CC"), variable.name="year"), by=ID]

However, doing this by group is a lot less efficient, probably.

@arunsrinivasan
Copy link
Member

arunsrinivasan commented Mar 30, 2017

There are no plans to add new arguments to melt / dcast. However there could be clever ways of extracting levels to variable column automatically from the input pattern.. although I don't see it too often a case.

@tdhock
Copy link
Member

tdhock commented Sep 29, 2020

hi the missing input columns in this melt is similar to #4027 and is fixed by #4720 if you use the new function in nc package (which uses data.table internally), not merged into master yet but you can use it via:

remotes::install_github(c("Rdatatable/data.table@fix4027", "tdhock/nc@multiple-fill"))
#> Skipping install of 'data.table' from a github remote, the SHA1 (4c5810c2) has not changed since last install.
#>   Use `force = TRUE` to force installation
#> Skipping install of 'nc' from a github remote, the SHA1 (11b61f8e) has not changed since last install.
#>   Use `force = TRUE` to force installation
library(data.table)
mydata <- data.table(ID=1:5, ZA_2001=1:5, ZA_2002=5:1, BB_2001=rep(3,5), BB_2002=rep(6,5), CC_2007=6:2)
(mydata.tall <- nc::capture_melt_multiple(
  mydata,
  column=".*",
  "_",
  year="[0-9]+", as.integer,
  fill=TRUE))
#>     ID year BB CC ZA
#>  1:  1 2001  3 NA  1
#>  2:  2 2001  3 NA  2
#>  3:  3 2001  3 NA  3
#>  4:  4 2001  3 NA  4
#>  5:  5 2001  3 NA  5
#>  6:  1 2002  6 NA  5
#>  7:  2 2002  6 NA  4
#>  8:  3 2002  6 NA  3
#>  9:  4 2002  6 NA  2
#> 10:  5 2002  6 NA  1
#> 11:  1 2007 NA  6 NA
#> 12:  2 2007 NA  5 NA
#> 13:  3 2007 NA  4 NA
#> 14:  4 2007 NA  3 NA
#> 15:  5 2007 NA  2 NA
mydata.tall[order(ID, year)]
#>     ID year BB CC ZA
#>  1:  1 2001  3 NA  1
#>  2:  1 2002  6 NA  5
#>  3:  1 2007 NA  6 NA
#>  4:  2 2001  3 NA  2
#>  5:  2 2002  6 NA  4
#>  6:  2 2007 NA  5 NA
#>  7:  3 2001  3 NA  3
#>  8:  3 2002  6 NA  3
#>  9:  3 2007 NA  4 NA
#> 10:  4 2001  3 NA  4
#> 11:  4 2002  6 NA  2
#> 12:  4 2007 NA  3 NA
#> 13:  5 2001  3 NA  5
#> 14:  5 2002  6 NA  1
#> 15:  5 2007 NA  2 NA

@tdhock tdhock closed this as completed Sep 29, 2020
@tdhock
Copy link
Member

tdhock commented Oct 1, 2020

here is another solution using only data.table::melt,

remotes::install_github("Rdatatable/data.table@fix4027")
#> Skipping install of 'data.table' from a github remote, the SHA1 (4c5810c2) has not changed since last install.
#>   Use `force = TRUE` to force installation
library(data.table)
mydata <- data.table(ID=1:5, ZA_2001=1:5, ZA_2002=5:1, BB_2001=rep(3,5), BB_2002=rep(6,5), CC_2007=6:2)
(mydata.tall <- melt(mydata, measure.vars=list(
  ZA=c("ZA_2001", "ZA_2002", NA),
  BB=c("BB_2001", "BB_2002", NA),
  CC=c(NA, NA, "CC_2007"))))
#>     ID variable ZA BB CC
#>  1:  1        1  1  3 NA
#>  2:  2        1  2  3 NA
#>  3:  3        1  3  3 NA
#>  4:  4        1  4  3 NA
#>  5:  5        1  5  3 NA
#>  6:  1        2  5  6 NA
#>  7:  2        2  4  6 NA
#>  8:  3        2  3  6 NA
#>  9:  4        2  2  6 NA
#> 10:  5        2  1  6 NA
#> 11:  1        3 NA NA  6
#> 12:  2        3 NA NA  5
#> 13:  3        3 NA NA  4
#> 14:  4        3 NA NA  3
#> 15:  5        3 NA NA  2
mydata.tall[, measure := c(2001, 2002, 2007)[variable] ]
mydata.tall[order(ID, measure), .(
  ID, measure, ZA, BB, CC)]
#>     ID measure ZA BB CC
#>  1:  1    2001  1  3 NA
#>  2:  1    2002  5  6 NA
#>  3:  1    2007 NA NA  6
#>  4:  2    2001  2  3 NA
#>  5:  2    2002  4  6 NA
#>  6:  2    2007 NA NA  5
#>  7:  3    2001  3  3 NA
#>  8:  3    2002  3  6 NA
#>  9:  3    2007 NA NA  4
#> 10:  4    2001  4  3 NA
#> 11:  4    2002  2  6 NA
#> 12:  4    2007 NA NA  3
#> 13:  5    2001  5  3 NA
#> 14:  5    2002  1  6 NA
#> 15:  5    2007 NA NA  2

@tdhock
Copy link
Member

tdhock commented Oct 7, 2020

pure data.table solution using #4731

remotes::install_github("Rdatatable/data.table@melt-custom-variable")
#> Skipping install of 'data.table' from a github remote, the SHA1 (c02fa9e8) has not changed since last install.
#>   Use `force = TRUE` to force installation
library(data.table)
mydata <- data.table(ID=1:5, ZA_2001=1:5, ZA_2002=5:1, BB_2001=rep(3,5), BB_2002=rep(6,5), CC_2007=6:2)
melt(mydata, measure.vars=measure(value.name, year))
#>     ID year ZA BB CC
#>  1:  1 2001  1  3 NA
#>  2:  2 2001  2  3 NA
#>  3:  3 2001  3  3 NA
#>  4:  4 2001  4  3 NA
#>  5:  5 2001  5  3 NA
#>  6:  1 2002  5  6 NA
#>  7:  2 2002  4  6 NA
#>  8:  3 2002  3  6 NA
#>  9:  4 2002  2  6 NA
#> 10:  5 2002  1  6 NA
#> 11:  1 2007 NA NA  6
#> 12:  2 2007 NA NA  5
#> 13:  3 2007 NA NA  4
#> 14:  4 2007 NA NA  3
#> 15:  5 2007 NA NA  2

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants