Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

All non-atomic types are now converted to lists in rbindlist #4196

Open
wants to merge 21 commits into
base: master
Choose a base branch
from

Conversation

sritchie73
Copy link
Contributor

@sritchie73 sritchie73 commented Jan 24, 2020

Closes #546 and fixes #3811

Example:

> A = data.table(c1 = 1, c2 = 'asd', c3 = expression(as.character(Sys.time())))
> B = data.table(c1 = 3, c2 = 'qwe', c3 = expression(as.character(Sys.time()+5)))
> rbind(A,B)
      c1     c2              c3
   <num> <char>          <list>
1:     1    asd <expression[1]>
2:     3    qwe <expression[1]>
> rbind(A,B)$c3
[[1]]
expression(as.character(Sys.time()))

[[2]]
expression(as.character(Sys.time() + 5))

A consequence of this PR is that where a data.table has > 1 row and an expression column, each expression is separated out into its own list element:

> A = data.table(c1 = 1, c2 = 'asd', c3 = expression(as.character(Sys.time())))
> B = data.table(c1 = 3:5, c2 = c('qwe', "foo", "bar"), 
+                c3 = expression(1+1, print("test"), as.character(Sys.time()+5)))
> rbind(A,B)
      c1     c2              c3
   <num> <char>          <list>
1:     1    asd <expression[1]>
2:     3    qwe <expression[1]>
3:     4    foo <expression[1]>
4:     5    bar <expression[1]>
> rbind(A,B)$c3
[[1]]
expression(as.character(Sys.time()))

[[2]]
expression(1 + 1)

[[3]]
expression(print("test"))

[[4]]
expression(as.character(Sys.time() + 5))

I don't know if this is desired behaviour, or whether we should default back to base R's coercion rules (i.e. returning an expression vector) - presumably if the user has constructed an expression vector column of length > 1 they would desire an expression vector back? E.g. #4040 is requesting expression columns to be preserved.

Any non-allowed column not listed in the typeorder in src/init.c are now appropriately detected in src/rbindlist.c. In these cases, the column type becomes a list, with each element wrapped in a list.
src/rbindlist.c Outdated Show resolved Hide resolved
@codecov
Copy link

codecov bot commented Jan 24, 2020

Codecov Report

Attention: 1 lines in your changes are missing coverage. Please review.

Comparison is base (c005296) 99.61% compared to head (00c6d85) 99.59%.
Report is 714 commits behind head on master.

Files Patch % Lines
src/utils.c 95.23% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #4196      +/-   ##
==========================================
- Coverage   99.61%   99.59%   -0.02%     
==========================================
  Files          72       72              
  Lines       13871    13938      +67     
==========================================
+ Hits        13817    13882      +65     
- Misses         54       56       +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

src/rbindlist.c Outdated Show resolved Hide resolved
@sritchie73 sritchie73 added the WIP label Jan 24, 2020
src/rbindlist.c Outdated
@@ -514,9 +514,33 @@ SEXP rbindlist(SEXP l, SEXP usenamesArg, SEXP fillArg, SEXP idcolArg)
if (w==-1 || !length(thisCol=VECTOR_ELT(li, w))) { // !length for zeroCol warning above; #1871
writeNA(target, ansloc, thisnrow); // writeNA is integer64 aware and writes INT64_MIN
} else {
if ((TYPEOF(target)==VECSXP || TYPEOF(target)==EXPRSXP) && TYPEOF(thisCol)!=TYPEOF(target)) {
if ((TYPEOF(target)==VECSXP) && (isVectorAtomic(thisCol) || TYPEOF(thisCol)==LISTSXP)) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming the desired behaviour here is for a pairlist to be converted to a list, rather than being converted to a list where each element is a 1-element pairlist.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

example of both would help

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this makes sense - my understanding is a pairlist is a linked-list, while a regular list is essentially a vector of pointers to the list elements. If we treat a LISTSXP the same way as an EXPRSXP, then you end up with a list, where each element is a pointer to each individual node in the linked list. You end up lose any benefits of having the linked list, because each list element is just a pointer to a 1-node linked list.

E.g the result is something like:

# Linked list, A -> B -> C
ll = pairlist(A=1, B=pairlist(1, 2:3), C='foo')

# Current rbindlist behaviour turns top level pairlist into list (losing the names)
cb = list(1, pairlist(1, 2:3), 'foo')

# Alternate rbindlist behaviour if we treat a pairlist in the same way as expression:
ab = list(pairlist(A=1), pairlist(B=pairlist(1, 2:3)), pairlist(C='foo'))

src/rbindlist.c Outdated
SEXP thisColList = PROTECT(allocVector(VECSXP, length(thisCol))); nprotect++;
for(int r=0; r<length(thisCol); ++r) {
SET_VECTOR_ELT(thisColList, r, thisCol);
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Demonstration of the recycling issue and why we need this for loop:

> A = data.table(a=0L, B=quote(1+1))
> A
       a      B
   <int> <call>
1:     0  1 + 1
2:     0  1 + 1
3:     0  1 + 1

The for loop means we get the right result with rbind:

> rbind(A)
       a         B
   <int>    <list>
1:     0 <call[3]>
2:     0 <call[3]>
3:     0 <call[3]>

Otherwise we would have

> rbind(A)
       a         B
   <int>    <list>
1:     0 <call[3]>
2:     0 
3:     0

It would be good to fix this in data.table()/as.data.table() so that these non-vector types are wrapped in list at the time of data.table construction

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree, very nice demo

@sritchie73 sritchie73 changed the title Expression columns now supported by rbindlist All exotic types are now converted to lists in rbindlist Jan 24, 2020
@sritchie73
Copy link
Contributor Author

I've also come across this unexpected behaviour:

> A = data.table(c1=0, c2=list(1:3))
> B = data.table(c1=1)
> rbind(A,B,fill=TRUE)
      c1     c2
   <num> <list>
1:     0  1,2,3
2:     1       

When fill=TRUE and the missing column is list type, then the corresponding entries are filled with NULL elements. Should these be filled with NA_logical_ instead?.

@sritchie73 sritchie73 changed the title All exotic types are now converted to lists in rbindlist All non-atomic types are now converted to lists in rbindlist Jan 25, 2020
src/rbindlist.c Outdated
// Exotic non-atomic types need each element to be wrapped in a list, e.g. expression vectors #546
if ((TYPEOF(target)==VECSXP) && (isVectorAtomic(thisCol) || TYPEOF(thisCol)==LISTSXP)) {
// do an as.list() on the atomic column; #3528
// pairlists (LISTSXP) can also be coerced to lists using coerceVector
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does it coerce recursively unwrapping nested parilists into single list of many elements?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only the top level pairlist is coerced to a list:

> A = data.table(c1 = 1:3, c2 = pairlist(A=1, B=list(1, 2:3), C=pairlist(C1=1, C2=3:4)))
> B = data.table(c1 = 2, c2 = 'asd')
> rbind(A,B)
      c1            c2
   <num>        <list>
1:     1             1
2:     2     <list[2]>
3:     3 <pairlist[2]>
4:     2           asd

@jangorecki
Copy link
Member

Filling with NULL in such a case looks seems fine for me.

@sritchie73 sritchie73 removed the WIP label Jan 25, 2020
@sritchie73
Copy link
Contributor Author

It departs from the expected behaviour of fill=TRUE for all other column types though.

@sritchie73
Copy link
Contributor Author

TODO: list wrapper of atomic vectors is not class aware.

@sritchie73
Copy link
Contributor Author

Wrapping atomic vectors in list now preserves class. The integer64 case errors because memrecycle copies the integer64 class to the list

TODO: fix or handle

@sritchie73 sritchie73 removed the WIP label Jan 27, 2020
@sritchie73
Copy link
Contributor Author

Wrapping atomic vectors in list now correctly preserves class.

@jangorecki
Copy link
Member

jangorecki commented Apr 5, 2020

does this PR needs update after #546 (comment) ? @sritchie73

@sritchie73
Copy link
Contributor Author

@jangorecki I'm not sure: I think the issue would be that if you do not wrap each expression in a list, then the length of the expression column will not match the length of the other non-expression columns.

@jangorecki
Copy link
Member

jangorecki commented May 18, 2020

unless length matches, that would happen when each row is meant to store a language object, rather than expression.

@jangorecki
Copy link
Member

jangorecki commented May 18, 2020

I was going to provide simple example but I noticed that expression looks to be recycled, making the results not what I really wanted

l=list(x=1:2, y=2:3, expr=expression(1+2,2+3))
sapply(l, length)
#   x    y expr 
#   2    2    2 
as.data.table(l)
#       x     y                     expr
#   <int> <int>                   <expr>
#1:     1     2 expression(1 + 2, 2 + 3)
#2:     2     3 expression(1 + 2, 2 + 3)

but after closer look, it seems that the issue is in print data.table, because expr column is not recycled

as.data.table(l)$expr
expression(1 + 2, 2 + 3)

@MichaelChirico any idea if there is an open issue for printing expression column behavior?

@jangorecki
Copy link
Member

If expression would always store the language objects then using it a column would probably make more sense.
But the following behavior makes me think it is just easier to convert that to list

lapply(expression(1+2,3), class)
#[[1]]
#[1] "call"
#
#[[2]]
#[1] "numeric"

@sritchie73
Copy link
Contributor Author

That's essentially the conclusion I came to for any object that's not considered internally to be a vector by R (i.e. !isVector() == true in the R's C interface) because they have a length property that does not correspond to the number of elements. This causes problems when trying to print, and also errors when these types are in the first column of the data.table (because the first column is used to determine .N)

@jangorecki
Copy link
Member

Agree, could you please ensure PR is ready to review? otherwise better to mark it as WIP.

@sritchie73
Copy link
Contributor Author

I've just reviewed the unit tests and run them; the current behaviour of this PR is to wrap each expression element in a list, even if the input expression column is a vector of length > 1:

A = data.table(c1 = 1:2, c2 = c('asd', 'xzy'), c3 = expression(as.character(Sys.time()), 1+2))
B = data.table(c1 = 3:5, c2 = c('qwe', "foo", "bar"), 
               c3 = expression(1+1, print("test"), as.character(Sys.time()+5)))
dt = rbind(A,B)
dt[, c3]
[[1]]
expression(as.character(Sys.time()))

[[2]]
expression(1 + 2)

[[3]]
expression(1 + 1)

[[4]]
expression(print("test"))

[[5]]
expression(as.character(Sys.time() + 5))

I've added a unit test to cover this case.

@MichaelChirico
Copy link
Member

MichaelChirico commented May 19, 2020

any idea if there is an open issue for printing expression column behavior?

Sounds somewhat familiar (it's related to #3011) but I didn't see anything in debugonce.

What's happening is format(l$expr) returns length-one, then cbind recycles this.

This will be better addressed by #3414 -- in fact I already see a format_col.expression method there.

@sritchie73
Copy link
Contributor Author

I'll need to do some digging later. I don't see how that line isn't covered given all the other lines in that for loop are covered...

@sritchie73
Copy link
Contributor Author

I injected a few Rprintf statements and can confirm that the tests are running code within that loop before and after the line that's supposedly not covered, so I'm not sure what to make of this:

Running this test:

# Test non-atomic columns in rbindlist, e.g. expressions, #546
A = data.table(c1 = 1, c2 = 'asd', c3 = expression(as.character(Sys.time())))
B = data.table(c1 = 3, c2 = 'qwe', c3 = expression(as.character(Sys.time()+5)))
DT = data.table(c1 = c(1, 3), c2 = c("asd", "qwe"), 
                c3 = list(expression(as.character(Sys.time())), expression(as.character(Sys.time() + 5))))
test(2129.1, rbind(A,B), DT)

I get both messages below output to the R console:

      Rprintf("Inside loop\n");
      // Allocate length-1 vector of appropriate type and set it in the right place in the VECSXP container
      SEXP thisElement = PROTECT(allocVector(TYPEOF(x), 1)); (*nprotect)++;
      SET_VECTOR_ELT(coerced, i, thisElement); UNPROTECT(1); (*nprotect)--; // line supposedly not covered 
      Rprintf("After SET_VECTOR_ELT\n");

@MichaelChirico
Copy link
Member

@sritchie73 just checking if this is still needed? If so we can merge to master and resume the review process (nearly 4 years later 😅).

Also if you'd like someone else to take over the PR, do feel free.

cc @ben-schwen since I know you've been working a lot on rbindlist lately.

@jangorecki
Copy link
Member

It may be better to merge other rbindlist PR(s) (Ben had something recently from where we took only regression fix), and once the other one is merged master, then merge master to this one.

@MichaelChirico
Copy link
Member

#5453 is the other pending rbindlist() PR. Let's revisit this after that's addressed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[R-Forge #5252] rbindlist should support expression columns
3 participants