Commit 456a8e6

cbindlist
add cbind by reference, timing R prototype of mergelist wording use lower overhead funs stick to int32 for now, correct R_alloc bmerge C refactor for codecov and one loop for speed address revealed codecov gaps refactor vecseq for codecov seqexp helper, some alloccol export on C bmerge codecov, types handled in R bmerge already better comment seqexp bmerge mult=error #655 multiple new C utils swap if branches explain new C utils comments mostly reduce conflicts to PR #4386 comment C code address multiple matches during update-on-join #3747 Revert "address multiple matches during update-on-join #3747" This reverts commit b64c0c3. merge.dt has temporarily mult arg, for testing minor changes to cbindlist c dev mergelist, for single pair now add quiet option to cc() mergelist tests add check for names to perhaps.dt rm mult from merge.dt method rework, clean, polish multer, fix righ and full joins make full join symmetric mergepair inner function to loop on extra check for symmetric mergelist manual ensure no df-dt passed where list expected comments and manual handle 0 cols tables more tests more tests and debugging move more logic closer to bmerge, simplify mergepair more tests revert not used changes reduce not needed checks, cleanup copy arg behavior, manual, no tests yet cbindlist manual, export both cleanup processing bmerge to dtmatch test function match order for easier preview vecseq gets short-circuit batch test allow browser big cleanup remmove unneeded stuff, reduce diff more cleanup, minor manual fixes add proper test scripts Merge branch 'master' into cbind-merge-list comment out not used code for coverage more tests, some nocopy opts rename sql test script, should fix codecov simplify dtmatch inner branch more precise copy, now copy only T or F unused arg not yet in api, wording comments and refer issues codecov hasindex coverage codecov gap tests for join using key, cols argument fix missing import forderv more tests, improve missing on handling more 
tests for order of inner and full join for long keys new allow.cartesian option, #4383, #914 reduce diff, improve codecov reduce diff, comments need more DT, not lists, mergelist 3+ tbls proper escape heavy check unit tests more tests, address overalloc failure mergelist and cbindlist retain index manual, examples fix manual minor clarify in manual retain keys, right outer join for snowflake schema joins duplicates in cbindlist recycling in cbindlist escape 0 input in copyCols empty input handling closing cbindlist vectorized _on_ and _join.many_ arg rename dtmatch to dtmerge vectorized args: how, mult push down input validation add support for cross join, semi join, anti join full join, reduce overhead for mult=error mult default value dynamic fix manual add "see details" to Rd mention shared on in arg description amend feedback from Michael semi and anti joins will not reorder x columns Merge branch 'master' into cbind-merge-list spelling, thx to @jan-glx check all new funs used and add comments bugfix, sort=T needed for now Merge branch 'master' into cbind-merge-list Update NEWS.md Merge branch 'master' into cbind-merge-list Merge branch 'master' into cbind-merge-list NEWS placement numbering ascArg->order Merge remote-tracking branch 'origin/cbind-merge-list' into cbind-merge-list attempt to restore from master Update to stopf() error style Need isFrame for now More quality checks: any(!x)->!all(x); use vapply_1{b,c,i} really restore from master try to PROTECT() before duplicate() update error message in test appease the rchk gods extraneous space missing ';' use catf simplify perhapsDataTableR move sqlite.Rraw.manual into other.Rraw simplify for loop Merge remote-tracking branch 'origin/cbind-merge-list' into cbind-merge-list
1 parent 392a321 commit 456a8e6

3 files changed: +57 -28 lines changed
R/data.table.R

Lines changed: 1 addition & 1 deletion
@@ -199,7 +199,7 @@ replace_dot_alias = function(e) {
     }
     return(x)
   }
-  if (!mult %chin% c("first","last","all")) stopf("mult argument can only be 'first', 'last' or 'all'")
+  if (!mult %chin% c("first", "last", "all")) stopf("mult argument can only be 'first', 'last' or 'all'")
   missingroll = missing(roll)
   if (length(roll)!=1L || is.na(roll)) stopf("roll must be a single TRUE, FALSE, positive/negative integer/double including +Inf and -Inf or 'nearest'")
   if (is.character(roll)) {

src/bmerge.c

Lines changed: 44 additions & 21 deletions
@@ -49,8 +49,10 @@ SEXP bmerge(SEXP idt, SEXP xdt, SEXP icolsArg, SEXP xcolsArg, SEXP xoArg, SEXP r
   // iArg, xArg, icolsArg and xcolsArg
   idtVec = SEXPPTR_RO(idt); // set globals so bmerge_r can see them.
   xdtVec = SEXPPTR_RO(xdt);
-  if (!isInteger(icolsArg)) internal_error(__func__, "icols is not integer vector"); // # nocov
-  if (!isInteger(xcolsArg)) internal_error(__func__, "xcols is not integer vector"); // # nocov
+  if (!isInteger(icolsArg))
+    internal_error(__func__, "icols is not integer vector"); // # nocov
+  if (!isInteger(xcolsArg))
+    internal_error(__func__, "xcols is not integer vector"); // # nocov
   if ((LENGTH(icolsArg)==0 || LENGTH(xcolsArg)==0) && LENGTH(idt)>0) // We let through LENGTH(i) == 0 for tests 2126.*
     internal_error(__func__, "icols and xcols must be non-empty integer vectors");
   if (LENGTH(icolsArg) > LENGTH(xcolsArg)) internal_error(__func__, "length(icols) [%d] > length(xcols) [%d]", LENGTH(icolsArg), LENGTH(xcolsArg)); // # nocov
@@ -60,25 +62,33 @@ SEXP bmerge(SEXP idt, SEXP xdt, SEXP icolsArg, SEXP xcolsArg, SEXP xoArg, SEXP r
   iN = ilen = anslen = LENGTH(idt) ? LENGTH(VECTOR_ELT(idt,0)) : 0;
   ncol = LENGTH(icolsArg); // there may be more sorted columns in x than involved in the join
   for(int col=0; col<ncol; col++) {
-    if (icols[col]==NA_INTEGER) internal_error(__func__, "icols[%d] is NA", col); // # nocov
-    if (xcols[col]==NA_INTEGER) internal_error(__func__, "xcols[%d] is NA", col); // # nocov
-    if (icols[col]>LENGTH(idt) || icols[col]<1) error(_("icols[%d]=%d outside range [1,length(i)=%d]"), col, icols[col], LENGTH(idt));
-    if (xcols[col]>LENGTH(xdt) || xcols[col]<1) error(_("xcols[%d]=%d outside range [1,length(x)=%d]"), col, xcols[col], LENGTH(xdt));
+    if (icols[col]==NA_INTEGER)
+      internal_error(__func__, "icols[%d] is NA", col); // # nocov
+    if (xcols[col]==NA_INTEGER)
+      internal_error(__func__, "xcols[%d] is NA", col); // # nocov
+    if (icols[col]>LENGTH(idt) || icols[col]<1)
+      error(_("icols[%d]=%d outside range [1,length(i)=%d]"), col, icols[col], LENGTH(idt));
+    if (xcols[col]>LENGTH(xdt) || xcols[col]<1)
+      error(_("xcols[%d]=%d outside range [1,length(x)=%d]"), col, xcols[col], LENGTH(xdt));
     int it = TYPEOF(VECTOR_ELT(idt, icols[col]-1));
     int xt = TYPEOF(VECTOR_ELT(xdt, xcols[col]-1));
-    if (iN && it!=xt) error(_("typeof x.%s (%s) != typeof i.%s (%s)"), CHAR(STRING_ELT(getAttrib(xdt,R_NamesSymbol),xcols[col]-1)), type2char(xt), CHAR(STRING_ELT(getAttrib(idt,R_NamesSymbol),icols[col]-1)), type2char(it));
+    if (iN && it!=xt)
+      error(_("typeof x.%s (%s) != typeof i.%s (%s)"), CHAR(STRING_ELT(getAttrib(xdt,R_NamesSymbol),xcols[col]-1)), type2char(xt), CHAR(STRING_ELT(getAttrib(idt,R_NamesSymbol),icols[col]-1)), type2char(it));
     if (iN && it!=LGLSXP && it!=INTSXP && it!=REALSXP && it!=STRSXP)
       error(_("Type '%s' is not supported for joining/merging"), type2char(it));
   }

   // rollArg, rollendsArg
   roll = 0.0; rollToNearest = FALSE;
   if (isString(rollarg)) {
-    if (strcmp(CHAR(STRING_ELT(rollarg,0)),"nearest") != 0) error(_("roll is character but not 'nearest'"));
-    if (ncol>0 && TYPEOF(VECTOR_ELT(idt, icols[ncol-1]-1))==STRSXP) error(_("roll='nearest' can't be applied to a character column, yet."));
+    if (strcmp(CHAR(STRING_ELT(rollarg,0)),"nearest") != 0)
+      error(_("roll is character but not 'nearest'"));
+    if (ncol>0 && TYPEOF(VECTOR_ELT(idt, icols[ncol-1]-1))==STRSXP)
+      error(_("roll='nearest' can't be applied to a character column, yet."));
     roll=1.0; rollToNearest=TRUE; // the 1.0 here is just any non-0.0, so roll!=0.0 can be used later
   } else {
-    if (!isReal(rollarg)) internal_error(__func__, "roll is not character or double"); // # nocov
+    if (!isReal(rollarg))
+      internal_error(__func__, "roll is not character or double"); // # nocov
     roll = REAL(rollarg)[0]; // more common case (rolling forwards or backwards) or no roll when 0.0
   }
   rollabs = fabs(roll);
@@ -97,10 +107,14 @@ SEXP bmerge(SEXP idt, SEXP xdt, SEXP icolsArg, SEXP xcolsArg, SEXP xoArg, SEXP r
   }

   // mult arg
-  if (!strcmp(CHAR(STRING_ELT(multArg, 0)), "all")) mult = ALL;
-  else if (!strcmp(CHAR(STRING_ELT(multArg, 0)), "first")) mult = FIRST;
-  else if (!strcmp(CHAR(STRING_ELT(multArg, 0)), "last")) mult = LAST;
-  else internal_error(__func__, "invalid value for 'mult'"); // # nocov
+  if (!strcmp(CHAR(STRING_ELT(multArg, 0)), "all"))
+    mult = ALL;
+  else if (!strcmp(CHAR(STRING_ELT(multArg, 0)), "first"))
+    mult = FIRST;
+  else if (!strcmp(CHAR(STRING_ELT(multArg, 0)), "last"))
+    mult = LAST;
+  else
+    internal_error(__func__, "invalid value for 'mult'"); // # nocov

   // opArg
   if (!isInteger(opArg) || length(opArg)!=ncol)
@@ -131,7 +145,8 @@ SEXP bmerge(SEXP idt, SEXP xdt, SEXP icolsArg, SEXP xcolsArg, SEXP xoArg, SEXP r
     retLength = R_Calloc(anslen, int);
     retIndex = R_Calloc(anslen, int);
     // initialise retIndex here directly, as next loop is meant for both equi and non-equi joins
-    for (int j=0; j<anslen; j++) retIndex[j] = j+1;
+    for (int j=0; j<anslen; j++)
+      retIndex[j] = j+1;
   } else { // equi joins (or) non-equi join but no multiple matches
     retFirstArg = PROTECT(allocVector(INTSXP, anslen));
     retFirst = INTEGER(retFirstArg);
@@ -144,9 +159,11 @@ SEXP bmerge(SEXP idt, SEXP xdt, SEXP icolsArg, SEXP xcolsArg, SEXP xoArg, SEXP r
   for (int j=0; j<anslen; j++) {
     // defaults need to populated here as bmerge_r may well not touch many locations, say if the last row of i is before the first row of x.
     retFirst[j] = nomatch; // default to no match for NA goto below
-    // retLength[j] = 0; // TO DO: do this to save the branch below and later branches at R level to set .N to 0
-    retLength[j] = nomatch==0 ? 0 : 1;
   }
+  // retLength[j] = 0; // TO DO: do this to save the branch below and later branches at R level to set .N to 0
+  int retLengthVal = (int)(nomatch != 0);
+  for (int j=0; j<anslen; j++)
+    retLength[j] = retLengthVal;

   // allLen1Arg
   allLen1Arg = PROTECT(allocVector(LGLSXP, 1));
@@ -172,7 +189,8 @@ SEXP bmerge(SEXP idt, SEXP xdt, SEXP icolsArg, SEXP xcolsArg, SEXP xoArg, SEXP r
   // xo arg
   xo = NULL;
   if (length(xoArg)) {
-    if (!isInteger(xoArg)) internal_error(__func__, "xoArg is not an integer vector"); // # nocov
+    if (!isInteger(xoArg))
+      internal_error(__func__, "xoArg is not an integer vector"); // # nocov
     xo = INTEGER(xoArg);
   }

@@ -389,10 +407,15 @@ void bmerge_r(int xlowIn, int xuppIn, int ilowIn, int iuppIn, int col, int thisg
       // final two 1's are lowmax and uppmax
     } else {
       int len = xupp-xlow-1+rollLow+rollUpp; // rollLow and rollUpp cannot both be true
-      if (mult==ALL && len>1) allLen1[0] = FALSE;
+      if (len>1) {
+        if (mult==ALL)
+          allLen1[0] = FALSE; // bmerge()$allLen1
+        else if (mult==ERR)
+          error("mult='error' and multiple matches during merge");
+      }
       if (nqmaxgrp == 1) {
-        const int rf = (mult!=LAST) ? xlow+2-rollLow : xupp+rollUpp; // extra +1 for 1-based indexing at R level
-        const int rl = (mult==ALL) ? len : 1;
+        const int rf = (mult!=LAST) ? xlow+2-rollLow : xupp+rollUpp; // bmerge()$starts thus extra +1 for 1-based indexing at R level
+        const int rl = (mult==ALL) ? len : 1; // bmerge()$lens
         for (int j=ilow+1; j<iupp; j++) { // usually iterates once only for j=ir
           const int k = o ? o[j]-1 : j;
           retFirst[k] = rf;
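
The multiple-match handling in the `bmerge_r` hunk above can be read in isolation as a small pure function. The following is only a sketch under stated assumptions: it ignores rolling joins (`rollLow` and `rollUpp` are taken as 0), it reports the `mult='error'` case through a flag instead of calling R's `error()`, and the names `mult_t`, `match_t` and `resolve_match` are hypothetical and not part of data.table.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the ALL/FIRST/LAST/ERR constants used in the diff. */
typedef enum { MULT_ALL, MULT_FIRST, MULT_LAST, MULT_ERR } mult_t;

/* start: 1-based first matching row in x; len: number of rows returned;
   ok: false when mult is MULT_ERR and more than one row matched. */
typedef struct { int start; int len; bool ok; } match_t;

/* xlow and xupp are 0-based exclusive bounds around the run of matching rows,
   so the run holds xupp-xlow-1 rows. Rolling is ignored (assumption). */
match_t resolve_match(int xlow, int xupp, mult_t mult, bool *allLen1) {
  assert(xupp - xlow >= 2);               /* precondition: at least one match */
  const int len = xupp - xlow - 1;
  match_t m = { 0, 0, true };
  if (len > 1) {
    if (mult == MULT_ALL) {
      *allLen1 = false;                   /* some i row matches several x rows */
    } else if (mult == MULT_ERR) {
      m.ok = false;                       /* the diff raises an R error here */
      return m;
    }
  }
  m.start = (mult != MULT_LAST) ? xlow + 2 : xupp;  /* +1 for 1-based indexing */
  m.len = (mult == MULT_ALL) ? len : 1;
  return m;
}
```

For a run covering 0-based rows 0 to 2 (`xlow = -1`, `xupp = 3`), `MULT_ALL` gives start 1 and length 3, `MULT_FIRST` gives start 1 and length 1, and `MULT_LAST` gives start 3 and length 1.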

src/vecseq.c

Lines changed: 12 additions & 6 deletions
@@ -10,9 +10,12 @@ SEXP vecseq(SEXP x, SEXP len, SEXP clamp)
   // Specially for use by [.data.table after binary search. Now so specialized that for general use
   // bit::vecseq is recommended (Jens has coded it in C now).

-  if (!isInteger(x)) error(_("x must be an integer vector"));
-  if (!isInteger(len)) error(_("len must be an integer vector"));
-  if (LENGTH(x) != LENGTH(len)) error(_("x and len must be the same length"));
+  if (!isInteger(x))
+    error(_("x must be an integer vector")); // # nocov
+  if (!isInteger(len))
+    error(_("len must be an integer vector")); // # nocov
+  if (LENGTH(x) != LENGTH(len))
+    error(_("x and len must be the same length")); // # nocov
   const int *ix = INTEGER(x);
   const int *ilen = INTEGER(len), nlen=LENGTH(len);
   int reslen = 0;
@@ -22,10 +25,13 @@ SEXP vecseq(SEXP x, SEXP len, SEXP clamp)
     reslen += ilen[i];
   }
   if (!isNull(clamp)) {
-    if (!isNumeric(clamp) || LENGTH(clamp)!=1) error(_("clamp must be a double vector length 1"));
+    if (!isNumeric(clamp) || LENGTH(clamp)!=1)
+      error(_("clamp must be a double vector length 1")); // # nocov
     double limit = REAL(clamp)[0];
-    if (limit<0) error(_("clamp must be positive"));
-    if (reslen>limit) error(_("Join results in %d rows; more than %d = nrow(x)+nrow(i). Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and data.table issue tracker for advice."), reslen, (int)limit);
+    if (limit<0)
+      error(_("clamp must be positive")); // # nocov
+    if (reslen>limit)
+      error(_("Join results in %d rows; more than %d = nrow(x)+nrow(i). Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and data.table issue tracker for advice."), reslen, (int)limit);
   }
   SEXP ans = PROTECT(allocVector(INTSXP, reslen));
   int *ians = INTEGER(ans);
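
The `clamp` check in the `vecseq` hunks above guards the allocation that follows it. A rough standalone sketch of that idea, without the R API, is given below. It assumes that the result expands each start `x[i]` into `len[i]` consecutive integers (the expansion loop itself is not shown in this diff), it signals every failure by returning `NULL` instead of raising an R error, and the name `vecseq_sketch` and its signature are hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdlib.h>

/* Expand (x[i], len[i]) pairs into x[i], x[i]+1, ..., x[i]+len[i]-1.
   clamp may be NULL (no limit). Returns a malloc'ed array that the caller
   frees, or NULL when the total length is invalid or exceeds *clamp. */
int *vecseq_sketch(const int *x, const int *len, int n,
                   const double *clamp, int *reslen) {
  assert(n >= 0 && reslen != NULL);
  long long total = 0;
  for (int i = 0; i < n; i++) {
    if (len[i] < 0) return NULL;          /* negative lengths are rejected */
    total += len[i];
    if (total > INT_MAX) return NULL;     /* result must fit in an int */
  }
  /* Check the limit before allocating, as the diff does. */
  if (clamp != NULL && (*clamp < 0 || (double)total > *clamp)) return NULL;
  int *ans = malloc((total > 0 ? (size_t)total : 1) * sizeof *ans);
  if (ans == NULL) return NULL;
  int k = 0;
  for (int i = 0; i < n; i++)
    for (int j = 0; j < len[i]; j++)
      ans[k++] = x[i] + j;
  *reslen = (int)total;
  return ans;
}
```

With starts `{1, 5}` and lengths `{2, 3}` the sketch yields `1, 2, 5, 6, 7`; passing a clamp of `4.0` for the same input returns `NULL` because five rows exceed the limit.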
