-
Notifications
You must be signed in to change notification settings - Fork 1k
Revise ?mergelist Details re. join-from/join-to #7190 #7202
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #7202 +/- ##
==========================================
- Coverage 98.77% 98.77% -0.01%
==========================================
Files 81 81
Lines 15223 15215 -8
==========================================
- Hits 15037 15029 -8
Misses 186 186 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Quick explanation:
No mention of column or row order at all, per Jan's comments. |
man/mergelist.Rd
Outdated
\item{ \code{how == "right"}: \emph{join-to} is \emph{LHS}, \emph{join-from} is \emph{RHS}. } | ||
\item{ \code{how \%in\% c("left", "semi", "anti")}: \emph{join-from} is \emph{LHS}, \emph{join-to} is \emph{RHS}. } | ||
\item{ \code{how == "right"}: \emph{join-from} is \emph{RHS}, \emph{join-to} is \emph{LHS}. } | ||
\item{ \code{how \%in\% c("inner", "full")}: \emph{LHS} and \emph{RHS} are treated equally, so that each is both \emph{join-from} and \emph{join-to}. } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
\item{ \code{how \%in\% c("inner", "full")}: \emph{LHS} and \emph{RHS} are treated equally, so that each is both \emph{join-from} and \emph{join-to}. } | |
\item{ \code{how \%in\% c("inner", "full")}: \emph{LHS} and \emph{RHS} are treated equally, so that the \emph{join-from}/\emph{join-to} designation is not so instructive. } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MichaelChirico apologies - I did not want to commit this but to discuss. (I'm gitHub-inept and this is my first PR). Have changed it back in a new commit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MichaelChirico The instructive bit is supposed to be that we have a blended/mutual case that x[i]
doesn't, but evidently that isn't coming across. Please see what you think of the latest - trying to consolidate/nail the definition in one place.
man/mergelist.Rd
Outdated
\item{ When \code{how \%in\% c("inner", full")}, if only one table has a key, then this key is used; if both tables have keys, then \code{on = intersect(key(lhs), key(rhs))}, having its order aligned to shorter key. } | ||
Symmetrical \emph{join-from}/\emph{join-to} treatment of \emph{LHS} and \emph{RHS} when \code{how \%in\% c("inner", "full")} is as follows: | ||
\itemize{ | ||
\item{ When \code{mult \%in\% c("first", "last", "error")}, then at distinct each value of the join column(s) the rows joined are respectively first-to-first, last-to-last, and only-to-only (or else an error). } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sentence not computing, PTAL
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks - inserted "distinct" in the wrong place. Will address.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think \item should not have { after it since recent version of R
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct! I wonder why it's not caught by any of our R CMD check
s:
> tools::checkRd('man/mergelist.Rd')
checkRd: (-3) mergelist.Rd:33: Lost braces
33 | \item{ \code{mult} determines the policy when a row of \emph{join-from} finds multiple matches in \emph{join-to}. }
| ^
checkRd: (-3) mergelist.Rd:34: Lost braces
34 | \item{ When \code{on} is missing, the join column(s) are determined by the key of \emph{join-to}. }
| ^
checkRd: (-3) mergelist.Rd:38: Lost braces
38 | \item{ \code{how \%in\% c("left", "semi", "anti")}: \emph{join-from} is \emph{LHS}, \emph{join-to} is \emph{RHS}. }
| ^
checkRd: (-3) mergelist.Rd:39: Lost braces
39 | \item{ \code{how == "right"}: \emph{join-from} is \emph{RHS}, \emph{join-to} is \emph{LHS}. }
| ^
checkRd: (-3) mergelist.Rd:40: Lost braces
40 | \item{ \code{how \%in\% c("inner", "full")}: \emph{LHS} and \emph{RHS} are treated symmetrically, so that each is both \emph{join-from} and \emph{join-to}; see below. }
| ^
checkRd: (-3) mergelist.Rd:41: Lost braces
41 | \item{ \code{how == "cross"}: \code{mult} must be \code{"all"} and \code{on} is not used, so the terms are not relevant. }
| ^
checkRd: (-3) mergelist.Rd:46: Lost braces
46 | \item{ When \code{mult \%in\% c("first", "last", "error")}, then (respectively) the first, last, or only matching row on each side binds with the same on the other. \code{mult} is satisfied mutually and the merge is one-to-one. }
| ^
checkRd: (-3) mergelist.Rd:47: Lost braces
47 | \item{ If only one table has a key, then this key is used; if both tables have keys, then \code{on = intersect(key(lhs), key(rhs))}, having its order aligned to the shorter key. }
| ^
checkRd: (-3) mergelist.Rd:54: Lost braces
54 | \item{ When \code{how \%in\% c("left", "inner", "full", "right")}, \code{mult="error"}. }
| ^
checkRd: (-3) mergelist.Rd:55: Lost braces
55 | \item{ When \code{how \%in\% c("semi", "anti")}, \code{mult="last"}, although this is equivalent to \code{mult="first"}. }
| ^
checkRd: (-3) mergelist.Rd:56: Lost braces
56 | \item{ When \code{how == "cross"}, \code{mult="all"}. }
| ^
(Answer: this is a weak problem of severity "-3", and R CMD check
prints errors at severity "-1" and above by default.)
Co-authored-by: Michael Chirico <michaelchirico4@gmail.com>
Co-authored-by: Michael Chirico <michaelchirico4@gmail.com>
…ommitting a suggestion.
|
||
The terms \emph{join-to} and \emph{join-from} indicate which in a pair of tables is the "baseline" or "authoritative" source -- this governs the ordering of rows and columns. | ||
Heuristically speaking, the \emph{join-from} table searches the \emph{join-to} table. More precisely: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess it would help to point out to regular users of {data.table} that the familiar x[i, on=...]
join has "join-from: i, join-to: x", WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
by design it cannot be constant if we have how
arg left/right, in [ we just swap tables to achieve that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we mention it, it should be to point out the difference (@MichaelChirico I think is what you are saying). Maybe, after the "symmetry" point with how="inner"
:
Note the difference between
mergelist
withhow="inner"
andx[i]
withnomatch=NULL
, where onlyi
is join-from and onlyx
is join-to.
Potentially that could help people twig why this different terminology is needed?
Btw explaining it as two intersections could be useful in e.g. a vignette, though too long for the documentation.
on <- intersect(key(RHS), key(LHS)) # (but also align order to the shorter key)
fintersect(
RHS[LHS[, on=on, nomatch=NULL, mult="last", <lhs-cols-then-rhs-non-join-cols> ]
LHS[RHS[, on=on, nomatch=NULL, mult="last"]
)
Though as the note at the bottom says, the second line would be more efficiently done as something like:
RHS[LHS[, last(.SD), by=on], on=on, nomatch=NULL, mult="last", <lhs-cols-then-rhs-non-join-cols> ]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In inner join it is not relevant which is x or i
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes exactly - that's the intended point
closes #7190 @MichaelChirico @jangorecki