Skip to content

Revise ?mergelist Details re. join-from/join-to #7190 #7202

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

trobx
Copy link

@trobx trobx commented Jul 22, 2025

@trobx trobx requested a review from MichaelChirico as a code owner July 22, 2025 19:52
Copy link

codecov bot commented Jul 22, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 98.77%. Comparing base (c8bbb58) to head (5e249b9).
Report is 7 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7202      +/-   ##
==========================================
- Coverage   98.77%   98.77%   -0.01%     
==========================================
  Files          81       81              
  Lines       15223    15215       -8     
==========================================
- Hits        15037    15029       -8     
  Misses        186      186              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@trobx
Copy link
Author

trobx commented Jul 22, 2025

Quick explanation:

  • quick heuristic metaphor about join-from searching join-to, but define by substance (relevance to mult and on)
  • use join-from first in all phrasing for clarity, e.g. join-from finds matches in join-to
  • repurpose the bit about on (since now partially already covered) to be about the symmetric treatment in inner/full (new bullet on "mutual mult")

No mention of column or row order at all, per Jan's comments.

man/mergelist.Rd Outdated
\item{ \code{how == "right"}: \emph{join-to} is \emph{LHS}, \emph{join-from} is \emph{RHS}. }
\item{ \code{how \%in\% c("left", "semi", "anti")}: \emph{join-from} is \emph{LHS}, \emph{join-to} is \emph{RHS}. }
\item{ \code{how == "right"}: \emph{join-from} is \emph{RHS}, \emph{join-to} is \emph{LHS}. }
\item{ \code{how \%in\% c("inner", "full")}: \emph{LHS} and \emph{RHS} are treated equally, so that each is both \emph{join-from} and \emph{join-to}. }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
\item{ \code{how \%in\% c("inner", "full")}: \emph{LHS} and \emph{RHS} are treated equally, so that each is both \emph{join-from} and \emph{join-to}. }
\item{ \code{how \%in\% c("inner", "full")}: \emph{LHS} and \emph{RHS} are treated equally, so that the \emph{join-from}/\emph{join-to} designation is not so instructive. }

Copy link
Author

@trobx trobx Jul 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MichaelChirico apologies - I did not want to commit this but to discuss. (I'm gitHub-inept and this is my first PR). Have changed it back in a new commit.

Copy link
Author

@trobx trobx Jul 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MichaelChirico The instructive bit is supposed to be that we have a blended/mutual case that x[i] doesn't, but evidently that isn't coming across. Please see what you think of the latest - trying to consolidate/nail the definition in one place.

man/mergelist.Rd Outdated
\item{ When \code{how \%in\% c("inner", full")}, if only one table has a key, then this key is used; if both tables have keys, then \code{on = intersect(key(lhs), key(rhs))}, having its order aligned to shorter key. }
Symmetrical \emph{join-from}/\emph{join-to} treatment of \emph{LHS} and \emph{RHS} when \code{how \%in\% c("inner", "full")} is as follows:
\itemize{
\item{ When \code{mult \%in\% c("first", "last", "error")}, then at distinct each value of the join column(s) the rows joined are respectively first-to-first, last-to-last, and only-to-only (or else an error). }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sentence not computing, PTAL

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks - inserted "distinct" in the wrong place. Will address.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think \item should not have { after it since recent version of R

Copy link
Contributor

@aitap aitap Jul 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct! I wonder why it's not caught by any of our R CMD checks:

> tools::checkRd('man/mergelist.Rd')
checkRd: (-3) mergelist.Rd:33: Lost braces
    33 |     \item{ \code{mult} determines the policy when a row of \emph{join-from} finds multiple matches in \emph{join-to}. }
       |          ^
checkRd: (-3) mergelist.Rd:34: Lost braces
    34 |     \item{ When \code{on} is missing, the join column(s) are determined by the key of \emph{join-to}. }
       |          ^
checkRd: (-3) mergelist.Rd:38: Lost braces
    38 |     \item{ \code{how \%in\% c("left", "semi", "anti")}: \emph{join-from} is \emph{LHS}, \emph{join-to} is \emph{RHS}. }
       |          ^
checkRd: (-3) mergelist.Rd:39: Lost braces
    39 |     \item{ \code{how == "right"}: \emph{join-from} is \emph{RHS}, \emph{join-to} is \emph{LHS}. }
       |          ^
checkRd: (-3) mergelist.Rd:40: Lost braces
    40 |     \item{ \code{how \%in\% c("inner", "full")}: \emph{LHS} and \emph{RHS} are treated symmetrically, so that each is both \emph{join-from} and \emph{join-to}; see below. }
       |          ^
checkRd: (-3) mergelist.Rd:41: Lost braces
    41 |     \item{ \code{how == "cross"}: \code{mult} must be \code{"all"} and \code{on} is not used, so the terms are not relevant. }
       |          ^
checkRd: (-3) mergelist.Rd:46: Lost braces
    46 |     \item{ When \code{mult \%in\% c("first", "last", "error")}, then (respectively) the first, last, or only matching row on each side binds with the same on the other. \code{mult} is satisfied mutually and the merge is one-to-one. }
       |          ^
checkRd: (-3) mergelist.Rd:47: Lost braces
    47 |     \item{ If only one table has a key, then this key is used; if both tables have keys, then \code{on = intersect(key(lhs), key(rhs))}, having its order aligned to the shorter key. }
       |          ^
checkRd: (-3) mergelist.Rd:54: Lost braces
    54 |     \item{ When \code{how \%in\% c("left", "inner", "full", "right")}, \code{mult="error"}. }
       |          ^
checkRd: (-3) mergelist.Rd:55: Lost braces
    55 |     \item{ When \code{how \%in\% c("semi", "anti")}, \code{mult="last"}, although this is equivalent to \code{mult="first"}. }
       |          ^
checkRd: (-3) mergelist.Rd:56: Lost braces
    56 |     \item{ When \code{how == "cross"}, \code{mult="all"}. }
       |          ^

(Answer: this is a weak problem of severity "-3", and R CMD check prints errors at severity "-1" and above by default.)


The terms \emph{join-to} and \emph{join-from} indicate which in a pair of tables is the "baseline" or "authoritative" source -- this governs the ordering of rows and columns.
Heuristically speaking, the \emph{join-from} table searches the \emph{join-to} table. More precisely:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it would help to point out to regular users of {data.table} that the familiar x[i, on=...] join has "join-from: i, join-to: x", WDYT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

by design it cannot be constant if we have how arg left/right, in [ we just swap tables to achieve that

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we mention it, it should be to point out the difference (@MichaelChirico I think is what you are saying). Maybe, after the "symmetry" point with how="inner":

Note the difference between mergelist with how="inner" and x[i] with nomatch=NULL, where only i is join-from and only x is join-to.

Potentially that could help people twig why this different terminology is needed?

Btw explaining it as two intersections could be useful in e.g. a vignette, though too long for the documentation.

on <- intersect(key(RHS), key(LHS)) # (but also align order to the shorter key)
fintersect(
  RHS[LHS[, on=on, nomatch=NULL, mult="last", <lhs-cols-then-rhs-non-join-cols> ]
  LHS[RHS[, on=on, nomatch=NULL, mult="last"]
)

Though as the note at the bottom says, the second line would be more efficiently done as something like:

RHS[LHS[, last(.SD), by=on], on=on, nomatch=NULL, mult="last", <lhs-cols-then-rhs-non-join-cols> ]

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In inner join it is not relevant which is x or i

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes exactly - that's the intended point

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Revise explanation of 'join-to'/'join-from' in mergelist documentation
4 participants