Skip to content

Update Documentation for env Parameter Usage* #6360

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 13 commits into
base: master
Choose a base branch
from
Open

Conversation

Nj221102
Copy link
Contributor

@Nj221102 Nj221102 commented Aug 6, 2024

closes #6323

Description:

This PR updates the documentation for the env parameter in the data.table package to include new functionality regarding function injection. Initially, the documentation update was focused solely on using function names as strings (e.g., "sum"), as requested in issue #6323. However, following the merge of PR #6026 which allows the direct use of function objects (e.g., sum). This PR updates the documentation to reflect both approaches.

Changes:

  1. Documentation Update:

    • The documentation now includes both methods for injecting functions using the env parameter:
      • Function Names as Strings: Using function names as strings (e.g., "sum").
      • Function Objects Directly: Passing function objects directly (e.g., sum).
  2. Example Block Addition:

    • Added examples demonstrating both approaches for function injection.

@Nj221102 Nj221102 requested review from tdhock and Anirban166 August 6, 2024 19:02
@TysonStanley
Copy link
Member

LGTM

@tdhock
Copy link
Member

tdhock commented Aug 9, 2024

I removed the DT definition since there are already other DT definitions that we can use for this example.
I noticed this would add the first example use of env arg to ?data.table man page.
@jangorecki is there some reason why there were no env examples before?

@MichaelChirico
Copy link
Member

MichaelChirico commented Aug 9, 2024

Thanks for the PR! Actually, for addressing #6323 I had in mind extending the programming vignette, rather than ?data.table.

Although as Toby points out, it looks like this is the first example using env= on that page, so it's good to keep the new example.

@Nj221102
Copy link
Contributor Author

Nj221102 commented Aug 9, 2024

Thanks for the PR! Actually, for addressing #6323 I had in mind extending the programming vignette, rather than ?data.table.

Although as Toby points out, it looks like this is the first example using env= on that page, so it's good to keep the new example.

sure, i m updating the programming vignette now.

@jangorecki
Copy link
Member

jangorecki commented Aug 9, 2024

Direct usage of sum rather than "sum" has to be avoided. Please investigate yourself. Take some big R function, and use it that way, verbose should display what's wrong. substitute2 may be even easier to show the problem.

I can imagine it could even potentially fail on trying to build big expression in j.

The only reason to use function name as a symbol rather than string is when we want to inject function body into expression. I see no reasonable use case for that.

Definitly worth to add some examples to manual as well. Now it was only redirecting to vignette and substitute2 manual.

@tdhock
Copy link
Member

tdhock commented Aug 9, 2024

by the way Jan it is really super convenient how you can substitute names via substitute2 / env, thanks for implementing that.

@tdhock
Copy link
Member

tdhock commented Aug 9, 2024

@Nj221102 please also add discussion/example to programming vignette, maybe under ### `get` ?

@jangorecki
Copy link
Member

jangorecki commented Aug 9, 2024

by the way Jan it is really super convenient how you can substitute names via substitute2 / env, thanks for implementing that.

Yes that was the "new" thing vs. substitute or tidy way of meta programming. Afair that was the only part that required to be coded in C.
https://github.com/Rdatatable/data.table/blob/d00f292596d82fd21aac59edc34a9f8b388450f4/src/programming.c

Copy link

codecov bot commented Jul 1, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Please upload report for BASE (master@1b3998e). Learn more about missing BASE report.
Report is 9 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff            @@
##             master    #6360   +/-   ##
=========================================
  Coverage          ?   98.34%           
=========================================
  Files             ?       81           
  Lines             ?    14927           
  Branches          ?        0           
=========================================
  Hits              ?    14680           
  Misses            ?      247           
  Partials          ?        0           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@MichaelChirico
Copy link
Member

Direct usage of sum rather than "sum" has to be avoided. Please investigate yourself. Take some big R function, and use it that way, verbose should display what's wrong. substitute2 may be even easier to show the problem.

I can imagine it could even potentially fail on trying to build big expression in j.

The only reason to use function name as a symbol rather than string is when we want to inject function body into expression. I see no reasonable use case for that.

Definitly worth to add some examples to manual as well. Now it was only redirecting to vignette and substitute2 manual.

Elaborated in the original issue (#6323) on why I continue to advocate for direct usage of sum.

For now, I've opted to just put a minimal mention of that to resolve this issue. Beyond that it's something we can continue to take into consideration as we observe user reports/our own usage.

@MichaelChirico MichaelChirico requested a review from tdhock July 1, 2025 06:41
@jangorecki
Copy link
Member

I think there was some misconception at the start of this PR.

Initially, the documentation update was focused solely on using function names as strings (e.g., "sum"), as requested in issue #6323. However, following the merge of PR #6026 which allows the direct use of function objects (e.g., sum).

#6026 says

Note that it's much better to pass functions like f="sum" instead.

For a good reason, because 99.5% of the time we will not want to substitute data.table expression using body of the function.

sum will work but it won't be calling sum anymore, it will be pasting in body of the sum function and evaluating that body.

Therefore I don't think it is reasonable to document that as a way to do it.

I will submit an edit to this PR in a moment which should make everything more clear.

@jangorecki
Copy link
Member

jangorecki commented Jul 1, 2025

Mention of do.call is very on point here

f = function() sys.call()
do.call("f",list())
#f()
do.call(f,list())
#(function() sys.call())()

and imagine your f function is something big... [.data.table has 1850 lines of code

@jangorecki
Copy link
Member

As written now, I hope it will prevent any misconception on that matter in future.
I moved it to new section as it didn't fit well to the one used previously.
Please review.


If function name to be substituted needs to be namespace-qualified then namespace and function name can be substituted as any other symbol in the expression:
```{r substitute_fun3, result='hide'}
DT[, ns::fun(Sepal.Length), env = list(ns="base", fun="sqrt"), verbose=TRUE]
Copy link
Member

@MichaelChirico MichaelChirico Jul 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this one in particular feels quite unnatural to me -- almost never do we want to separate the namespace from its export. I can't think of any use case for that. and it gets us even further from one of my main points about do.call() -- using symbols makes the dependency structure of the code much clearer.

let's drop this example. I think the above section is a nice addition -- we can revisit the point about namespace-qualified inputs as we gather more organic use cases in the wild.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It can as well be base::fun(col), env=list(fun="sqrt")
And also nsfun(col), env=call("::", as.name("base"), as.name("sqrt"))

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or even parse(text="base::sqrt") which I would prefer not to mention in our docs

Copy link
Member

@jangorecki jangorecki Jul 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

using symbols makes the dependency structure of the code much clearer.

The problem is that it changes the code.

Symbols should be used, and should be preferred but then why to use them in env? They don't need substation

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nsfun(col), env=call("::", as.name("base"), as.name("sqrt"))
Or even parse(text="base::sqrt") which I would prefer not to mention in our docs

these days str2lang("base::sqrt") would be better and is sort of "accepted"; anyway, now I think you're hitting on something more interesting: fun(col); env = list(fun = quote(base::sqrt)) is pretty natural and I think solves all of my concerns.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a situation when an i/j/by-expression would need to be evaluated in a completely different context from where it had been written? Would a user normally want to access a function or a piece of data from an environment that's neither the data.table itself nor the frame of the caller of [? Similar to what groupingsets does, hiding its own frame from [ but making the caller's frame available, except with the query itself, not the overall call to [?

Copy link
Member

@jangorecki jangorecki Jul 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DT[, fun(col), env = list(fun = quote(base::sqrt))]

It is then more correct to write

DT[, base::sqrt(col)]

As there is nothing to substitute. No advantage of using env.

Copy link
Member

@jangorecki jangorecki Jul 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aitap don't really know now without a nice example showing the issue, env is force evaluated, i/j/by after substation should in theory behave as written by hand, so scoping should not make any difference than [ written by hand

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is then more correct to write

Right, in the simple case... I am still fairly certain this is needed for more generality. I will let this percolate for some time. In the meantime, let's just drop the ugly ns::fun example which I don't think is useful & we can merge this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Document well how to flexibly pass functions in env=
6 participants