-
Notifications
You must be signed in to change notification settings - Fork 1k
Fix index printing by adding index info to header (#6806) #6816
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
aitap
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The guide that you and your colleagues are following has some advice that is counter-productive. Where does it come from? Perhaps it should be changed?
Putting the issue number that you would like to fix into the pull request title doesn't help: GitHub ignores it. Instead you need to put the number into the text (i.e. the comment) of the pull request. Then GitHub will link the two together.
Let's try to stick to the code formatting style used by this project: when closing an if block and following it with an else block, put both braces on the same line as the word else:
if (whatever) {
# some code
} else { # <-- like here
# some more code
}It's usually best to minimise the changes you're introducing to a code base. This is both easier to review and has less chance of introducing bugs. Since R is mostly ambivalent about spaces, let's not introduce extra space at the end of some but not all of the lines. On the Files tab of this pull request, you can see "lint-r" complaining about extra spaces. Could you please remove them?
But the main problem is the suggested solution. print(x) shouldn't have to change variables above it in the function call stack. The main issue here is that abbs is constructed from classes1(x) without taking into account that cbind(x[...,], index_dt) may be printed instead of just a subset of x. There must be a simpler way of either (1) extracting the classes from both data.tables or (2) padding abbs with a string for each index.
R/print.data.table.R
Outdated
| setnames(index_dt, print_names) | ||
| indices <- names(attr(x, "index", exact = TRUE)) | ||
| if (length(indices)) { | ||
| cleaned_indices <- gsub("^__|_", ", ", indices) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
gsub with _ as a complete subexpression will match _ anywhere, including in the middle of the column name. Why does the code match it here?
R/print.data.table.R
Outdated
| indices <- names(attr(x, "index", exact = TRUE)) | ||
| if (length(indices)) { | ||
| cleaned_indices <- gsub("^__|_", ", ", indices) | ||
| cleaned_indices <- sub(", $", "", cleaned_indices) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's probably a way to fix the code above so that it doesn't create a separator at the end of the string in the first place.
R/print.data.table.R
Outdated
| if (exists("header", envir = parent.frame(), inherits = FALSE)) { | ||
| # Match data.table's existing header handling | ||
| assign("header", c(get("header", envir = parent.frame()), header), | ||
| envir = parent.frame()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is this code intended to do? Where is the header variable used? Use of parent.frame() in this context is almost certainly a mistake because parent.frame() is the environment of the function that calls print(x). Why is the exists() check needed? Why assign()? In R, if doesn't create a new lexical scope (unlike a function call), so your changes to the variables will be retained outside the block.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, @aitap, for your valuable feedback.
I now understand the issues you pointed out and will try to make the necessary changes accordingly. Initially, I declared header because an error occurred when an index existed, and since the index is metadata, I created header to store it. My intent was to update header dynamically when printing a data.table, ensuring index information is included.
However after you tell that if statements do not create a new lexical scope , I came to know about it. This led to header being modified in the parent environment instead of being managed locally within print.data.table().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a place in the code where the header variable is used after being created by this block?
|
Hi @aitap However, after implementing these modifications, three tests from tests.Rraw are failing because the expected output differs from the observed output. Specifically, in these tests, the expected output does not include an extra column for index_dt. The failing tests are as follows: Could you please clarify this. |
|
Hi @Mukulyadav2004
Could you please push your changes to this pull request? It's very hard
to diagnose code only knowing that it produces unwanted output but
without seeing the changes to the output code.
The expected output in tests 1775.2, 1775.3, 1775.4 does not contain
the index columns because show.indices= is not set, only print.keys= is.
|
aitap
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for making the improvements! You are approaching a good solution.
R/print.data.table.R
Outdated
| if (exists("header", envir = parent.frame(), inherits = FALSE)) { | ||
| # Match data.table's existing header handling | ||
| assign("header", c(get("header", envir = parent.frame()), header), | ||
| envir = parent.frame()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a place in the code where the header variable is used after being created by this block?
R/print.data.table.R
Outdated
| classes = classes1(x) | ||
| col_names <- colnames(toprint) | ||
| classes <- sapply(col_names, function(col_name) { | ||
| if (grepl("^index:", col_name)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This check is not very reliable. A user can create a column whose name starts with index:: data.table(`index:foo`='bar'). Moreover, grepl('^static-string') can usually be replaced by startsWith(), which avoids the cost of having to compile a regular expression.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your feedback. I have reviewed the code, and currently, the header variable is being assigned within this block. However, I do not see any explicit usage of header after its creation. I believe the header might be printed due to the following code snippet:
if (print.keys) {
if (!is.null(ky <- key(x)))
catf("Key: <%s>\n", toString(ky))
if (!is.null(ixs <- indices(x)))
cat(sprintf(
ngettext(length(ixs), "Index: %s\n", "Indices: %s\n"),
paste0("<", ixs, ">", collapse = ", ")
))
}
However, I am not entirely certain, and I would appreciate any clarification if I have misunderstood
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me rephrase that. When you wrote the code that introduced the header variable, where did you intend it to be used? I'm asking this question because there is no other use of this variable in R/print.data.table.R, and no uses of variables named header in other *.R files are applicable.
If you are using a machine learning model to write the code or the comments for you, please stop doing that, because it's doing us both a disservice. Relying on machine output deprives you from learning opportunities and prevents you from accumulating skill. (If machine learning models start writing good code by themselves, what use will there be for human programmers like you and I?) Having to review code that superficially looks like it would work but ultimately proves useless wastes maintainer time and morale.
R/print.data.table.R
Outdated
| } | ||
| }) | ||
| abbs = unname(class_abb[classes]) | ||
| abbs[classes == "index"] <- "<index>" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if an R package provides an S3 class called "index"? There's more than 20000 CRAN packages, plus some more Bioconductor packages, plus a lot of packages only published on GitHub or even not published at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To avoid such conflicts, I will ensure that only actual index columns are modified by explicitly tracking them using indices(x). This way, we only modify columns that are confirmed to be indices, rather than relying on a generic "index" class check.
R/print.data.table.R
Outdated
| expression = "<expr>", ordered = "<ord>") | ||
| classes = classes1(x) | ||
| col_names <- colnames(toprint) | ||
| classes <- sapply(col_names, function(col_name) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of overwriting classes by re-creating it anew, can you append the index markers to the variable classes created two lines above?
R/print.data.table.R
Outdated
| abbs = unname(class_abb[classes]) | ||
| abbs[classes == "index"] <- "<index>" | ||
| if ( length(idx <- which(is.na(abbs))) ) abbs[idx] = paste0("<", classes[idx], ">") | ||
| stopifnot(length(abbs) == ncol(toprint)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Defensive programming is good, but it's better to add a test to inst/tests/tests.Rraw to verify that printing a data.table with indices does not raise any warnings due to invalid rbind().
|
Hi @aitap, Expected Output: Observed Output: All failures appear to be of the same nature—the index columns are missing in the observed output. Could you please provide guidance on what might be causing |
inst/tests/tests.Rraw
Outdated
| test(2264.8, print(DT, show.indices=TRUE), output=ans) | ||
| # printing does not fail when indices are present | ||
| test(2264.9, { | ||
| suppressWarnings( print(DT, show.indices=TRUE) ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of suppressing warnings, test that they don't occur. Try capturing the output from a print(DT, show.indices=TRUE) with the warning fixed and call test(2264.9, print(DT, show.indices=TRUE), output = c(...example output goes here...)).
|
The current patch both removes the code that creates |
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #6816 +/- ##
=======================================
Coverage 98.60% 98.60%
=======================================
Files 79 79
Lines 14661 14661
=======================================
Hits 14457 14457
Misses 204 204 ☔ View full report in Codecov by Sentry. |
|
Hi @aitap, I have incorporated all the changes as per your suggestions, and this time, all tests from inst/tests/tests.Rraw have passed successfully. However, the following checks have failed: Could you please advise on how to proceed with these. |
|
|
Hi @MichaelChirico |
aitap
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better, but the changes to unrelated code need to be reverted, and it's better to reuse the classes1 logic instead of reimplementing it.
R/print.data.table.R
Outdated
| factor = "<fctr>", POSIXct = "<POSc>", logical = "<lgcl>", | ||
| IDate = "<IDat>", integer64 = "<i64>", raw = "<raw>", | ||
| expression = "<expr>", ordered = "<ord>") | ||
| classes = classes1(x) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's an idea. The code uses classes1(x) to obtain the class names of the column. It's already mostly the right answer, and it handles things like non-length-1 class vectors. The problem here is that toprint, which is what will get printed, may be produced from not just x, but cbind(x, indices(x)), and so may have a different number of columns.
Instead of manually reimplementing classes1 below, it should be enough to either append <index> to classes (length(indices(x)) times) if show.indices is true, or capture classes1(toprint) before it is formatted.
inst/tests/tests.Rraw
Outdated
| # Test for covering classes[col_name] <- "index" | ||
| DT <- data.table(A = 1:3, B = 4:6) | ||
| setindex(DT, A) | ||
| test(2306.1, {print(DT); capture.output(print(DT))}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of using capture.output(...) in tests, put the expected output into the output= argument of the test() function. This will automatically skip the output test when running with translation enabled instead of failing it. You also don't have to manually call print() when using the output= argument; the test() function calls print() for you when you ask it to test the output. There are other useful arguments described in help(test, data.table).
R/print.data.table.R
Outdated
| cls <- class(x[[col_name]]) | ||
| if (is.list(cls)) cls <- unlist(cls) | ||
| if (length(cls) == 0) cls <- "unknown" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think R will allow making the class vector a list. Also, class() will return the name of the primitive type of the value if the class attribute is set empty or removed altogether, so both tests here are impossible.
R/print.data.table.R
Outdated
| if (length(cls) == 0) cls <- "unknown" | ||
| classes[col_name] <- cls[1] | ||
| } else { | ||
| classes[col_name] <- "unknown" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is likely to be impossible due to the way toprint is constructed from x.
|
Hi @aitap, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Almost there. Here's a test:
library(data.table)
DT <- data.table(a = 1:2, b = 2:1)
setindex(DT, b)
data.table:::test(
0,
{ print(DT); TRUE },
options = c(datatable.show.indices = TRUE)
)Currently it fails with "produced 1 warnings but expected 0". In order to get it to pass, you need to produce the correct classes variable. Currently, you add the indices to the column headers twice. Once when you call classes <- classes1(toprint):
debug: classes <- classes1(toprint)
# ...
Browse[1]> print(classes)
[1] "integer" "integer" "integer"
...and for the second time when you append rep("<index>", length(indices(x))):
debug: classes <- c(classes, rep("<index>", length(indices(x))))
# <...>
Browse[1]> print(classes)
[1] "integer" "integer" "integer" "<index>"
Use one technique but not the other.
R/print.data.table.R
Outdated
| class(toprint) <- c("data.table", "data.frame") | ||
| } | ||
| classes <- classes1(toprint) | ||
| col_names <- colnames(toprint) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this variable is not needed now.
R/print.data.table.R
Outdated
| } | ||
| require_bit64_if_needed(x) | ||
| if (!inherits(toprint, "data.frame")) { | ||
| class(toprint) <- c("data.table", "data.frame") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please explain why this is needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your review and feedback.
The reason for its need is that, when running the tests without this patch, I encountered the following error:
Running test id 2222 Test 2222 produced 1 errors but expected 0
Expected:
Observed: data.table inherits from data.frame (from v1.5), but this data.table does not. Has it been created manually (e.g. by using 'structure' rather than 'data.table') or saved to disk using a prior version of data.table?
Output captured before unexpected warning/error/message:
<simpleError in dimnames.data.table(x): data.table inherits from data.frame (from v1.5), but this data.table does not. Has it been created manually (e.g. by using 'structure' rather than 'data.table') or saved to disk using a prior version of data.table?>
To address this, I added a check to ensure that toprint correctly inherits from data.table and, consequently, data.frame. This prevents potential issues where toprint might lose its expected class during processing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This error is signalled by colnames(toprint). Since the col_names variables isn't used anywhere, it's better to remove both it and the inherits(toprint, ...) check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you very much for your persistence. The two failing CI checks are currently expected. The test fails on master (checked manually) but passes here.
Produce the class header from `toprint` instead of just `x`. Fixes: Rdatatable#6806 Co-authored-by: Ivan K <krylov.r00t@gmail.com>
Produce the class header from `toprint` instead of just `x`. Fixes: Rdatatable#6806 Co-authored-by: Ivan K <krylov.r00t@gmail.com>
Problem:
Currently, when options(datatable.show.indices = TRUE), print.data.table() tries to add index info to toprint. However, toprint may have a different number of columns, leading to the error:
Error in rbind(abbs, toprint) : number of columns of result is not a multiple of vector length (arg 1)Fix:
Instead of modifying toprint, this PR adds the index information directly to header, ensuring a cleaner and safer display.
Changes:
Extracts index names from the "index" attribute.
Formats them (removes __ and replaces _ with , ).
Creates a "Indices: ..." header string.
Appends this to header.
File changed: print.data.table R