Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

by loops that should be possible fail because dogroups grows beyond 2**31 #5613

Open
Kodiologist opened this issue Mar 14, 2023 · 5 comments
Labels
longvec Long Vector support

Comments

@Kodiologist
Copy link

> data.table(i = 1:100)[, verbose = T, by = i, {message(".GRP: ", .GRP); list(v = logical(1e7))}]
Detected that j uses these columns: <none> 
Finding groups using forderv ... forder.c received 100 rows and 1 columns
0.000s elapsed (0.000s cpu) 
Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu) 
lapply optimization is on, j unchanged as '{'
Old mean optimization is on, left j unchanged.
Making each group and running j (GForce FALSE) ... .GRP: 1
.GRP: 2
dogroups: growing from 10000000 to -2147483648 rows
Error in `[.data.table`(data.table(i = 1:100), , verbose = T, by = i,  : 
  negative length vectors are not allowed

I know that data.table still doesn't support long vectors (#3957), but this operation should only require 100 * 1e7 = 1e9 rows, which is well under 2**31 - 1 ≈ 2.1e9. And if the user really does ask for too many rows, the error message can surely be improved.

Output of sessionInfo()

R version 4.2.1 (2022-06-23)                                                    
Platform: x86_64-pc-linux-gnu (64-bit)                                          
Running under: Ubuntu 20.04.4 LTS                                               
                                                                                
Matrix products: default                                                        
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0                         
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0                            
                                                                                       
locale:                                                                                       
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C                                                           
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8                                                           
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8                                                          
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                                                                        
 [9] LC_ADDRESS=C               LC_TELEPHONE=C                                                                   
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C                                                              

attached base packages:                                 
[1] stats     graphics  grDevices utils     datasets  methods   base                                             

other attached packages:                                
[1] data.table_1.14.8                                   

loaded via a namespace (and not attached):                                                                       
[1] compiler_4.2.1
@ben-schwen
Copy link
Member

This is triggered by the internal estimate of rows of the result, which exceeds the 2**31-1 in your example. Subsequently, all column vectors are grown to that estimated length.

Apart from supporting long vectors which would include basically touching all C-files, we could try to improve that estimate

@Kodiologist
Copy link
Author

It might be easier and more reliable to just cap the estimate at 2**31 - 1.

@ben-schwen
Copy link
Member

Just capping the estimate at 2**31-1 unfortunately just kills the R process (if the cap is exceeded) which doesn't seem like an improvement to me

@Kodiologist
Copy link
Author

But the cap wouldn't be exceeded here, right? So wouldn't it make this code work? Better error messages for trying to make a data table that has too many rows would also be helpful, clearly, but that's supplementary.

@ben-schwen
Copy link
Member

In your example, it definitely works and gives the right output.

But slightly changing your example to data.table(i = 1:220)[, verbose = T, by = i, {message(".GRP: ", .GRP); list(v = logical(1e7))}] kills the R process at group 147

@jangorecki jangorecki added the longvec Long Vector support label Jan 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
longvec Long Vector support
Projects
None yet
Development

No branches or pull requests

3 participants