fread: memory not being freed #3292

Closed
patrickhowerter opened this issue Jan 17, 2019 · 4 comments · Fixed by #4710

@patrickhowerter

I have found that reading data into a data.table appears to cause a memory leak. This happens whether I use fread or read_fst (from the fst package). I suspect this could even be an R or OS issue; however, I have not found any reports of it on Stack Overflow.

Steps to reproduce:

  1. Download the sample file:
download.file("https://mdodemodiag770.blob.core.windows.net/testcontainer/test.csv", "test.csv")
  2. Create an R script containing the following:
tst <- data.table::fread("test.csv");
rm(tst);
gc()
  3. Execute the R script and check the results on your OS. I am using Ubuntu 18.04.1. When I run the top command, I see the following for my R process (2.6 GB of virtual memory):
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
15827 patrick+  20   0  2.625g 1.269g 0.013g S   0.0  1.2   0:12.93 rsession
 

  4. To run the script under valgrind, use the following command (change the source path to the location of your R script from step 2):

R -d "valgrind --show-leak-kinds=all" -e "source('/home/patrick.howerter/valgrind_test.R')"

  5. Here is the end of the valgrind output:

==533==
==533== HEAP SUMMARY:
==533==     in use at exit: 134,671,721 bytes in 19,047 blocks
==533==   total heap usage: 42,608 allocs, 23,561 frees, 1,113,301,337 bytes allocated
==533==
==533== LEAK SUMMARY:
==533==    definitely lost: 0 bytes in 0 blocks
==533==    indirectly lost: 0 bytes in 0 blocks
==533==      possibly lost: 4,320 bytes in 15 blocks
==533==    still reachable: 134,667,401 bytes in 19,032 blocks
==533==         suppressed: 0 bytes in 0 blocks
==533== Rerun with --leak-check=full to see details of leaked memory
==533==
==533== For counts of detected and suppressed errors, rerun with: -v
==533== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
 

sessionInfo()

R version 3.5.2 (2018-12-20)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.1 LTS

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8   
 [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] bit_1.1-14        compiler_3.5.2    tools_3.5.2       yaml_2.2.0        bit64_0.9-7       data.table_1.12.0

@patrickhowerter
Author

For comparison, I tried the readr::read_csv function. It seems to free most of the memory.

read <- readr::read_csv("/mnt/batch/tasks/shared/fileshare/mdo_data/downloads/TREstSum2000");
rm(read);
gc()

top command output:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
10512 patrick+  20   0  465.4m 139.0m  25.5m S   0.0  0.1   0:36.46 rsession

The read.csv command from base R shows the same problem as fread:

#3719842      33
read <- read.csv("/mnt/batch/tasks/shared/fileshare/mdo_data/downloads/TREstSum2000");
rm(read);
gc()

top command results:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
12169 patrick+  20   0 1879.1m 1.518g  24.6m S   0.3  1.4   1:44.99 rsession
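
The `top` figures in this comment are process-level numbers, which R's own `gc()` summary does not report. As a rough sketch, assuming a Linux system where `/proc/self/status` exists, the same VIRT and RES values could be read from inside the R session. `parse_proc_status()` and `process_memory_kb()` are hypothetical helper names, not part of R or data.table.

```r
# Parse the VmSize (VIRT) and VmRSS (RES) fields out of /proc/<pid>/status text.
# `status_lines` is a character vector, one element per line of that file.
parse_proc_status <- function(status_lines) {
  fields <- c(virt_kb = "VmSize", res_kb = "VmRSS")
  vapply(fields, function(field) {
    hit <- grep(paste0("^", field, ":"), status_lines, value = TRUE)
    if (length(hit) == 0L) return(NA_real_)
    # Lines look like "VmRSS:    139000 kB"; keep only the digits.
    as.numeric(gsub("[^0-9]", "", hit[1L]))
  }, numeric(1))
}

# Hypothetical helper: memory of the current R process in kB.
# Assumption: Linux only; returns NA on systems without /proc.
process_memory_kb <- function() {
  path <- "/proc/self/status"
  if (!file.exists(path)) return(c(virt_kb = NA_real_, res_kb = NA_real_))
  parse_proc_status(readLines(path))
}

print(process_memory_kb())
```

Calling `process_memory_kb()` before the read and again after `rm()` and `gc()` gives the same before/after comparison as watching `top`, without leaving the session.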

@roysh913

I just came across this post while tracking down a memory leak. One of the culprits is still fread.

require(data.table)

for(i in 1:10) {
  sampleDt <- fread("sample-data.csv", header = TRUE, stringsAsFactors = FALSE)
  rm(sampleDt)
  gc()
  print(pryr::mem_used())
}
# 54.3 MB
# 61.4 MB
# 68.3 MB
# 75.2 MB
# 82.2 MB
# 89.1 MB
# 96 MB
# 103 MB
# 110 MB
# 117 MB

Using read.csv shows no leak.

for(i in 1:10) {
  sampleDt <- read.csv("sample-data.csv", header = TRUE, stringsAsFactors = FALSE)
  rm(sampleDt)
  gc()
  print(pryr::mem_used())
}
# 47.7 MB
# 47.8 MB
# 47.8 MB
# 47.8 MB
# 47.8 MB
# 47.8 MB
# 47.8 MB
# 47.8 MB
# 47.8 MB
# 47.8 MB
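
The two loops above differ only in the reader function, so the comparison can be wrapped in one helper. This is a minimal sketch using base R only: it sums the "used (Mb)" column of `gc()` instead of calling `pryr::mem_used()`, on the assumption that the two track the same quantity. `r_heap_used_mb()` and `measure_growth()` are hypothetical names, and the demo uses a small temporary CSV written from `iris` rather than the files in this thread.

```r
# Sum of the "used (Mb)" column reported by gc() for Ncells and Vcells.
r_heap_used_mb <- function() sum(gc()[, 2L])

# Hypothetical helper: call `reader(path)` `n` times, dropping the result each
# time, and record R's heap usage after every iteration.
measure_growth <- function(reader, path, n = 10L) {
  used <- numeric(n)
  for (i in seq_len(n)) {
    dat <- reader(path)
    rm(dat)
    used[i] <- r_heap_used_mb()
  }
  used
}

# Self-contained demo on a small temporary CSV, using base R only.
tmp <- tempfile(fileext = ".csv")
write.csv(iris, tmp, row.names = FALSE)
growth <- measure_growth(function(p) read.csv(p, stringsAsFactors = FALSE), tmp, n = 5L)
print(round(diff(growth), 2))  # per-iteration change in Mb
unlink(tmp)
```

To compare readers, pass `data.table::fread` as `reader` with the same file. A steady positive `diff()` across iterations is the pattern reported above.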

sessionInfo()

R version 3.6.0 (2019-04-26)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17763)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.12.2

loaded via a namespace (and not attached):
[1] compiler_3.6.0   pryr_0.1.4       magrittr_1.5     tools_3.6.0      Rcpp_1.0.1       stringi_1.4.3    codetools_0.2-16 stringr_1.4.0

@MichaelChirico
Member

Confirmed; this reproduces on Mac:

library(data.table)
tmp = tempfile()
iris = as.data.table(iris)
DT = rbindlist(replicate(1e5, iris, simplify = FALSE))
fwrite(DT, tmp)

for(i in 1:10) {
  sampleDt <- fread(tmp, header = TRUE, stringsAsFactors = FALSE)
  rm(sampleDt)
  gc()
  print(pryr::mem_used())
}
736 MB
875 MB
1.01 GB
1.15 GB
1.29 GB
1.43 GB
1.57 GB
1.71 GB
1.85 GB
1.98 GB

@realalexgalenko

realalexgalenko commented Feb 1, 2020

My six-hour memory leak search ends here. Sadly, I see the same issue with fread, running data.table 1.12.8 on macOS.

My memory usage progression looks like this:

33 MB
283 MB
533 MB
783 MB
1.03 GB
1.28 GB
1.53 GB
1.78 GB
2.03 GB
2.28 GB

jimhester added a commit to jimhester/data.table that referenced this issue Sep 18, 2020
ColeMiller1 added this to the 1.14.1 milestone Feb 25, 2021
jangorecki modified the milestones: 1.14.3, 1.14.5 Jul 19, 2022
mattdowle modified the milestones: 1.14.7, 1.14.5 Nov 15, 2022
mattdowle added the bug label Nov 15, 2022
mattdowle modified the milestones: 1.14.7, 1.14.6 Nov 16, 2022
netbsd-srcmastr pushed a commit to NetBSD/pkgsrc that referenced this issue Jan 25, 2024
# data.table [v1.14.10](https://github.com/Rdatatable/data.table/milestone/20)

## NOTES

1. Maintainer of the package for CRAN releases is from now on Tyson
  Barrett (@TysonStanley),
  [#5710](Rdatatable/data.table#5710).

2. Updated internal code for breaking change of `is.atomic(NULL)` in
  R-devel,
  [#5691](Rdatatable/data.table#5691). Thanks to
  Martin Maechler for the patch.

3. Fix multiple test concerning coercion to missing complex numbers,
  [#5695](Rdatatable/data.table#5695) and
  [#5748](Rdatatable/data.table#5748). Thanks
  to @MichaelChirico and @ben-schwen for the patches.

4. Fix multiple format warnings (e.g., -Wformat)
  [#5712](Rdatatable/data.table#5712),
  [#5781](Rdatatable/data.table#5781),
  [#5880](Rdatatable/data.table#5800),
  [#5786](Rdatatable/data.table#5786). Thanks to
  @MichaelChirico and @jangorecki for the patches.



# data.table [v1.14.8](https://github.com/Rdatatable/data.table/milestone/28?closed=1)  (17 Feb 2023)

## NOTES

1. Test 1613.605 now passes changes to `as.data.frame()` in R-devel,
  [#5597](Rdatatable/data.table#5597). Thanks to
  Avraham Adler for reporting.

2. An out of bounds read when combining non-equi join with `by=.EACHI`
  has been found and fixed thanks to clang ASAN,
  [#5598](Rdatatable/data.table#5598). There
  was no bug or consequence because the read was followed (now preceded)
  by a bounds test.

3. `.rbind.data.table` (note the leading `.`) is no longer exported
  when `data.table` is installed in R>=4.0.0 (Apr 2020),
  [#5600](Rdatatable/data.table#5600). It was
  never documented which R-devel now detects and warns about. It is only
  needed by `data.table` internals to support R<4.0.0; see note 1 in
  v1.12.6 (Oct 2019) below in this file for more details.


# data.table [v1.14.6](https://github.com/Rdatatable/data.table/milestone/27?closed=1)  (16 Nov 2022)

## BUG FIXES

1. `fread()` could leak memory,
  [#3292](Rdatatable/data.table#3292). Thanks
  to @patrickhowerter for reporting, and Jim Hester for the fix. The fix
  requires R 3.4.0 or later. Loading `data.table` in earlier versions
  now highlights this issue on startup, asks users to upgrade R, and
  warns that we intend to upgrade `data.table`'s dependency from 8 year
  old R 3.1.0 (April 2014) to 5 year old R 3.4.0 (April 2017).
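
Going by the entry above, the fix needs both `data.table` 1.14.6 or later and R 3.4.0 or later. A minimal sketch of checking an installation against those two thresholds follows. `has_fread_leak_fix()` is a hypothetical helper name, and the thresholds are taken from this changelog entry only.

```r
# TRUE when both version requirements from the changelog entry are met.
# Assumption: 1.14.6 and 3.4.0 are the only thresholds that matter.
has_fread_leak_fix <- function(dt_version, r_version = getRversion()) {
  as.package_version(dt_version) >= "1.14.6" &&
    as.package_version(r_version) >= "3.4.0"
}

# Check the local installation only if data.table is actually installed.
if (requireNamespace("data.table", quietly = TRUE)) {
  print(has_fread_leak_fix(utils::packageVersion("data.table")))
}
```

For example, `has_fread_leak_fix("1.12.8")` is `FALSE` whatever the R version, which matches the reports earlier in this thread.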

## NOTES

1. Test 1962.098 has been modified to pass latest changes to `POSIXt`
  in R-devel.

2. `test.data.table()` no longer creates `DT` in `.GlobalEnv`, a CRAN
  policy violation,
  [#5514](Rdatatable/data.table#5514). No
  other writes occurred to `.GlobalEnv` and release procedures have been
  improved to prevent this happening again.

3. The memory usage of the test suite has been halved,
  [#5507](Rdatatable/data.table#5507).


# data.table [v1.14.4](https://github.com/Rdatatable/data.table/milestone/26?closed=1)  (17 Oct 2022)

## NOTES

1. gcc 12.1 (May 2022) now detects and warns about an always-false
  condition (`-Waddress`) in `fread` which caused a small efficiency
  saving never to be invoked,
  [#5476](Rdatatable/data.table#5476). Thanks to
  CRAN for testing latest versions of compilers.

2. `update.dev.pkg()` has been renamed `update_dev_pkg()` to get out
  of the way of the `stats::update` generic function,
  [#5421](Rdatatable/data.table#5421). This is a
  utility function which upgrades the version of `data.table` to the
  latest commit in development which has passed all tests. As such we
  don't expect any backwards compatibility concerns. Its manual page was
  causing an intermittent hang/crash from `R CMD check` on Windows-only
  on CRAN which we hope will be worked around by changing its name.

3. Internal C code now passes `-Wstrict-prototypes` to satisfy the
  warnings now displayed on CRAN,
  [#5477](Rdatatable/data.table#5477).

4. `write.csv` in R-devel no longer responds to
  `getOption("digits.secs")` for `POSIXct`,
  [#5478](Rdatatable/data.table#5478). This
  caused our tests of `fwrite(, dateTimeAs="write.csv")` to fail on
  CRAN's daily checks using latest daily R-devel. While R-devel
  discussion continues, and currently it seems like the change is
  intended with further changes possible, this `data.table` release
  massages our tests to pass on latest R-devel. The idea is to try to
  get out of the way of R-devel changes in this regard until the new
  behavior of `write.csv` is released and confirmed. Package updates are
  not accepted on CRAN if they do not pass the latest daily version of
  R-devel, even if R-devel changes after the package update is
  submitted. If the change to `write.csv()` stands, then a future
  release of `data.table` will be needed to make `fwrite(,
  dateTimeAs="write.csv")` match `write.csv()` output again in that
  future version of R onwards. If you use an older version of
  `data.table` than said future one in the said future version of R,
  then `fwrite(, dateTimeAs="write.csv")` may not match `write.csv()` if
  you are using `getOption("digits.secs")` too. However, you can always
  check that your installation of `data.table` works in your version of
  R on your platform by simply running `test.data.table()`
  yourself. Doing so would detect such a situation for you: test 1741
  would fail in this case. `test.data.table()` runs the entire suite of
  tests and is always available to you locally. This way you do not need
  to rely on our statements about which combinations of versions of R
  and `data.table` on which platforms we have tested and support; just
  run `test.data.table()` yourself. Having said that, because test 1741
  has been relaxed in this release in order to be accepted on CRAN to
  pass latest R-devel, this won't be true for this particular release in
  regard to this particular test.

    ```R
    $ R --vanilla
    R version 4.2.1 (2022-06-23) -- "Funny-Looking Kid"
    > DF = data.frame(A=as.POSIXct("2022-10-01 01:23:45.012"))
    > options(digits.secs=0)
    > write.csv(DF)
    "","A"
    "1",2022-10-01 01:23:45
    > options(digits.secs=3)
    > write.csv(DF)
    "","A"
    "1",2022-10-01 01:23:45.012

    $ Rdevel --vanilla
    R Under development (unstable) (2022-10-06 r83040) -- "Unsuffered Consequences"
    > DF = data.frame(A=as.POSIXct("2022-10-01 01:23:45.012"))
    > options(digits.secs=0)
    > write.csv(DF)
    "","A"
    "1",2022-10-01 01:23:45.012
    ```

5. Many thanks to Kurt Hornik for investigating potential impact of a
  possible future change to `base::intersect()` on empty input,
  providing a patch so that `data.table` won't break if the change is
  made to R, and giving us plenty of notice,
  [#5183](Rdatatable/data.table#5183).

6. `datatable.[dll|so]` has changed name to `data_table.[dll|so]`,
  [#4442](Rdatatable/data.table#4442). Thanks to
  Jan Gorecki for the PR. We had previously removed the `.` since `.` is
  not allowed by the following paragraph in the Writing-R-Extensions
  manual. Replacing `.` with `_` instead now seems more consistent with
  the last sentence.

    > ... the basename of the DLL needs to be both a valid file name
      and valid as part of a C entry point (e.g. it cannot contain
      ‘.’): for portable code it is best to confine DLL names to be
      ASCII alphanumeric plus underscore. If entry point R_init_lib is
      not found it is also looked for with ‘.’ replaced by ‘_’.


# data.table [v1.14.2](https://github.com/Rdatatable/data.table/milestone/24?closed=1)  (27 Sep 2021)

## NOTES

1. clang 13.0.0 (Sep 2021) requires the system header `omp.h` to be
  included before R's headers,
  [#5122](Rdatatable/data.table#5122). Many
  thanks to Prof Ripley for testing and providing a patch file.