Error: segfault from C stack overflow #9

Open · bhive01 opened this issue May 15, 2020 · 28 comments


bhive01 commented May 15, 2020

See here for my initial look at the issue:
tidyverse/dplyr#5243
wilkelab/ggtext#34

I have continued to work on this all week and figured some things out. Essentially, all of the data in my example is required to reproduce the problem and so is ggtext.

You need 3 components:
RDS: https://drive.google.com/file/d/1lhz7kOTZTKT7Dt0D5bcmbwSldiOLwe6F/view?usp=sharing
RMD: https://gist.github.com/bhive01/6f68c3301dbfb75ba0063db2abbc7f64
Rscript: https://gist.github.com/bhive01/e293135e904745a129165794bd4c0032

The RDS is compressed and uncompresses to about 2 GB in RAM. It is a very large multilayered tibble with premade plots embedded. When I run the Rscript with the entire dataset and ggtext, I get:

processing file: PerTrialReportDev.Rmd
  (knitr progress bar: 17% ... 100%)
label: outputs (with options) 
List of 7
 $ message   : logi FALSE
 $ warning   : logi FALSE
 $ error     : logi FALSE
 $ fig.retina: logi TRUE
 $ fig.height: num 7
 $ fig.width : num 8
 $ results   : chr "asis"

Error: segfault from C stack overflow

This happens consistently for me (session_infos linked below). If you comment out library(ggtext), it runs without issue. I've tried it with only slices 1, 2, 3, 1 & 2, 1 & 3, and 2 & 3 and it works, but with all 3 rows it segfaults (with ggtext). In RStudio you get the bomb icon; I'm running this in the R GUI.

When I run the script without ggtext:

processing file: PerTrialReportDev.Rmd
  (knitr progress bar: 17% ... 100%)
label: outputs (with options) 
List of 7
 $ message   : logi FALSE
 $ warning   : logi FALSE
 $ error     : logi FALSE
 $ fig.retina: logi TRUE
 $ fig.height: num 7
 $ fig.width : num 8
 $ results   : chr "asis"


output file: PerTrialReportDev.knit.md

/usr/local/bin/pandoc +RTS -K512m -RTS PerTrialReportDev.utf8.md --to html4 --from markdown+autolink_bare_uris+tex_math_single_backslash+smart --output pandoc100c61db9570d.html --email-obfuscation none --self-contained --standalone --section-divs --table-of-contents --toc-depth 4 --variable toc_float=1 --variable toc_selectors=h1,h2,h3,h4 --variable toc_collapsed=1 --variable toc_smooth_scroll=1 --variable toc_print=1 --template /Library/Frameworks/R.framework/Versions/4.0/Resources/library/rmarkdown/rmd/h/default.html --no-highlight --variable highlightjs=1 --variable 'theme:bootstrap' --include-in-header /var/folders/y8/67ytt5n148qfmft0xgrb663m0000gn/T//RtmpAcnOAe/rmarkdown-str100c63c1b331f.html --mathjax --variable 'mathjax-url:https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML' --lua-filter /Library/Frameworks/R.framework/Versions/4.0/Resources/library/rmarkdown/rmd/lua/pagebreak.lua --lua-filter /Library/Frameworks/R.framework/Versions/4.0/Resources/library/rmarkdown/rmd/lua/latex-div.lua 

Output created: DEVoutputs/Strawberry/NZPN/NZS1802/NZPN_NZS1802_Experiment_Data_Summary.html

session_info taken after running
without ggtext: https://gist.github.com/bhive01/491477484034ad109b9f6791d99afa75
with ggtext: https://gist.github.com/bhive01/43e7aa0f1e52e5e53d8f48801153d403

@clauswilke (Collaborator)

If the problem arises only when ggtext is loaded, then it's likely a gridtext issue. However, we will need a much more limited reproducible example. I'd like to ask you to whittle away at things until we have isolated exactly what causes the problem: remove anything you can remove without making the problem go away. Instead of making multiple different figures, see if you can make the same one over and over. See if the problem remains when you don't save and reload figures but instead generate them all in one go, and so on.

If this is a memory leak in gridtext, it should be possible to bring this down to one very simple figure that you just create over and over until it segfaults. Alternatively, I wouldn't be surprised if it was somehow related to saving and reloading plots, which gridtext may not support. (ggplot plots contain a copy of their environment, and that may cause the trouble.)
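
For instance, a sketch of the save/reload test could look like the following (the markdown string and file name are arbitrary placeholders):

library(ggplot2)
library(ggtext)

# build a small plot that exercises ggtext
p <- ggplot(mtcars, aes(mpg, wt)) +
  geom_point() +
  labs(title = "A **markdown** title") +
  theme(plot.title = element_markdown())

saveRDS(p, "reload-test.rds")     # save the ggplot object to disk
p2 <- readRDS("reload-test.rds")  # reload it
print(p2)                         # does drawing the reloaded plot still work?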

@clauswilke (Collaborator)

Actually, another simple thing to try: Before saving any plots, zero out the environment they hold. You can do this as follows:

library(ggplot2)

# make a plot
p <- ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
  geom_point() +
  facet_wrap(~Species)

# zero out the environment the plot captured
p$plot_env <- rlang::new_environment()

# now save
...
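
For the tibble case, the same idea could be mapped over a list column before saving (a sketch; the plot_tbl/plots column and file names are hypothetical):

library(purrr)

# zero out the captured environment of every plot stored in the list column
plot_tbl$plots <- map(plot_tbl$plots, function(p) {
  p$plot_env <- rlang::new_environment()
  p
})

saveRDS(plot_tbl, "plots_envfree.rds")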


bhive01 commented May 15, 2020

OK, I applied the environment fix to all of the plots in the tibble. I saved that tibble as an RDS here:
https://drive.google.com/file/d/1qjljuTTiAnzV_ldZKl-9lKDlDf6z_vLm/view?usp=sharing
When I load this into the above script and knit, I get the same error.

processing file: PerTrialReportDev.Rmd
  (knitr progress bar: 17% ... 100%)
label: outputs (with options) 
List of 7
 $ message   : logi FALSE
 $ warning   : logi FALSE
 $ error     : logi FALSE
 $ fig.retina: logi TRUE
 $ fig.height: num 7
 $ fig.width : num 8
 $ results   : chr "asis"

Error: segfault from C stack overflow

To your comment about whittling this down: I've been doing that for nearly the whole week, much to my employer's chagrin. At first I thought it was because I was using development dplyr/vctrs/glue, which is why I started this adventure in the dplyr repo, but yesterday, through guess-and-check, I narrowed it down to ggtext; the problem is reproducible on two of my Macs. I'm sorry the example is so large, but I've tried really hard to get to where I am.
I will do my best to create a smaller example with made-up data and report back.

@clauswilke (Collaborator)

Too bad the environment is not the issue. It would have been an easy fix.


bhive01 commented May 15, 2020

OK, I got it to fail with a simpler example, one that doesn't require my data, so I'll be taking that data down now.
Two files now:
RMD: https://gist.github.com/bhive01/668abe7417ea23a014e9458edd349f5f
rscript: https://gist.github.com/bhive01/014dd24a4ffb919eb76c967c1414a478

This takes mtcars, makes 1200 samples of it, and builds a plot for each that uses ggtext in the caption.
It segfaulted on me in the same way as my data did.
UPDATE: I just tried it again and found it works with 700 samples on my machine.

@clauswilke (Collaborator)

I'm glad you're able to cut this down, but please remove absolutely everything that is not needed. Make it the smallest possible example that causes the problem. Also, do you need the separate rscript? What if you just make 1000 plots in the .Rmd? What if you just make 1000 plots without using knitr at all?


bhive01 commented May 15, 2020

I am trying; I'm sorry if this is too piecemeal for you. I spent a couple of hours trying to get Valgrind and lldb to work as Thomas suggested, and I downloaded Winston's Docker image, but I haven't used those tools enough to get to the problem quickly.
I can tell you that I tried simply printing 1200 plots (via the quartz device) and that worked. I combined the two scripts into one and can get it to fail when knitting. knitr seems to be required, but I want to check one more thing (saving the plots, which is what knitr does).
Here is the combined Rmd:
https://gist.github.com/bhive01/49b5777fafb4fe60e23b9fc275d1e50b
I'll work on the file saving now and update here...


clauswilke commented May 15, 2020

Just in general, I'm referring to things like this:
https://gist.github.com/bhive01/668abe7417ea23a014e9458edd349f5f#file-plot_num_test-rmd-L15
or this:
https://gist.github.com/bhive01/49b5777fafb4fe60e23b9fc275d1e50b#file-plot_num_test_1file-rmd-L48
None of this code is likely needed for the reprex, so just ruthlessly delete it all. Similarly, in the plot I suspect you need only a single element_markdown(). (In fact, the next step will be to remove the plot entirely and instead use a call to richtext_grob() directly, as in the example below.)

Minimal gridtext example:

library(grid)
library(gridtext)

g <- richtext_grob("Hello!")
grid.newpage()
grid.draw(g)

Created on 2020-05-15 by the reprex package (v0.3.0)
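
If the leak is in gridtext itself, a stress-test variant of that reprex, repeating the call many times while keeping the grobs around, may be enough to reproduce the crash (a sketch; the count of 5000 is arbitrary):

library(grid)
library(gridtext)

grobs <- vector("list", 5000)
for (i in seq_along(grobs)) {
  grobs[[i]] <- richtext_grob("Hello **world**!")  # keep every grob alive
  grid.newpage()
  grid.draw(grobs[[i]])
}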


ltierney commented May 16, 2020

You are getting a C stack overflow, so the likely cause is function calls nested too deeply. To figure out what is happening I would run under a debugger and look at the C call stack when the crash happens.

For me on Linux this runs OK with my standard settings, which include an 8 MB C stack. If I drop that to 1 MB, I get stack overflows in R_ReleaseObject. The implementation there isn't ideal; I'll try to improve it soonish. That might clear this up, especially if the stack size on your platform is lower. If it doesn't, looking at the C stack in a debugger on your end is probably your best bet.

If this is the issue, then for some reason a humongous number of objects is being put on the R_PreserveObject list, and that probably isn't a great idea.


bhive01 commented May 16, 2020

Shortest one yet!
https://gist.github.com/bhive01/2ac1fed2fd0280605c3d64161dfe60e2
Took me all day. It takes about 10 minutes to fail on my machine.
Luke, I tried to get Valgrind to work following Kevin Ushey's blog, but it refused to work, and lldb was similar. I downloaded Winston's Docker image, but for some reason its R version is 3.6, and I don't know how to run this code in that environment with a debugger attached. If you or anyone here can help me do that, I will do it.
One thing I noticed is that when I shortened the caption text it worked, so that's where I'm going to start tomorrow morning: changing the number of line breaks and **'s and inserting more generic text of a given length. I wonder if it has anything to do with the wrapping code or the buffer required to hold that many line breaks.


ltierney commented May 16, 2020

It's a stack overflow, not a memory error, so Valgrind isn't likely to help. You need to catch the segfault in a debugger; Catalina has made that harder.

On a Mac or on Linux you can make this fail more quickly by starting a shell, reducing your stack size, and then running your example.

The Writing R Extensions manual says R_PreserveObject should be used "sparingly". Running your code puts about 300K objects on the preserved list, and the R_ReleaseObject code wasn't written with that in mind. I'll commit a fix soon to R-patched and R-devel that will prevent the stack overflow.

It is still a horrible design (somewhere in the C code that is running) to push that many things into the preserved object list and rely on finalizers to eventually clean things up.


bhive01 commented May 16, 2020

It seems to require knitr to trigger the stack overflow. I suspect the issue really lies there but is exacerbated by ggtext/gridtext. I would never have come across this if this particular experiment had a better data structure (the outputs are fairly useless for this data set in my pipeline), but it all needs to go through the same pipeline.
Anyway, I don't know how to set up a debugger on Catalina; everything I've tried, both lldb and gdb, fails for one reason or another.

@clauswilke (Collaborator)

@ltierney This may be a strange interaction of several different packages. The gridtext package is written with Rcpp, so it never explicitly calls R_PreserveObject. It just uses the API provided by Rcpp. This seems related (but has never seen any attention): RcppCore/Rcpp#382

@clauswilke (Collaborator)

Is it possible to work around the issue by simply increasing the stack size?

@ltierney

In principle, yes. On Linux the default stack size is 8 MB, and that is enough for the example I ran. I forget what the default is on Macs and how much you can raise it without sudo-level fiddling.


clauswilke commented May 16, 2020

It turns out it's 8 MB on the Mac as well, and there's no simple way to make it bigger (only smaller); increasing it requires special linker flags. https://developer.apple.com/library/archive/qa/qa1419/_index.html
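
For reference, base R reports the C stack limit it detected at startup, so a quick check on any platform is Cstack_info(); on Linux the limit can also be lowered with ulimit -s in the shell before starting R, as Luke suggests above.

# "size" is reported in bytes; 8388608 corresponds to the 8 MB default discussed here
Cstack_info()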


ltierney commented May 16, 2020

It may be that on Linux I'm just barely avoiding the segfault.

I've committed the changes to avoid deep recursion in R_ReleaseObject in r78474 (R-devel) and r78475 (R-patched), so they should make their way into the nightly builds soon and into R 4.0.1 in a few weeks. The objects are still stored in a linked list, so the linear-search performance issue Kevin Ushey mentions is still there.

It would still be good to understand why this list is getting so large, and why it is only being cleaned up by (eventually) getting finalizers to run. As a design this would be a bad idea, but it may be that there is a better design in place and something isn't working quite right.

You can see one aspect of the problem if you run Brandon's example with source() and then manually run gc() until things clear out. Many of these gc() calls take forever, at least in part because of the linear searches.

To expand on this: if I call the following function after sourcing Brandon's code, it takes about 1000 seconds to finish.

cleanup <- function() {
  # run gc() repeatedly until the reported number of used Ncells stops changing
  old <- 0
  system.time(
    repeat {
      new <- gc()[[1]]
      if (old == new) return(old)
      else old <- new
    })
}

If I change the implementation of the preserved object set to a hash table, it takes 3 seconds, but that requires more complex and slightly non-portable code, so it's not somewhere I particularly want to go.

@clauswilke (Collaborator)

@ltierney Thanks for committing a fix in R_ReleaseObject!

My hunch is this will ultimately trace back to the fact that gridtext generates thousands of little strings and stores each one in a separate R string vector. I had to do it this way to avoid encoding issues on Windows. Once we move to UTF-8 throughout, I can store the strings as regular C++ objects and sidestep the R memory allocation/release issues entirely.

@ltierney

But that doesn't explain why they have to end up in the preserved-object list. I very rarely use that mechanism; when I do, it is usually for some table I want to keep alive for an entire session without making it reachable by other means, such as an R variable. I'm definitely missing something about the usage here.

@clauswilke (Collaborator)

Rcpp calls R_PreserveObject on every individual string:
https://github.com/RcppCore/Rcpp/blob/446c3cafdf514e6d9db1d84c580b77c5b1d8f3e2/inst/include/Rcpp/String.h#L55-L58
So maybe the problem is thousands of strings that stick around until the end of the session. One explanation for why this behaves differently with or without knitr could be that knitr holds on to all the grobs after plotting, whereas just calling plot(...) doesn't; in one case things get cleaned up as we go and in the other they don't. I'm testing this right now.

@ltierney

I think I'm beginning to see. It looks like Rcpp is misusing this feature. A better approach would seem to be for Rcpp to maintain its own table of R objects it wants to keep alive, probably using weak references, and just register that table with R_PreserveObject. Access to that table could then be optimized for the Rcpp use case, probably using hashing. That should give better performance on the Rcpp side and not stress the R_PreserveObject mechanism in ways it isn't designed for.
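
An R-level analogy of that design, just to illustrate the idea (the real change would live in Rcpp's C++ internals, and all names below are made up): keep one container alive and put the individual objects inside it. R environments are hash tables, so insertion and removal stay cheap regardless of how many objects are registered.

registry <- new.env(parent = emptyenv())  # the single object that stays reachable
counter <- 0

preserve_in_registry <- function(x) {
  counter <<- counter + 1
  key <- as.character(counter)
  assign(key, x, envir = registry)        # cheap insert, no matter how large the table
  key
}

release_from_registry <- function(key) {
  rm(list = key, envir = registry)        # removal by key, no linear scan of a list
}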

@clauswilke (Collaborator)

Something like this seems to kill my R session reliably. For some reason I need the second loop.

library(ggplot2)
library(ggtext)
library(grid)

plot_grob <- function() {
  # one small ggplot with a ggtext textbox caption, returned as a grob
  p <- ggplot(data.frame(x = 1, y = 1)) +
    geom_point(aes(x, y)) +
    labs(caption = "<br>Pink dots represent outliers and are removed from downstream analyses.<br>Error bars represent the standard error of the mean.<br>ANOVA model significance *p* < 0.05.<br>Treatments with the same letter are not significantly different at *α* = 0.05 according to Tukey's HSD. ") +
    theme(plot.caption = element_textbox_simple())
  ggplotGrob(p)
}

l <- list()
for (i in 1:1000) {
  cat(i, " ")
  g <- plot_grob()
  grid.newpage()
  grid.draw(g)
  l[[i]] <- g
}

l <- NULL
l <- list()
for (i in 1:1000) {
  cat(i, " ")
  g <- plot_grob()
  grid.newpage()
  grid.draw(g)
  l[[i]] <- g
}

@clauswilke (Collaborator)

If I comment out the list assignment in the above code (i.e., l[[i]] <- g), everything works fine.

@bhive01 Thinking about this some more, the problem in your .Rmd may be that you're generating all the plots in one chunk. Depending on how knitr is written, it may hold on to all of the output from a chunk before processing it. If you can rewrite your code so that plot generation is broken up over several chunks, this problem may go away.

@ltierney

This is in R 4.0.0, not R-devel with my commit, I assume? That makes sense: it is putting about 300K objects into the preserved-object list before it starts to clean up, and if it does start to clean up during the second loop, that takes a long time.

@clauswilke (Collaborator)

Actually, for me it's R 3.6. I haven't upgraded to 4.0 yet.


bhive01 commented May 16, 2020

@clauswilke, you might be on to something there, and that might explain why knitr is connected to my version of the issue. I do know that it generates the PNG files all at once and then pulls them into the HTML document after encoding them. Is this something I should bring up with Yihui, or is it likely to be "fixed" by Luke's edits?
Regarding your comment about changing my Rmd: I'm using a walk() to iterate over the plots and output them one section of data at a time. I'm not sure how to split that work up, since I would need to iterate over chunks, and I have no idea how to do that.

@clauswilke (Collaborator)

I think knitr behaves the only way it can. You could split the figure generation over more chunks if Luke's patch doesn't solve the issue for you, or you could write a separate script that generates the PNGs and then just pull them into your report directly.
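
A rough sketch of that second route (file and object names here are hypothetical): write the figures out in a plain R script, then have the report only read the finished images.

library(ggplot2)

# separate script, run outside of knitr: save each pre-built plot as a PNG
dir.create("figs", showWarnings = FALSE)
for (i in seq_along(plot_list)) {
  ggsave(sprintf("figs/plot_%04d.png", i), plot_list[[i]],
         width = 8, height = 7, dpi = 300)
}

# then, inside a single .Rmd chunk, pull the finished images into the report:
# knitr::include_graphics(list.files("figs", full.names = TRUE))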

@ltierney

The stack overflow segfault will be gone but, depending on when the cleanups kick in, you may have to go for a cup of coffee or two while waiting for it to finish.
