[documentation] Experiment using concurrent garbage collector #789

Open
mpickering opened this issue Mar 27, 2020 · 13 comments
Labels
documentation, performance, type: enhancement

Comments

@mpickering
Contributor

If a GC happens during a request there can be a reasonably big pause if you have a largish heap.

We should try using the concurrent GC to attempt to reduce pause times.

@pepeiborra
Collaborator

pepeiborra commented Apr 4, 2020

I gave -xn -I1 a try and got all kinds of crashes after a few minutes of loading the ghcide codebase in VSCode:

  • Unknown closure type error
  • Seg fault
  • Lock up

I also checked that these problems do not reproduce without -xn. I cannot reproduce them with just -xn either, i.e. without -I1 (ghcide is built with -rtsopts -I0).

Will open tickets upstream for @bgamari and @osa1 to investigate

EDIT: Updated to account for -I1
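
Because the crashes only showed up with -I1 (idle GC on) and not with the built-in -I0, it can be useful to confirm which idle-GC settings a binary is actually running with. The following is a minimal sketch, assuming the GHC.RTS.Flags module from base; describeIdleGC is a hypothetical helper name, not something from ghcide:

```haskell
import GHC.RTS.Flags (GCFlags (..), getGCFlags)

-- Hypothetical helper: render the idle-GC settings as a short message.
-- The delay is passed in already shown, so this function stays pure.
describeIdleGC :: Bool -> String -> String
describeIdleGC enabled delay
  | enabled   = "idle GC enabled, delay = " ++ delay ++ " (RTS time units)"
  | otherwise = "idle GC disabled"

main :: IO ()
main = do
  -- Read the GC flags the running RTS was started with.
  flags <- getGCFlags
  putStrLn (describeIdleGC (doIdleGC flags) (show (idleGCDelayTime flags)))
```

Running the compiled program once with +RTS -I0 -RTS and once with +RTS -I1 -RTS (the binary needs -rtsopts for that) should show the difference. The unit of the reported delay is an RTS-internal detail, so the number is best used for comparison only.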

@pepeiborra
Collaborator

Backtrace from gdb:

Reading symbols from /home/pepe/scratch/ghcide/dist-newstyle/build/x86_64-linux/ghc-8.10.1/ghcide-0.1.0/x/ghcide/build/ghcide/ghcide...
[New LWP 23766]
[New LWP 23298]
[New LWP 23321]
[New LWP 23301]
[New LWP 23310]
[New LWP 23326]
[New LWP 23320]
[New LWP 23330]
[New LWP 23318]
[New LWP 23380]
[New LWP 23322]
[New LWP 23316]
[New LWP 23383]
[New LWP 23424]
[New LWP 23302]
[New LWP 23314]
[New LWP 23311]
[New LWP 23308]
[New LWP 23312]
[New LWP 23669]
[New LWP 23317]
[New LWP 23611]
[New LWP 23454]
[New LWP 23315]
[New LWP 23299]
[New LWP 23381]
[New LWP 23313]
[New LWP 23300]
[New LWP 23332]
[New LWP 23319]
[New LWP 23329]
[New LWP 23389]
[New LWP 23323]
[New LWP 23349]
[New LWP 23328]
[New LWP 23327]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/nix/store/9rabxvqbv0vgjmydiv59wkz768b5fmbc-glibc-2.30/lib/libthread_db.so.1".
Core was generated by `/home/pepe/scratch/ghcide/dist-newstyle/build/x86_64-linux/ghc-8.10.1/ghcide-0.'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000000003e8bdbd in nonmovingSweepMutLists ()
[Current thread is 1 (Thread 0x7f5f02928700 (LWP 23766))]
(gdb) bt
#0  0x0000000003e8bdbd in nonmovingSweepMutLists ()
#1  0x0000000003e6a47c in nonmovingMark_.constprop.0 ()
#2  0x0000000003e6a655 in nonmovingConcurrentMark ()
#3  0x00007f5fbbe78edd in start_thread () from /nix/store/9rabxvqbv0vgjmydiv59wkz768b5fmbc-glibc-2.30/lib/libpthread.so.0
#4  0x00007f5fbbbbaa4f in clone () from /nix/store/9rabxvqbv0vgjmydiv59wkz768b5fmbc-glibc-2.30/lib/libc.so.6

@bgamari

bgamari commented Apr 4, 2020

Thanks @pepeiborra! I'm looking into it (although do open a GHC ticket as well)

@pepeiborra
Collaborator

@mpickering
Contributor Author

Fix is in this MR: https://gitlab.haskell.org/ghc/ghc/-/merge_requests/3186

Ben reports

<•bgamari> nonmoving collector reduces the average gen1 pause of ghcide from >350ms to ~10ms
6:36 PM <mpickering> That sounds promising
6:36 PM <mpickering>  How much residency is there?
6:37 PM <•bgamari> maximum goes from 1s to 60ms
6:37 PM <•bgamari> bytes copied goes down to a factor of 8

@bgamari

bgamari commented May 2, 2020

For the record, this measurement was taken during an ad hoc editing session against the lens library. I am currently working on a more systematic measurement.

@jneira
Member

jneira commented Oct 5, 2020

@pepeiborra the GHC MR is merged. Did it fix the problems with the concurrent garbage collector?

@pepeiborra
Collaborator

I haven't checked

pepeiborra transferred this issue from haskell/ghcide on Jan 1, 2021
@pepeiborra
Collaborator

pepeiborra commented Jan 3, 2021

I have checked again with ghc 8.10.3 and it seems to work pretty well. The crashes are gone. I have seen one crash with --nonmoving-gc -A128M enabled, but it's extremely hard to reproduce and could be something else:

ghcide: internal error: SMALL_MUT_ARR_PTRS_FROZEN_CLEAN object (0x4228156f38) entered!
    (GHC version 8.10.3 for x86_64_apple_darwin)
    Please report this as a GHC bug:  https://www.haskell.org/ghc/reportabug

I also collected some performance numbers using the benchmark suite. Overall, the timings were similar or slightly worse in most benchmarks. I wasn't able to find a set of GC flags using --nonmoving-gc that showed an improvement over the ones we use right now (-A128M -qg -I0). But I did notice that -qg, which disables the parallel GC, is a net loss - all the benchmarks are faster without it.

For the edit experiment in the lsp-types example, the chart below shows the live bytes over time (as reported by -S) for various configurations over 100 samples:

  • upstream: -A128M -qg -I0
  • Adefault: -qg -I0
  • parallelGC: -A128M -I0
  • A64: -A64M -qg -I0
  • nmA64: --nonmoving-gc -A64M
  • nmAdefault: -qg -I0 --nonmoving-gc

[image: chart of live bytes over time for each configuration]

Branch to reproduce: https://github.com/pepeiborra/ide/tree/benchmark-rts-opts-nm
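
For anyone repeating the comparison by hand, the configurations listed above can be written down as data and turned into command-line arguments. This is only a sketch: the labels and flags are copied from the list in this comment, while the names configurations and rtsArgs are hypothetical and do not come from the benchmark suite or the linked branch.

```haskell
-- (label, RTS flags) pairs, copied from the list of configurations above.
configurations :: [(String, [String])]
configurations =
  [ ("upstream",   ["-A128M", "-qg", "-I0"])
  , ("Adefault",   ["-qg", "-I0"])
  , ("parallelGC", ["-A128M", "-I0"])
  , ("A64",        ["-A64M", "-qg", "-I0"])
  , ("nmA64",      ["--nonmoving-gc", "-A64M"])
  , ("nmAdefault", ["-qg", "-I0", "--nonmoving-gc"])
  ]

-- Wrap a set of RTS flags in the +RTS ... -RTS delimiters and add -S,
-- the per-GC statistics option the live-bytes numbers were read from.
rtsArgs :: [String] -> [String]
rtsArgs flags = ["+RTS"] ++ flags ++ ["-S", "-RTS"]

main :: IO ()
main = mapM_ printConfig configurations
  where
    printConfig (label, flags) =
      putStrLn (label ++ ": " ++ unwords (rtsArgs flags))
```

The printed argument lists can be appended to whatever command starts the server under test, provided the binary accepts RTS options.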

@mpickering
Contributor Author

@pepeiborra Thanks for looking into this. I will ask about the panic you are seeing.

Isn't the idea behind using the nonmoving GC to reduce the pause times? This is interesting for us because if the pause happens while serving a request, the user will notice it. For example, if you hover and a GC kicks in for 1-2s, the hover response is delayed by that long as well. With the nonmoving GC the pause should be shorter, which gives a smoother experience for the user even if overall throughput is slightly worse.
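
One rough way to see whether a given request was hit by a collection is to compare the RTS statistics before and after handling it. The sketch below assumes the GHC.Stats module from base and a program started with +RTS -T (otherwise no statistics are collected); withGcCost is a hypothetical helper, not part of ghcide or haskell-language-server. Note that the elapsed GC time it reports is not split into stop-the-world pause and concurrent work, so it is an indicator rather than a pause measurement.

```haskell
import Control.Exception (evaluate)
import Data.Int (Int64)
import Data.Word (Word32)
import GHC.Stats (RTSStats (..), getRTSStats, getRTSStatsEnabled)

-- Hypothetical helper: run an action and report how many GCs ran and how
-- much elapsed GC time (in nanoseconds) accumulated while it was running.
-- Returns Nothing for the cost when RTS statistics are not enabled.
withGcCost :: IO a -> IO (a, Maybe (Word32, Int64))
withGcCost action = do
  enabled <- getRTSStatsEnabled
  if not enabled
    then do
      result <- action
      pure (result, Nothing)
    else do
      before <- getRTSStats
      result <- action
      after <- getRTSStats
      let gcCount = gcs after - gcs before
          gcTime  = gc_elapsed_ns after - gc_elapsed_ns before
      pure (result, Just (gcCount, gcTime))

main :: IO ()
main = do
  -- Stand-in for handling a request; evaluate forces the work to happen here.
  (answer, cost) <- withGcCost (evaluate (sum [1 .. 1000000 :: Int]))
  print answer
  print cost
```

In a multi-threaded server the counters are process-wide, so a collection triggered by another thread is counted as well. For the question here that is arguably what matters, since any stop-the-world pause delays every in-flight request.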

@pepeiborra
Collaborator

pepeiborra commented Jan 4, 2021

Yes, that's correct. To measure pauses, the benchmark suite needs to be extended to report the maximum time per sample, in addition to the total time it reports today.
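
As an illustration of the difference between the two metrics, here is a minimal sketch that times individual samples and reports both the total and the worst case. It assumes GHC.Clock from base (available since base 4.11); Summary, summarize and timed are hypothetical names and do not reflect how the existing benchmark suite is structured.

```haskell
import Control.Exception (evaluate)
import Data.Word (Word64)
import GHC.Clock (getMonotonicTimeNSec)

-- Total and worst-case duration over a set of samples, in nanoseconds.
data Summary = Summary { totalNs :: Word64, maxNs :: Word64 }
  deriving (Eq, Show)

-- Summarise per-sample durations; an empty list yields zeros.
summarize :: [Word64] -> Summary
summarize samples =
  Summary { totalNs = sum samples, maxNs = foldr max 0 samples }

-- Time a single action using the monotonic clock.
timed :: IO a -> IO (a, Word64)
timed action = do
  start <- getMonotonicTimeNSec
  result <- action
  end <- getMonotonicTimeNSec
  pure (result, end - start)

main :: IO ()
main = do
  -- Stand-ins for benchmark samples such as repeated edit or hover requests.
  samples <- mapM runSample [10000, 1000000, 10000]
  print (summarize samples)
  where
    runSample n = snd <$> timed (evaluate (sum [1 .. n :: Int]))
```

Two runs with the same total can differ a lot in their maximum, and the maximum is the number that a single long GC pause would show up in.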

jneira added the type: enhancement and performance labels on Jan 12, 2021
@fishtreesugar

fishtreesugar commented Jun 24, 2022

https://twitter.com/monadiccheng/status/1539583255317446658 by @TerrorJack

haskell-language-server with --nonmoving-gc is a lot smoother, when heap size goes beyond 10GiB

hasufell changed the title from "Experiment using concurrent garbage collector" to "[documentation] Experiment using concurrent garbage collector" on Jul 13, 2022
@hasufell
Member

Since this works, can someone raise a PR to add this to the documentation for experimentation-friendly users?
