- GC often makes Ruby slow (especially <= v2.0), and that's because of high memory consumption
- Ruby has significant memory overhead
- GC in v2.1+ is 5 times faster!
- Raw performance of v1.9 - v2.3 is about the same
See 001_gc.rb
- 80% of performance optimization comes from memory optimization
See 002_memory.rb
- GC::Profiler has memory and CPU overhead (see wrapper.rb for a custom wrapper example).
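For reference, basic `GC::Profiler` usage looks roughly like this (a minimal sketch; the profiled loop is just an illustrative allocation-heavy workload):

```ruby
GC::Profiler.enable

# Allocation-heavy work to make the GC run a few times.
100_000.times { "x" * 100 }

GC::Profiler.report           # prints one line per GC run to STDOUT
puts GC::Profiler.total_time  # total GC time spent while profiling
GC::Profiler.disable
```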
- Save memory by avoiding copying objects; modify them in place where possible (use bang `!` methods).
- If your String is less than 40 bytes, use `<<`, not `+=`, to concatenate it and Ruby will not allocate an additional object.
See 004_array_bang.rb (w/ GC)
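A minimal sketch of both points: `+=` always builds a new String, while `<<` and the bang methods mutate the receiver in place:

```ruby
# += allocates a brand-new String on every concatenation:
s = "a"
s += "b"        # new object; the old value becomes garbage

# << appends in place, no extra String object:
s = "a"
s << "b"        # same object, mutated

# Same idea with bang methods: downcase! mutates instead of copying.
name = "RUBY"
name.downcase!  # in place, returns nil if nothing changed
```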
- Read files line by line, and keep in mind not only total memory consumption but also peaks.
See 005_files.rb (w/ and w/o GC)
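A sketch of the difference, assuming a hypothetical log file at `data.log` and a hypothetical `process` method standing in for your own handling:

```ruby
# Loads the whole file into memory at once -- the peak is roughly the file size:
File.read("data.log").each_line do |line|
  process(line)
end

# Streams the file line by line -- only one line is held at a time:
File.foreach("data.log") do |line|
  process(line)
end
```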
- Callbacks cause objects to stay in memory. If you store callbacks, do not forget to remove them after they are called.
See 006_callbacks_1.rb, 006_callbacks_2.rb, 006_callbacks_3.rb
- Try to avoid `&block` and use `yield` instead.
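A minimal sketch of why: capturing `&block` materializes a Proc object on every call, while `yield` invokes the block without that allocation:

```ruby
# Allocates a Proc for the block on each call:
def with_block(&block)
  block.call(1)
end

# Uses the block in place -- no Proc object is created:
def with_yield
  yield 1
end

with_block { |x| x + 1 }
with_yield { |x| x + 1 }
```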
- Iterators use block arguments, so use them carefully.
  Issues:
  - GC will not collect the iterable before the iterator is finished
  - Iterators create temp objects
  Solutions:
  - Free objects from the collection during iteration & use the `each!` pattern (see the sketch after the table below)
    See 007_iter_1.rb
  - Look at C code to find object allocations
    See 007_iter_2.rb (for ruby < 2.3.0)
Table of `T_NODE` allocations per iterator item for ruby 2.1:
| Iterator | Enum | Array | Range |
| ---------------: | ---- | ----- | ----- |
| all? | 3 | 3 | 3 |
| any? | 2 | 2 | 2 |
| collect | 0 | 1 | 1 |
| cycle | 0 | 1 | 1 |
| delete_if | 0 | — | 0 |
| detect | 2 | 2 | 2 |
| each | 0 | 0 | 0 |
| each_index | 0 | — | — |
| each_key | — | — | 0 |
| each_pair | — | — | 0 |
| each_value | — | — | 0 |
| each_with_index | 2 | 2 | 2 |
| each_with_object | 1 | 1 | 1 |
| fill | 0 | — | — |
| find | 2 | 2 | 2 |
| find_all | 1 | 1 | 1 |
| grep | 2 | 2 | 2 |
| inject | 2 | 2 | 2 |
| map | 0 | 1 | 1 |
| none? | 2 | 2 | 2 |
| one? | 2 | 2 | 2 |
| reduce | 2 | 2 | 2 |
| reject | 0 | 1 | 0 |
| reverse | 0 | — | — |
| reverse_each | 0 | 1 | 1 |
| select | 0 | 1 | 0 |
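As referenced in the Solutions list above, a minimal sketch of the `each!` pattern: it removes items from the collection while iterating, so already-processed objects can be garbage collected before the loop finishes.

```ruby
class Array
  # Destructive iteration: yields each element and drops it from the
  # array, so the iterable shrinks (and frees memory) as it goes.
  def each!
    while count > 0
      yield shift
    end
  end
end

list = Array.new(1_000) { |i| "item #{i}" }
list.each! { |item| item.upcase }
list.size  # => 0 -- the collection has been consumed
```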
- Date parsing is slow
See 008_date_parsing.rb
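One common mitigation (a sketch, assuming the dates share a known format): `Date.parse` has to guess the format on every call, while `Date.strptime` is told it up front:

```ruby
require 'date'

# Slow: the format is detected on every call.
Date.parse("2016-05-21")

# Faster: the format is given explicitly.
Date.strptime("2016-05-21", "%Y-%m-%d")
```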
- `Object#class`, `Object#is_a?`, `Object#kind_of?` are slow when used inside iterators
- Use SQL for aggregation & calculation where possible
See 009_db (the database query itself takes only ~30 ms)
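A sketch of the idea, assuming a hypothetical ActiveRecord `Order` model with an `amount` column (not runnable without a database):

```ruby
# Ruby-side aggregation: every row is instantiated as an ActiveRecord object.
total = Order.all.map(&:amount).sum

# Database-side aggregation: a single SELECT SUM(...) query, no row objects.
total = Order.sum(:amount)
```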
- Use native (compiled C) gems if possible
- ActiveRecord uses ~3x the DB data size in memory and often triggers GC
- Use `#pluck`, `#select` to load only the necessary data
- Preload associations if you plan to use them
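Sketches of both tips, assuming a hypothetical `User` model with a `posts` association (illustrative only):

```ruby
# Load only the columns you need instead of whole records:
emails = User.pluck(:email)        # array of strings, no User objects
users  = User.select(:id, :email)  # User objects carrying only two attributes

# Preload associations to avoid N+1 queries when you know you'll use them:
User.includes(:posts).each do |user|
  user.posts.size                  # no extra query per user
end
```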
- Use `#find_by_sql` to aggregate associations data
- Use `#find_each` & `#find_in_batches`
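A sketch, again with a hypothetical `User` model: both methods fetch rows in batches instead of loading the whole table into memory at once:

```ruby
# Yields records one by one, loading them in batches (1000 per batch by default):
User.find_each(batch_size: 500) do |user|
  user.touch
end

# Yields whole batches, useful when records are processed in groups:
User.find_in_batches(batch_size: 500) do |batch|
  batch.each(&:touch)
end
```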
- Use `ActiveRecord::Base.connection.execute`, `ActiveRecord::Base.connection.exec_query`, `ActiveRecord::Base.connection.select_values`, `#update_all` to perform simple operations
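Sketches of the "simple operations" idea (hypothetical `users` table and `User` model; the SQL strings are illustrative):

```ruby
# Raw queries skip ActiveRecord object instantiation entirely:
ActiveRecord::Base.connection.execute("UPDATE users SET active = TRUE")
ids  = ActiveRecord::Base.connection.select_values("SELECT id FROM users")
rows = ActiveRecord::Base.connection.exec_query("SELECT id, email FROM users")

# update_all issues a single UPDATE without loading or instantiating records:
User.where(active: false).update_all(active: true)
```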
- Use `render partial: 'a', collection: @col`, which loads the partial template only once
- Paginate large views
- You may disable logging to increase performance
- Watch your helpers, they may be iterator-unsafe
Profiling = measuring CPU/Memory usage + interpreting results
For CPU profiling disable GC!
The ruby-prof gem has both an API (for isolated profiling) and a CLI (for startup profiling). It also has a Rack middleware for Rails.
See 010_rp_1.rb
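A minimal sketch of the ruby-prof block API (the profiled loop is just an illustrative workload):

```ruby
require 'ruby-prof'

result = RubyProf.profile do
  100_000.times { |i| i.to_s }   # code under investigation
end

# Print a flat report (see the report types below) to STDOUT:
RubyProf::FlatPrinter.new(result).print(STDOUT)
```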
Some programs may spend more time on startup than on actual code execution.
Sometimes `GC.disable` may take a significant amount of time because of lazy GC sweep.
Use the `Rack::RubyProf` middleware to profile Rails apps. Insert it before `Rack::Runtime` to include the other middlewares in the report.
To disable GC, use a custom middleware (see 010_rp_rails/config/application.rb).
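A sketch of wiring the middleware into `config/application.rb`; the application name is hypothetical and option names can differ between ruby-prof versions, so treat this as illustrative:

```ruby
# config/application.rb
require 'ruby-prof'

module MyApp  # hypothetical application module
  class Application < Rails::Application
    # Insert before Rack::Runtime so the report also covers the other middlewares.
    config.middleware.insert_before(Rack::Runtime, Rack::RubyProf, path: 'tmp/profile')
  end
end
```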
Rails profiling best practices:
- Disable GC
- Always profile in production mode
- Profile twice and discard cold-start results
- Profile w/ & w/o caching if you use it
The most useful report types for ruby-prof (see 011_rp_rep.rb):
- Flat (Shows which functions are slow)
- Call graph (Shows callers and callees)
- Stack report (Shows execution paths; good for small chunks of code)
Ruby-prof can generate callgrind files with CallTreePrinter (see 011_rp_rep.rb).
Callgrind profiles have a double-counting issue!
Callgrind profiles show loops as recursion.
It is better to start from the bottom of Call Graph and optimize its leaves first.
Always start optimizing with writing tests & benchmarks.
! The profiler can slow function calls down by up to 10x.
If you optimized individual functions but the whole thing is still slow, look at the code at a higher abstraction level.
Optimization tips:
- Optimization with the profiler is a craft (not engineering)
- Always write tests
- Never forget about the big picture
- The profiler skews measurements, so benchmarks are needed
80% of Ruby performance optimization comes from memory optimization.
You have 3 options for memory profiling:
- Massif / Stackprof profiles
- Patched Ruby interpreter & ruby-prof
- Printing `GC#stat` & `GC::Profiler` measurements
To detect whether memory profiling is needed, use monitoring and profiling tools.
A good profiling tool is Valgrind Massif, but it shows memory allocations only for C/C++ code.
Another tool is Stackprof, which shows the number of object allocations (roughly proportional to memory consumption); see 014_stackprof.rb. But if your code allocates a small number of large objects, it won't help.
Stackprof can generate flamegraphs, and it's OK to use in production because its sampling overhead is negligible.
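A minimal sketch of Stackprof's object-allocation mode (the output path and workload are arbitrary):

```ruby
require 'stackprof'

StackProf.run(mode: :object, out: 'tmp/stackprof-object.dump') do
  100_000.times { |i| i.to_s }   # code under investigation
end

# Inspect the dump afterwards, e.g.:
#   stackprof tmp/stackprof-object.dump --text
```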
You need a RailsExpress-patched Ruby (google it). Then set the RubyProf measure mode and use one of the printers (see 015_rp_memory.rb). Don't forget to enable memory stats with `GC.enable_stats`.
Modes for memory profiling:
- MEMORY - mem usage
- ALLOCATIONS - # of object allocations
- GC_RUNS - # of GC runs (useless for optimization)
- GC_TIME - GC time (useless for optimization)
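A sketch of selecting the memory measure mode (only meaningful on a patched Ruby; `GC.enable_stats` exists only there, hence the guard):

```ruby
require 'ruby-prof'

GC.enable_stats if GC.respond_to?(:enable_stats)  # RailsExpress / GC-patched Ruby only
RubyProf.measure_mode = RubyProf::MEMORY          # or ALLOCATIONS, GC_RUNS, GC_TIME

result = RubyProf.profile do
  100_000.times { "x" * 100 }   # code under investigation
end

RubyProf::FlatPrinter.new(result).print(STDOUT)
```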
The memory profile shows only new memory allocations (not the total in use at a given time) and doesn't show GC reclaims.
! Ruby allocates a temp object for strings > 23 chars.
We can measure current memory usage, but it is not very useful.
On Linux we can use OS tools:
```ruby
memory_before = `ps -o rss= -p #{Process.pid}`.to_i / 1024
do_something
memory_after = `ps -o rss= -p #{Process.pid}`.to_i / 1024
```
`GC#stat` and `GC::Profiler` can reveal some information.
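For example, `GC.stat` can show how many objects a piece of code allocates (the key is `:total_allocated_objects` on modern Rubies; 2.1-era versions spell it `:total_allocated_object`):

```ruby
before = GC.stat[:total_allocated_objects]
100_000.times { "x" * 100 }   # code under investigation
after = GC.stat[:total_allocated_objects]

puts "allocated objects: #{after - before}"
```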
For adequate measurements we should measure a number of times and take the median value.
A lot of external (CPU, OS, latency, etc.) and internal (GC runs, etc.) factors affect measured numbers. It is impossible to entirely exclude them.
- Disable dynamic CPU frequency (governor, cpupower in Linux)
- Warm up machine
Two things can affect the application: GC and system calls (including I/O calls).
You may disable GC for measurements or force it before the benchmark with `GC.start` (but not in a loop, because new objects keep being created in it).
On Linux & macOS, process forking is available to fix that issue:
```ruby
require 'benchmark'

100.times do
  GC.start                  # clean the parent's heap before forking
  pid = fork do
    GC.start                # the child starts measuring with a clean heap
    m = Benchmark.realtime do
      # ... code under test ...
    end
    puts m                  # report from inside the child; its memory is not shared with the parent
  end
  Process.waitpid(pid)
end
```