
Optimize checking gates #172

Merged
merged 1 commit into tompave:master on Aug 7, 2024
Conversation

@up2jj (Contributor) commented Apr 5, 2024

We use actor-based gates pretty extensively during new feature rollouts, and we noticed significant slowdowns when a flag contains a large number of gates (>20k) and is checked multiple times.

The current approach relies on an intermediary list created by Enum.filter/2. With the new approach, only a reduce is used.
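
For illustration, the difference boils down to something like the sketch below. The struct shape and function names are invented for the example and are not the library's internals; the sketch uses `Enum.reduce_while/3` for the early exit, while the PR itself uses a plain reduce.

```elixir
# Illustrative sketch only: gate shape and function names are made up.
defmodule GateCheckSketch do
  # Old style: first build an intermediary list of actor gates,
  # then search that list for the given actor.
  def check_actor_gate_old(gates, actor_id) do
    gates
    |> Enum.filter(fn gate -> gate.type == :actor end)
    |> Enum.find(fn gate -> gate.for == actor_id end)
  end

  # New style: a single pass over the gates, stopping early on a match,
  # with no intermediary list allocated.
  def check_actor_gate_new(gates, actor_id) do
    Enum.reduce_while(gates, nil, fn
      %{type: :actor, for: ^actor_id} = gate, _acc -> {:halt, gate}
      _gate, acc -> {:cont, acc}
    end)
  end
end
```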

Benchmark below (5k gates):

--------------------------------------------------------------
$CACHE_ENABLED=true
$PERSISTENCE=ecto
$RDBMS=postgres
$PUBSUB_BROKER=phoenix_pubsub
$CI=
--------------------------------------------------------------
Elixir version:        1.16.2
Erlang/OTP version:    Erlang/OTP 26 [erts-14.0] [source] [64-bit] [smp:10:10] [ds:10:10:10] [async-threads:1] [jit]
Logger level:          :error
Cache enabled:         true
Persistence adapter:   FunWithFlags.Store.Persistent.Ecto
RDBMS driver:          Ecto.Adapters.Postgres
Notifications adapter: FunWithFlags.Notifications.PhoenixPubSub
Anything using Redis:  false
--------------------------------------------------------------
Excluding tags: [:redis_pubsub, :redis_persistence]

Operating System: macOS
CPU Information: Apple M1 Max
Number of Available Cores: 10
Available memory: 64 GB
Elixir 1.16.2
Erlang 26.0
JIT enabled: true

Benchmark suite executing with the following configuration:
warmup: 2 s
time: 5 s
memory time: 2 s
reduction time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 18 s

Benchmarking new ...
Benchmarking old ...
Calculating statistics...
Formatting results...

Name                 ips        average  deviation         median         99th %
new        1.49 K      669.04 μs     ±1.22%      667.92 μs      692.98 μs
old        1.29 K      775.76 μs    ±12.13%      736.63 μs     1028.36 μs

Comparison:
new        1.49 K
old        1.29 K - 1.16x slower +106.73 μs

Memory usage statistics:

Name          Memory usage
new       468.84 KB
old       546.98 KB - 1.17x memory usage +78.14 KB

**All measurements for memory usage were the same**
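
For reference, a comparison like the one above could be set up with Benchee roughly as follows. The gate construction and the two check functions are the placeholders from the sketch earlier in this description, not the library's code.

```elixir
# Hypothetical Benchee setup mirroring the configuration printed above.
actor_ids = Enum.map(1..5_000, &"user:#{&1}")
gates = Enum.map(actor_ids, &%{type: :actor, for: &1, enabled: true})
target = Enum.random(actor_ids)

Benchee.run(
  %{
    "old" => fn -> GateCheckSketch.check_actor_gate_old(gates, target) end,
    "new" => fn -> GateCheckSketch.check_actor_gate_new(gates, target) end
  },
  warmup: 2,
  time: 5,
  memory_time: 2
)
```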

@tompave (Owner) commented Apr 5, 2024

Hey there, thank you for using the library and for the PR.

This seems a sensible change, thanks!
I'll run some local benchmarks. In the meantime, do you think it would make sense to extend this change to the group gates?

As an aside, I'm glad to hear that the library scales to over 20k actor gates 😆, but in that case would it make sense to use a group gate instead? I doubt that you're managing 20k actors with the web UI, you must be doing so programmatically (maybe with a remote console). In that case, you could also flip some attribute on your users to make them part of a beta-testers group.
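
For reference, the group-gate alternative would look roughly like this, assuming a hypothetical `MyApp.User` struct with a boolean `beta_tester` attribute:

```elixir
# Assumes a hypothetical MyApp.User struct with a boolean :beta_tester field.
defimpl FunWithFlags.Group, for: MyApp.User do
  def in?(%MyApp.User{beta_tester: beta}, :beta_testers), do: beta
  def in?(_user, _group_name), do: false
end

# Enable the flag once for the whole group instead of thousands of actors.
FunWithFlags.enable(:new_feature, for_group: :beta_testers)

# Checked the same way as with actor gates.
FunWithFlags.enabled?(:new_feature, for: user)
```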

@mentero commented Apr 5, 2024

Hey, we often use fun_with_flags to coordinate feature rollouts. We have data migration tasks that go over our actors one by one, do the data migration necessary for the feature to work, and, if the migration is successful, enable the flag for that actor.
After that, we usually go and create a boolean gate and start the code clean-up so we can remove the flag completely.

This usually worked fine, and we really enjoyed the fine-grained control over the release process, but in this particular case the flag was heavily used inside the codebase, and we were taken by surprise by the accumulated performance hit of the cache lookups as the flag filled with more and more actor gates. This of course made a lot of sense once we discovered the issue, and the hotfix was quite simple: we replaced all the actor gates with a single boolean gate (as the migration was done by that time).
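
A sketch of that hotfix, with `:my_feature`, `MyApp.Repo`, and `MyApp.User` as placeholder names, using `FunWithFlags.clear/2` to drop the per-actor gates once the flag is fully on:

```elixir
# Placeholder names; not taken from the actual codebase.
# Turn the flag fully on, then remove the actor gates so the
# flag's gate list stays small.
FunWithFlags.enable(:my_feature)

for user <- MyApp.Repo.all(MyApp.User) do
  FunWithFlags.clear(:my_feature, for_actor: user)
end
```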

We are working on a coldfix right now, with a few ideas in mind, such as moving the flag checks higher up the call stack or introducing yet another cache layer that remembers all the flags for a given actor, which should be a more suitable strategy for our case.
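
A minimal sketch of that kind of per-actor memoization, assuming actors expose an `id` and checks go through a small wrapper (all names here are invented):

```elixir
defmodule MyApp.FlagMemo do
  # Hypothetical per-process memo around FunWithFlags.enabled?/2: a flag
  # that is checked many times for the same actor within one process only
  # walks the gate list once.
  def enabled?(flag_name, actor) do
    key = {__MODULE__, flag_name, actor.id}

    case Process.get(key) do
      nil ->
        result = FunWithFlags.enabled?(flag_name, for: actor)
        Process.put(key, result)
        result

      cached ->
        cached
    end
  end
end
```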

@up2jj (Contributor, Author) commented Apr 8, 2024

> Hey there, thank you for using the library and for the PR.
>
> This seems a sensible change, thanks! I'll run some local benchmarks. In the meantime, do you think it would make sense to extend this change to the group gates?
>
> As an aside, I'm glad to hear that the library scales to over 20k actor gates 😆, but in that case would it make sense to use a group gate instead? I doubt that you're managing 20k actors with the web UI, you must be doing so programmatically (maybe with a remote console). In that case, you could also flip some attribute on your users to make them part of a beta-testers group.

@tompave taking into account that all of these gate types are stored in the same structure (a list), a large number of actor gates can affect the checking of other gate types as well. I would like to take the same approach for groups, so I will update the PR accordingly. Please let me know if there is anything else I can do.

@up2jj changed the title from "Optimize checking actor gate" to "Optimize checking gates" on Apr 8, 2024
@tompave (Owner) commented Apr 12, 2024

Hello, thank you for iterating on this.

I just wanted to align expectations on the timeline. This week I've been too busy to look at this, and I'll be travelling for the next few weeks. I'll look at this when I'm back, in May.

@up2jj (Contributor, Author) commented May 22, 2024

Hello @tompave

I hope your trip went well 😄 Could I help you in any way with this issue?

@tompave (Owner) commented Aug 7, 2024

Hey there, apologies for the long wait. I've tested this locally and it's good to be merged!

Thank you for identifying this area of improvement and for doing the work.

@tompave (Owner) commented Aug 7, 2024

I've done a bit of commit surgery to remove some extra whitespace changes and get the PR to build with the latest CI config.

@tompave merged commit 3ba4763 into tompave:master on Aug 7, 2024
10 checks passed