Introduce ReplicationCoordinator to support multiple AZs #35

Open
wants to merge 1 commit into
base: main
from

flavorjones (Member) commented Jun 9, 2025

This is a draft PR to solicit some initial feedback.

I've got a branch of HEY using this at https://github.com/basecamp/haystack/pull/7837 which is deployed in the beta1 environment.

I've got a branch of Solid Queue using this interface at flavorjones/solid_queue#1

And I've got a MovableWriter gem that implements this interface for replicated SQLite (still very rough, needs to be tested and the processes table does not yet do the right thing) at https://github.com/basecamp/fizzy/pull/580


Motivation / Background

It's common for applications that are deployed across multiple availability zones (using a replicated database) to create an ad-hoc method for processes to discover which zone is "active", meaning the zone primarily responsible for writing to the database.

For example, a team may choose to use a MySQL system variable to indicate the data center where the primary database sits. In that case, they need to write code to make sure all Rails processes in all zones query this variable efficiently (it may be slow to access in non-primary zones) and are notified if the primary zone changes, as in a data center failover.

Detail

ReplicationCoordinator::Base is introduced to allow developers to write code that determines whether a process is in an active zone, and then:

  • monitor and cache that value, with configurable polling interval
  • fire callbacks when the state changes from active -> passive or vice versa

Additionally, a test helper class is provided to simplify testing failover behavior.

Finally, Rails is configured by default to use a simple concrete replication coordinator, SingleZone, which always indicates the caller is in an active zone.
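
To make the shape of this concrete, here is a minimal, self-contained sketch of the pattern described above. It is not the code in this PR: every class and method name (`SketchCoordinator`, `fetch_active_zone`, `active_zone?`, `on_active`, `on_passive`, `SketchSingleZone`, `SketchToggle`) is hypothetical, and the sketch refreshes the cached value lazily on read rather than monitoring in the background, which is a simplification.

```ruby
# Hypothetical sketch only; names and behavior are assumptions, not this PR's API.
class SketchCoordinator
  def initialize(polling_interval: 5)
    @polling_interval = polling_interval # seconds a cached value stays fresh
    @active = nil                        # cached result of the last check
    @checked_at = nil                    # time of the last check
    @callbacks = { active: [], passive: [] }
  end

  # Subclasses decide whether this process is in the active zone.
  def fetch_active_zone
    raise NotImplementedError, "#{self.class} must implement fetch_active_zone"
  end

  def on_active(&block)
    @callbacks[:active] << block
  end

  def on_passive(&block)
    @callbacks[:passive] << block
  end

  # Returns the cached value, re-checking once the polling interval has elapsed.
  # `now` is injectable so the behavior can be exercised without sleeping.
  def active_zone?(now: Process.clock_gettime(Process::CLOCK_MONOTONIC))
    refresh(now) if @checked_at.nil? || now - @checked_at >= @polling_interval
    @active
  end

  private

  def refresh(now)
    previous = @active
    @active = fetch_active_zone ? true : false
    @checked_at = now
    # No callbacks on the very first check or when nothing changed.
    return if previous.nil? || previous == @active
    @callbacks[@active ? :active : :passive].each(&:call)
  end
end

# Analogue of the default coordinator: always reports the active zone.
class SketchSingleZone < SketchCoordinator
  def fetch_active_zone
    true
  end
end

# Analogue of a test helper: the state can be flipped by hand.
class SketchToggle < SketchCoordinator
  attr_writer :active_zone

  def initialize(active_zone: true, **options)
    super(**options)
    @active_zone = active_zone
  end

  def fetch_active_zone
    @active_zone
  end
end
```

With something like `SketchToggle`, a failover test could set `active_zone = false`, read `active_zone?` after the polling interval, and assert that the `on_passive` callbacks ran.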

Additional information

TODO - link to the Solid Queue PR once it's open

Checklist

Before submitting the PR make sure the following are checked:

  • This Pull Request is related to one change. Unrelated changes should be opened in separate PRs.
  • Commit message has a detailed description of what changed and why. If this PR fixes a related issue include it in the commit message. Ex: [Fix #issue-number]
  • Tests are added or updated if you fix a bug or add a feature.
  • CHANGELOG files are updated for the changed libraries if there is a behavior change or additional feature. Minor bug fixes and documentation changes should not be included.

flavorjones added a commit to flavorjones/solid_queue that referenced this pull request Jun 9, 2025
Note that this pins Rails to a version with
ActiveSupport::ReplicationCoordinator

see basecamp/rails#35
flavorjones (Member, Author) commented Jun 10, 2025

Chatted a bit with my teammates including @kevinmcconnell, @rosa, @djmb, and @jeremy and there were some thoughts around how to make this idea more general and broadly applicable:

Additional use cases

  • how to fail over something other than the primary database, like:
    • Active Storage S3 region
    • a non-primary Active Record database
    • Redis
  • how to distinguish degrees of active/passive
    • "passive" meaning "absolutely no writes", e.g. replicated SQLite, where writes are only possible on the active host
    • "passive" meaning "some writes", e.g. replicated MySQL/PostgreSQL, where writes are possible over an inter-DC network link but are slow and should be avoided

These thoughts point in the direction of having a "replication coordinator" associated with a specific component (or set of components) rather than a single global instance.

  • e.g., with a specific relational database -- encompassing the primary and replicas
  • e.g., with a specific Active Storage service
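
One way to picture per-component coordinators is a small registry keyed by component name. This is only a sketch of the idea under discussion: `CoordinatorRegistry`, `StaticCoordinator`, and the component keys are hypothetical names, and nothing here reflects an agreed interface.

```ruby
# Hypothetical sketch: one coordinator per component instead of a global one.
class CoordinatorRegistry
  def initialize
    @coordinators = {}
  end

  # Associates a coordinator with a named component (or set of components).
  def register(component, coordinator)
    @coordinators[component.to_sym] = coordinator
  end

  # Looks up the coordinator for a component, failing loudly if none exists.
  def fetch(component)
    @coordinators.fetch(component.to_sym) do
      raise KeyError, "no replication coordinator registered for #{component}"
    end
  end
end

# Stand-in coordinator with a fixed answer, for illustration only.
StaticCoordinator = Struct.new(:active) do
  def active?
    active
  end
end

registry = CoordinatorRegistry.new
registry.register(:primary_database, StaticCoordinator.new(true))
registry.register(:active_storage, StaticCoordinator.new(false))

registry.fetch(:primary_database).active? # => true
registry.fetch(:active_storage).active?   # => false
```

The point of the design is that the primary database and an Active Storage service can fail over independently, so each one answers the "am I active?" question for itself.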

Interface

They also point toward replacing the boolean active_host? with something richer. This could be a String, or something structured like JSON, which would allow a broad set of subclasses that each solve a specific replication pattern.
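
As a rough illustration of what a structured value could look like, here is a sketch that models the two degrees of "passive" from the use cases above. The class name `ReplicationState`, the `role` and `writes` fields, and the three write modes are all assumptions made up for this example.

```ruby
require "json"

# Hypothetical sketch of a structured state replacing a plain boolean.
class ReplicationState
  # "allowed":     active zone, write freely
  # "discouraged": passive, writes possible but slow (e.g. over an inter-DC link)
  # "forbidden":   passive, no writes at all (e.g. a read-only replica host)
  WRITE_MODES = %w[allowed discouraged forbidden].freeze

  attr_reader :role, :writes

  def initialize(role:, writes:)
    unless WRITE_MODES.include?(writes)
      raise ArgumentError, "unknown write mode: #{writes.inspect}"
    end
    @role = role
    @writes = writes
  end

  # Builds a state from a JSON document such as one stored by an operator.
  def self.from_json(payload)
    data = JSON.parse(payload)
    new(role: data.fetch("role"), writes: data.fetch("writes"))
  end

  def active?
    role == "active"
  end

  # True when a write would succeed at all.
  def writable?
    writes != "forbidden"
  end

  # True when writes should be routed here by preference.
  def prefer_writes?
    writes == "allowed"
  end
end

state = ReplicationState.from_json('{"role":"passive","writes":"discouraged"}')
state.active?        # => false
state.writable?      # => true
state.prefer_writes? # => false
```

A boolean can still be derived from this (`active?`), so callers that only care about active vs. passive would not need to change.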

It would probably be worth doing an inventory of gems like redis-sentinel and rails_failover (and others) to get a feel for what some of the likely integrations look like and what sort of interface would make it easy to do that integration.

Tighter integration with Active Record

Finally, there's a potential subclass of Replication Coordinator for relational databases that interacts with ActiveRecord::Base.current_role, which might help compress some complexity.
