Highly available cluster with multiple nodes #1571
Comments
Atlantis was not designed to be set up with multiple nodes. Usually, Atlantis users do not run highly available Atlantis servers; they have one big instance or multiple instances handling different webhook integrations (maybe behind the same LB). As for the reason for having such a setup: why does it need to be HA? Infra as code is not a service, so nothing depends on it at runtime, meaning it does not need to be "up". |
As Pepe mentioned, our reliance on BoltDB is really the limiting factor here. Bolt is intended to be used as an embedded database for applications and cannot be safely shared between processes. There have been a few PR discussions around creating a unified abstraction over database access to allow for pluggable database providers, but to my knowledge no work has been done yet. I believe it would be possible to run multiple Atlantis instances with project configuration that limits each instance to handling only a subset of files, but it is not possible to run multiple instances that function as one server. |
Hi @jamengual & @acastle, As I can see, there is a Locker interface and it's implemented by boltdb.go, which persists the state into the atlantis.db file. |
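To make the "pluggable database provider" idea from the comments above concrete, here is a minimal sketch of a lock store behind an interface. Every name in it (`LockKey`, `LockBackend`, `InMemoryBackend`, `acquire`) is hypothetical and chosen for illustration; it is not the real Atlantis Locker interface, and the in-memory class only stands in for an embedded or remote store.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Protocol


@dataclass(frozen=True)
class LockKey:
    """Identifies what is being locked (hypothetical fields)."""
    repo: str
    path: str
    workspace: str


class LockBackend(Protocol):
    """Hypothetical storage interface; not Atlantis's real Locker API."""

    def try_lock(self, key: LockKey, pull: int) -> bool: ...

    def unlock(self, key: LockKey) -> Optional[int]: ...


class InMemoryBackend:
    """Stand-in for an embedded store (BoltDB-like) or a remote one (Redis-like)."""

    def __init__(self) -> None:
        self._locks: Dict[LockKey, int] = {}

    def try_lock(self, key: LockKey, pull: int) -> bool:
        # Succeeds if the key is free or already held by the same pull request.
        holder = self._locks.setdefault(key, pull)
        return holder == pull

    def unlock(self, key: LockKey) -> Optional[int]:
        # Returns the pull request that held the lock, or None if it was free.
        return self._locks.pop(key, None)


def acquire(backend: LockBackend, key: LockKey, pull: int) -> str:
    """Calling code depends only on the interface, not on the storage engine."""
    return "locked" if backend.try_lock(key, pull) else "held by another pull request"
```

With an interface like this, swapping the storage engine would not touch the calling code, which is roughly what the pluggable-provider discussions were after. It does not address plans and repository clones kept on disk.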
This is a lot of work just to itemize the needed changes and right now this is out of scope for us. |
#265 (comment) talks about some of the work. The locker isn't the hard part. It's the reliance on the filesystem for storing plans and for knowing which PRs are in progress. |
in the docs
I set up EFS and specified ATLANTIS_DATA_DIR as the mount. My first instance started fine, but when I made some other changes, Fargate started the second instance before the first instance was killed, and the second one failed with
So my question is: can a BoltDB created by instance "A" on EFS be picked up and used by instance "B"? I think we can make Fargate completely kill the old container before starting the new one. If so, all the locks will be available to the new instance, so devs don't have to "re-plan". But will Bolt have issues in that design? |
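As a rough illustration of the overlap problem described above: Bolt takes an exclusive lock on its data file while the database is open, so a second opener cannot proceed until the first one closes it. The sketch below mimics that with a plain `flock` on a temporary file. The function names are hypothetical, it runs only on Unix-like systems, and whether file locks behave the same way on EFS or another network filesystem is an assumption that would need testing.

```python
import fcntl
import os
import tempfile


def open_exclusive(path: str) -> int:
    """Open path and take a non-blocking exclusive flock on it.

    Raises BlockingIOError if another open file description holds the lock.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        raise
    return fd


def second_opener_is_blocked() -> bool:
    """Return True if a second opener cannot lock the file while the first holds it."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "example.db")
        first = open_exclusive(path)  # plays the role of instance "A"
        try:
            try:
                second = open_exclusive(path)  # plays the role of instance "B"
            except BlockingIOError:
                return True
            os.close(second)
            return False
        finally:
            os.close(first)  # closing the descriptor releases the lock
```

In this sketch the lock goes away as soon as the first holder closes the file, which is why stopping the old container completely before starting the new one is the direction to try.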
We currently run a single Atlantis instance, but I landed here because I was considering the Provider Plugin Cache for Atlantis, and it explicitly mentions that
so I got interested in whether it is possible to run multiple replicas of Atlantis and whether the cache should be per replica or not. The answer for me is that I don't need to consider the multi-instance scenario just yet. I think I might not be the only one, and it would be worth mentioning in the Atlantis docs that it is expected to run just a single Atlantis replica. EDIT: I am also wondering how one can run Atlantis as a Kubernetes Deployment, where it is not guaranteed that there will always be just a single replica. |
Hi @jamengual & @acastle, I would like to follow up this topic as the Redis locking DB is available now. Referring to my original question, is it feasible to use Atlantis with multiple nodes as of now? I can envision a two tenant "cluster" environment with an active and passive node, the locking DB is hosted in Azure Redis and the working directory is on a shared drive. Only one node is active at any time period in order to avoid any interference, a load balancer (e.g. Azure Traffic Manager) would monitor the active node and the nodes could be swapped in case of any issue of the active one. Is this design feasible? |
So to have HA with Atlantis using Redis, you still need a way to share the Atlantis data dir between containers. If you do that, you can have active-active containers, and some people are already running like that. |
Sharing the data dir is easily achievable in our Azure environment. But won't we have any issues when multiple active nodes are writing the same data dir files? Does the current Atlantis design exclude this? |
No, because the lock is now in Redis (if you enable it).
|
To close this ticket, I think some official docs are needed on how redis locking can be used to spin up more than one instance/pod of atlantis |
I agree.
|
Any news when this will be officially documented regarding implementation? This also blocks: terraform-aws-modules/terraform-aws-atlantis#322 |
@gartemiev none. This is an open source project and we depend 100% on user contributions. Please feel free to try out this feature, experiment, and see what works. If you can get it working and document it, everyone would appreciate it. |
did you enable Redis locking? are you running parallel plans and applies?
|
So in order to have HA in Atlantis:
I'll test this Relates to: |
You may not need to share disk space. I'm unsure of this since i haven't tested it, but it's possible that redis is housing not only the lock but possibly the plans as well. Please test with shared disk space and without. This will be handy in documentation on the website |
NFS or any shared file system is required since the plans are NOT stored in Redis. There were some weird behaviours, like this error being shown in multiple Atlantis instances, and multiple times in some of them. This might not be related to the solution:
{"level":"error","ts":"2023-03-03T16:04:32.624Z","caller":"logging/simple_logger.go:163","msg":"invalid key: b5bacfe9-e187-4e6b-af0a-d169b785e0a2","json":{},"stacktrace":"github.com/runatlantis/atlantis/server/logging.(*StructuredLogger).Log\n\tgithub.com/runatlantis/atlantis/server/logging/simple_logger.go:163\ngithub.com/runatlantis/atlantis/server/controllers.(*JobsController).respond\n\tgithub.com/runatlantis/atlantis/server/controllers/jobs_controller.go:92\ngithub.com/runatlantis/atlantis/server/controllers.(*JobsController).getProjectJobsWS\n\tgithub.com/runatlantis/atlantis/server/controllers/jobs_controller.go:70\ngithub.com/runatlantis/atlantis/server/controllers.(*JobsController).GetProjectJobsWS\n\tgithub.com/runatlantis/atlantis/server/controllers/jobs_controller.go:83\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2109\ngithub.com/gorilla/mux.(*Router).ServeHTTP\n\tgithub.com/gorilla/mux@v1.8.0/mux.go:210\ngithub.com/urfave/negroni/v3.Wrap.func1\n\tgithub.com/urfave/negroni/v3@v3.0.0/negroni.go:59\ngithub.com/urfave/negroni/v3.HandlerFunc.ServeHTTP\n\tgithub.com/urfave/negroni/v3@v3.0.0/negroni.go:33\ngithub.com/urfave/negroni/v3.middleware.ServeHTTP\n\tgithub.com/urfave/negroni/v3@v3.0.0/negroni.go:51\ngithub.com/runatlantis/atlantis/server.(*RequestLogger).ServeHTTP\n\tgithub.com/runatlantis/atlantis/server/middleware.go:70\ngithub.com/urfave/negroni/v3.middleware.ServeHTTP\n\tgithub.com/urfave/negroni/v3@v3.0.0/negroni.go:51\ngithub.com/urfave/negroni/v3.(*Recovery).ServeHTTP\n\tgithub.com/urfave/negroni/v3@v3.0.0/recovery.go:210\ngithub.com/urfave/negroni/v3.middleware.ServeHTTP\n\tgithub.com/urfave/negroni/v3@v3.0.0/negroni.go:51\ngithub.com/urfave/negroni/v3.(*Negroni).ServeHTTP\n\tgithub.com/urfave/negroni/v3@v3.0.0/negroni.go:111\nnet/http.serverHandler.ServeHTTP\n\tnet/http/server.go:2947\nnet/http.(*conn).serve\n\tnet/http/server.go:1991"}
Also, with multiple Atlantis instances behind the LB, when you look for the logs of a plan you may or may not get them, since the request is load balanced 😅. Maybe with some flags the Atlantis logs could also be "centralized" so any Atlantis instance can show you the logs. Also, at least with NFS it felt slow, so looking into storing the plans in Redis might improve this 👀 cc: @nitrocode |
Ooofa... Thanks for the update |
How about a failover mechanism with a shared PV/PVC? I don't think HA with multiple nodes is a good fit for Terraform and Atlantis, because normally only one worker at a time can execute plan/apply against any given tfstate. So how about supporting a failover mechanism like:
I think a standby instance could just solve this easily. |
If you run in Kubernetes or an Autoscaling group of 1, you'd already get that experience though, @anhdle14. |
yeah that is true, I was thinking more of a failover scenario when cluster went down for a particular zone/region. I think for my case I actually need to have the deployment on multiple clusters but this should work for a single cluster deployment. atlantis.example.com/team point to different deployment etc... |
I am working with @tapaszto who originally opened this thread. Since I can see there are some recent comments here let me share my thoughts. We have been using Atlantis for almost 3 years. First, we were hosting it in an ACI then migrated to App Service and right now we are discussing moving to AKS. Since the Redis option became available for the lock DB we were planning to make our environment more resilient. The ultimate goal would be to have a multi-zone and multi-region active-active-active deployment. Our preference would be to stay on App Service, that said there are certain storage limitations there. Since the repo content is still stored on disk, the disk needs to be shared across the nodes. For that either we use Azure Files (SMB) or Blobfuse (with AKS), but both of these are at least 5x slower than writing the content to a local disk. These are not options sadly because of the performance. AKS is now offering shared ZRS Managed Disk support which we are actively exploring. This might solve the zone redundancy requirement if we move to AKS but still will not solve the geo-redundancy requirement. For now, we are considering a primary-secondary (active-passive) deployment, potentially sharing the locking database across regions but not the files as there is no technical solution for that. I think that the next step for this project when it comes to resiliency is to have a solution for the git content/plan files. |
I was pointed to the https://github.com/lyft/atlantis fork which makes use of temporal workflows. That would be a heavy lift to pull in but something fully distributed like that is what I'd prefer vs NFS shares. |
@jukie - Thanks for sharing this. I spent some time going over it and I definitely have some questions and thoughts. I went through the README of project Neptune. I can see how Temporal and its engine could potentially handle failures and enable HA even across regions. That said, it would be great to understand whether project Neptune is just a fork planned for use at Lyft, or whether there is a plan to merge it back upstream in some shape or form as a new major version in the future. It seems the Neptune workflow is targeting Terraform actions happening after a PR merge: This is a big behavioral change which (at least for us) would not be the preferred way of handling deployments. There might be some edge cases, but the majority of our deployments must happen the way they happen today for consistency purposes: the code cannot be merged before a terraform apply succeeds. For the type of workflow Neptune tries to cover, we already have options like CI/CD pipelines. Do not get me wrong, this is useful and I see the value, but this fork raises a lot of questions in my mind, and it would be really great to see what the future of the upstream version of Atlantis is. |
@nishkrishnan from Lyft may have some comments about this too. I think Atlantis is great, but it is lacking in a few areas when it comes to enterprise deployments and very busy deployments. The current workarounds work, but at its core Atlantis was not built to be highly available, and that is a requirement for some companies. I'm not opposed (but I'm not the only maintainer) to expanding the Redis usage, or maybe even bringing some of the Lyft work upstream with the modifications needed to keep the current flow, into an Atlantis 2.0 for example. This kind of effort will need coordination (which I'm willing to provide) and multiple people actively working on and committed to it. Multiple companies contributing to this is possible too. |
Hey, I can speak a little bit about Lyft. We completely rearchitected Atlantis to the point that a lot of the original stuff in there is pretty much unused/deleted. Atlantis in its current state is great in terms of flexibility, but that's a double-edged sword and especially impacts the testability and iterative development of the product. So in order to ease the rearchitecture and simplify things a bit, we've made an opinionated version, in a way, with fewer features but enough to POC the new backend in a reasonable amount of time. That said, I don't believe it's worth it to try and re-integrate with upstream given the divergence. I see it as a new product entirely with a heavier dependency tree (i.e. Temporal). Lyft initially wanted to have another repo in the Atlantis org owned by us where we could own, build and iterate on this version, but there were some political differences that stopped us, so we kept our work in our fork. As for what we plan to do with it, I think that depends on general interest. I'd love to hear from the community about their use cases/setups etc. I'm usually out and about in the Atlantis Slack channel, so feel free to hmu and we can chat. |
I like the idea building the platform on Temporal, as I mentioned above, I totally see the benefits. When it comes to use cases and setups, I will try to collect some of ours:
I think the above are the most important ones. In my opinion there should be a tactical and a strategic solution. The tactical could be supporting Redis for the file store, while I could easily imagine a strategic end goal for a new more sophisticated major version maybe based on Temporal and pulling some of the Lyft code in. Let me know your thoughts. |
I agree, but whatever is built still needs to support the current VCS types we have and a streamlined configuration method (so we don't have 150 flags) with the major and most popular options, which the Lyft fork does not support. As for the geo settings, I think that can be achieved at the infrastructure level, so if just an HA version is built we can improve from there. |
Definitely, If we choose Redis for the file backend, that would solve the zone/geo requirement. https://redis.com/redis-enterprise/technology/active-active-geo-distribution/ |
I think there is something to work on here that can address Atlantis without having a massive lift and shift of the backend like Lyft chose to do with Temporal. As Nish said, it's an entirely new product at that point. My focus as of late as a new maintainer is trying to organize the project at a higher level after a transition in maintainers. We've set up a new Google Group and calendar event for Office Hours to try and organize the community around the key pain points where Atlantis is lacking. The core issue we have is that we need to be backward compatible to a certain degree and our release process needs to reflect that. As Pepe mentioned, the over-abundance of configuration flags shows the feature set fracturing when there is no clear direction on what problems Atlantis is trying to solve. You'll soon see some structure around higher-level objectives that the community as a whole is experiencing, as we try to organize the community, especially around reliability and scalability. For example, there have been numerous regressions lately due to new features that attempt to solve an edge case for one user but break entire features for the rest of the community. We have to remember that as an open-source project we do not have the time or resources to compete with paid offerings, and should not. We will never reach feature parity at the same level of quality. We need to be focusing on core workflows/features that address the majority of the needs of the community. |
For the sake of this issue, I see these things being bottlenecks for HA:
A proposal from the community is welcome for solving these, as it is a bigger architectural shift than the current single-binary design. I will set up some templates soon for proposals (taking influence from other OSS/CNCF/SIG projects). Let me know what you all think. |
I'd be interested in this functionality, but from more of a scaling perspective rather than availability. The Atlantis instance at my company is inactive 99% of the time (which I imagine is the case for most). Due to this, we run Atlantis in a micro GCP instance. This generally works fine until you have a large Terraform project to plan or a few parallel plans at the same time (where the instance becomes CPU bound). Ideally I'd like to run the 'scheduler' side of things in the micro VM, which basically just handles webhooks and the Redis queue. In addition to this, there would be an autoscaling pool of Atlantis workers which pick up plans and applies from the Redis queue. If the plans were kept in Redis, maybe this would work? For my use case it wouldn't matter too much if there is a short initialisation time for the worker to pull the repo etc. Happy to help with contributions where required. |
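The scheduler/worker split proposed above can be sketched roughly as follows. This is only an illustration of the shape of the idea, not an existing Atlantis feature: `queue.Queue` stands in for a Redis-backed queue, threads stand in for autoscaled workers, and the function name `run_pool` and the job strings are hypothetical.

```python
import queue
import threading
from typing import Dict, List


def run_pool(jobs: List[str], workers: int) -> Dict[str, str]:
    """Enqueue jobs (scheduler side) and let a pool of workers drain the queue."""
    pending: queue.Queue = queue.Queue()  # stand-in for a Redis-backed queue
    results: Dict[str, str] = {}
    results_lock = threading.Lock()

    def worker(name: str) -> None:
        while True:
            job = pending.get()
            if job is None:  # sentinel: no more work for this worker
                return
            # A real worker would clone the repo and run terraform here.
            outcome = f"planned by {name}"
            with results_lock:
                results[job] = outcome

    threads = [
        threading.Thread(target=worker, args=(f"worker-{i}",))
        for i in range(workers)
    ]
    for t in threads:
        t.start()
    for job in jobs:  # the lightweight "scheduler" only enqueues work
        pending.put(job)
    for _ in threads:  # one sentinel per worker so every thread exits
        pending.put(None)
    for t in threads:
        t.join()
    return results
```

The sketch leaves out the hard part raised elsewhere in this thread: a plan produced by one worker would have to be available to whichever worker later runs the apply.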
So, we got plans/applies/locks to Redis, which is great, but still compel ReadWriteOnce in the helm chart for the PVC template on the StatefulSet? Any reason for that? If plans/applies/locks can now go to Redis, aren't the only filesystem objects the Terraform binaries it downloads, plus cloned git repositories, with the locking and concurrency governed by the plans/applies/locks managed by Redis entries? If making that configurable in the Helm chart is the only blocker there, I can have a PR up in five minutes. I just want to make sure I'm not missing something important before I put one up. |
There is still information in BoltDB (the Atlantis DB) that was not migrated to Redis because it required a lot of other code changes. So Redis locking is not enough by itself to make it HA; you will have other issues after that.
|
@jamengual dreams, dashed! Ah well. Will continue to keep eyes-on this issue and hope :) |
We are using Atlantis as a single pod in EKS but the problem is when we wanna upgrade EKS cluster it kills the Atlantis pod and the pipeline fails. If we could use a disruption budget we could solve the problem but we need an Atlantis cluster. Running Atlantis in EC2 also can solve the problem but we are trying to just use EKS not other services. |
@jamengual - Maybe I am missing something here. I thought that if you use Redis there is no BoltDB file in the file system. You are right that there are other issues, like the previously mentioned repository content that comes from git. |
I'm no expert on this part of the code, but I remember there were other things stored in BoltDB; it is not just the locks that are stored there. |
Only the locks are stored in redis, not plan files yet as far as I'm aware. |
The plan files were never stored in the database. The plan files are on the filesystem. I just deployed an instance with Redis and I cannot see the BoltDB file. |
I am doing some testing on my end, so let me add some further details. Lock resiliency is currently resolved by Redis. When you enable Redis there will not be a file based (BoltDB) locking database created. Annotations
The below is for Azure DevOps. It parses the JSON body of the hooks and extracts the PR URL as a "key" to base the session persistence on. Nginx then hashes this value and, based on the number of instances, sends the traffic to a certain backend. This is deterministic, so as long as you have the same number of instances behind it, the session will persist and will be bound to the same instance.
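The routing idea can be sketched outside of Nginx like this. It is an illustration under stated assumptions: the `resource.url` field path, the backend names, and both function names are hypothetical, and SHA-256 with a modulo is not the hashing Nginx itself uses.

```python
import hashlib
import json
from typing import List


def pr_key(body: bytes) -> str:
    """Extract a per-PR key from a webhook body (the field path is an assumption)."""
    payload = json.loads(body)
    return payload["resource"]["url"]


def pick_backend(key: str, backends: List[str]) -> str:
    """Map a key to one backend deterministically: hash, then modulo the pool size."""
    if not backends:
        raise ValueError("no backends configured")
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(backends)
    return backends[index]
```

Because the index depends on `len(backends)`, adding or removing an instance can move an in-flight PR to a different backend, which is why the number of instances has to stay the same for the session to persist.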
Then there are still a few caveats:
An alternative option which I considered was to write a wrapper around the entire thing. The wrapper would be the "frontend" of the app which then would handle persistence based on its own database. While I do not like the idea of maintaining something extra and homegrown, theoretically it would work except again some caveats:
|
Your setup might work on K8s with nginx ingress, but if you were in AWS using ECS you would need to use WAF to do that, and that is just not the best solution IMHO. Anything that requires you to look at the body of a request to decide where to send it is not scalable or easy to maintain, nor is it good practice from an application development perspective. As you noted about the users hitting the logs page and having to refresh to find the If the Atlantis API offered a way to push the received event to the right server after the request was received, that would solve that issue. (As you noted too, if there is a way to get the jobID+server from the API to send the request.) Redis solves some issues, but there is data that is not there. Maybe if we could store the job ID, workspace status (PR repo clones), and such, we could make the API more powerful to help redirect the calls to the right server. |
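One way to picture the "store the job ID and server" suggestion is a shared registry that any instance can consult before serving a request. The sketch below is hypothetical throughout: `JobRegistry`, `route_log_request`, and the instance names are made up for illustration, a dict stands in for a shared store such as Redis, and nothing here reflects an existing Atlantis API.

```python
from typing import Dict, Optional


class JobRegistry:
    """Maps a job ID to the instance that ran it (a dict stands in for a shared store)."""

    def __init__(self) -> None:
        self._owners: Dict[str, str] = {}

    def register(self, job_id: str, instance: str) -> None:
        self._owners[job_id] = instance

    def owner(self, job_id: str) -> Optional[str]:
        return self._owners.get(job_id)


def route_log_request(registry: JobRegistry, job_id: str, local_instance: str) -> str:
    """Decide what the instance that received a log request should do with it."""
    owner = registry.owner(job_id)
    if owner is None:
        return "not found"
    if owner == local_instance:
        return "serve locally"
    return f"forward to {owner}"
```

With a lookup like this, the load balancer would not need to inspect request bodies; whichever instance receives the request could forward it to the owner.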
@jamengual - We have a hard requirement to increase the resiliency of our internal setup by the end of this year. I 100% agree with you that using the ingress to do this is not ideal, nor should be done in a perfect world, but I just cannot rule out any workaround at this point due to the requirements. The above was meant to be sort of a summary for everyone who might have similar requirements/desire to understand what the issues are. I agree with you that most of this should be solved on the Atlantis code level and not with infra workarounds. |
totally @Dilergore we all have constraints in our companies and I totally understand where you are coming from, I had to do something similar using AWS WAF due to the business requirements. |
Hi Folks! |
Awesome @Pardeep009. Thanks for sharing this. It's very useful. Hopefully, your changes for problem #2.2 will get merged into Atlantis officially |
@Pardeep009: is there a PR open here to incorporate your changes for the additional lock? |
There is one in draft state; it requires more work before I raise it for review. |
I meant: is there a PR upstream here instead of in your fork? I like where your idea is heading, I wonder if @jamengual had a reason to keep this lock separate from the other lock that allows you to configure a backend. |
No, there is no PR in the upstream here. |
I did not code the original lock implementation
@Pardeep009: maybe it would be good for you to explain in a new issue what your thought process is (you can re-use the points from your blog post). It'd be good to get some feedback from the atlantis engineering team. |
@Pardeep009 but this solution is not valid for EKS StatefulSet, because even if you have an EFS, a pvc is created for each replica and they have different volumes |
Reviving this thread 😁 The helm chart currently supports using a storage class that can be configured as EFS. So one less blocker. |
We are trying to set up a highly available Atlantis cluster with multiple nodes for our prod environment and are currently testing with two nodes behind a load balancer. In order to have the nodes share the same data/status, we deployed the Atlantis data folder as a common file share (Azure Files) and mounted this share on both nodes, but unfortunately both nodes start to fail and throw the application exceptions that I attached.
Questions:
Can the same set of data files be shared among multiple Atlantis server instances as we envisioned?
Is this issue due to a specific file locking mechanism of Atlantis?
Can this issue be fixed by a code change, or is it not easily achieved with a small amount of code change? We intend to put development effort into it if it is easily achievable.
Generally, what is the advice/best practice for having a highly available Atlantis environment with multiple nodes?