[WIP] reqs: project requirements #95

base: main

Conversation
* Application router
* High-availability by fault tolerance
* Load-balancing by requests distribution
ALB ?
EKS uses ELB.
Yes, but which type? Application or Network LB?
I had to look into this. We're using an NLB, and we're actually setting it in our config for Traefik:
service.beta.kubernetes.io/aws-load-balancer-type: nlb
OK, thanks.
So we have the ELB-provisioned NLB doing L3/L4 load balancing in front of Traefik doing the L7 route balancing, right?
Correct.
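For reference, a minimal sketch of how this annotation might be wired through the Traefik Helm chart values, assuming the chart's `service.annotations` key (the file layout and chart version are assumptions, not taken from this PR):

```yaml
# values.yaml for the Traefik Helm chart (hypothetical excerpt)
# Asks the AWS cloud controller to provision a Network Load Balancer
# instead of the default Classic ELB for Traefik's Service of type LoadBalancer.
service:
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: nlb
```

The NLB then forwards TCP traffic (L3/L4) to the Traefik pods, which handle the L7 routing described above.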
### Databases: PostgreSQL

* Helm chart
We're using Zalando's postgres-operator. The operator installs a CRD and watches the k8s cluster for manifests to initiate Postgres clusters.
OK... And the operator is installed using Helm, right?
Yes.
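As an illustration of the operator-driven workflow described above, a minimal `postgresql` manifest could look like the sketch below (cluster name, team id, users and sizes are hypothetical; only the CRD shape follows the Zalando postgres-operator documentation):

```yaml
apiVersion: "acid.zalan.do/v1"
kind: postgresql
metadata:
  name: acid-app-cluster           # hypothetical cluster name
spec:
  teamId: "acid"                   # hypothetical team id
  numberOfInstances: 2             # one primary + one streaming replica
  volume:
    size: 5Gi                      # hypothetical volume size
  postgresql:
    version: "15"                  # hypothetical major version
  users:
    app_user: []                   # hypothetical application role
  databases:
    app_db: app_user               # hypothetical database owned by app_user
```

The operator watches for manifests like this one and spins up and manages the corresponding Postgres cluster.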
#### NAT Gateways (NGW)

* Per AZ egress
In our setup there's only 1 NGW for the whole cluster, to save a bit on costs. (This is configurable.)
Looks like a SPOF to me, unless there's a failover mechanism to pop an NGW in another AZ in case of a problem... Is there such a mechanism in place?
The documentation says:
If single_nat_gateway = true, then all private subnets will route their Internet traffic through this single NAT gateway. The NAT gateway will be placed in the first public subnet in your public_subnets block.
This doesn't sound like automatic failover to me... It makes sense to move to one_nat_gateway_per_az = true.
This is done, code committed.
## Observability

### Log management (ELK/EFK)
We don't use ELK right now.
Do we have some sort of log aggregation and management? Loki? Graylog? Bare rsyslog aggregation to start with?
I don't think this is a box that can be left unchecked.
- Prometheus pulls the metrics and Grafana uses its database to display the data in dashboards.
- Otherwise, in the development and preprod environments, K9s is used to read the logs.
- We also use the dashboard to follow our budget, which is limited to 80 USD for the whole project.
This is missing right now. I'll look into it.
Loki looks like a nice & cool NKOTB, but the fact that it doesn't index log contents makes it one of a kind that I'd need to get to know...
I tend to favor EFK over ELK because Fluentd and Fluent Bit look lighter and leaner than that fat Java stash.
Finally, Graylog has always looked like a nice integrated, batteries-included solution.
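If the Loki route were chosen, one possible starting point is the grafana/loki-stack Helm chart; the values below are a hypothetical sketch (key names depend on the chart version) that reuses the Grafana instance already installed by kube-prometheus-stack:

```yaml
# Hypothetical values.yaml for the grafana/loki-stack chart
loki:
  enabled: true          # the log store / query engine
promtail:
  enabled: true          # DaemonSet shipping pod logs from every node to Loki
grafana:
  enabled: false         # reuse the Grafana from kube-prometheus-stack; add Loki there as a datasource
```

An EFK or Graylog setup would instead ship logs with Fluentd/Fluent Bit to a different backend, but the per-node shipper pattern stays the same.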
### Metrics (Prometheus/Grafana)
We use metrics from the following sources:
- Kubernetes (comes preinstalled with the kube-prometheus-stack Helm package)
- the postgres exporter, which runs as a sidecar alongside each of our Postgres clusters
- our FastAPI app, using the Instrumentator
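For the FastAPI metrics mentioned above, scraping is typically declared with a ServiceMonitor picked up by kube-prometheus-stack; the sketch below is hypothetical (names, labels and port are assumptions, and the `release` label must match the actual Helm release name):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: fastapi-app                      # hypothetical name
  labels:
    release: kube-prometheus-stack       # assumed to match the stack's serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: fastapi-app                   # hypothetical label on the app's Service
  endpoints:
    - port: http                         # assumed named port on the Service
      path: /metrics                     # default path exposed by the Instrumentator
```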
Does the kube-prometheus-stack collect metrics about nodes?
What would be the metrics monitoring black holes at this point?
- EC2/nodes
- ELB/Traefik?
- EBS?
There's metrics collection happening for the nodes inside the cluster. Additional monitoring from outside the cluster could be useful (e.g. EBS utilisation, k8s cluster health, etc.).
We have very little customisation on top of the default kube-prometheus-stack installation.
You mean the kube monitoring stack doesn't watch k8s cluster health?!
I believe a solid list of the metrics (families or targets) that are or should be implemented is in order, to show that we know where we have to keep our eyes.
### Event and alerting
Alerting is missing. We would need to add at least something.
What would we add if we had time? What would be the plan?
I believe it's more important to have an unimplemented plan than no plan at all ;)
- Database master unplanned failover
- CPU utilisation too high (per node and per critical pod)
- Memory utilisation too high (per node and per critical pod)
- Disk utilisation too high (EBS)
- AZ lost
- Database backup errors
- Autoscaling limit hit (both EKS and pod autoscaling)
- Too many pod restarts
- Pods not getting scheduled due to resource issues
- Application-specific issues (too many HTTP errors, etc.)
Just to name a few items.
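Most of the items above could be expressed as PrometheusRule resources consumed by the existing kube-prometheus-stack; here is a hedged sketch for the "too many pod restarts" case (names, labels and thresholds are hypothetical):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: app-alerts                       # hypothetical name
  labels:
    release: kube-prometheus-stack       # assumed to match the stack's ruleSelector
spec:
  groups:
    - name: pods
      rules:
        - alert: TooManyPodRestarts
          # kube-state-metrics counter of container restarts, summed over the last hour
          expr: increase(kube_pod_container_status_restarts_total[1h]) > 5
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "{{ $labels.namespace }}/{{ $labels.pod }} restarted more than 5 times in the last hour"
```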
A few indeed ;)
Actually, I was wondering which alerting channel(s) and solution(s) we would implement.
Snail mail/e-mail/XMPP (remember?)/Slack/cloud signals/AWS SNS...?
Any nice contender spotted? I used to like Sensu, but I still have to update my tech watch on that point...
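Whatever channel is picked, it would plug into the Alertmanager shipped with kube-prometheus-stack; a Slack receiver, for example, might be sketched as below (the webhook URL and channel are placeholders, and the exact values layout should be checked against the chart version in use):

```yaml
# Hypothetical excerpt of kube-prometheus-stack values
alertmanager:
  config:
    route:
      receiver: slack-notifications
      group_by: ["alertname", "namespace"]
    receivers:
      - name: slack-notifications
        slack_configs:
          - api_url: https://hooks.slack.com/services/REPLACE_ME   # placeholder webhook
            channel: "#alerts"                                     # hypothetical channel
            send_resolved: true
```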
### TODO Recap HA features

### Backup
Our image is immutable.
Our database is backed up automatically, using the built-in feature of the postgres-operator.
There's a base backup created every noon UTC, and the WAL is shipped throughout the day. The backup is stored in an S3 bucket in the US.
What about:
- Git repository backup
- Images and other build artifacts (Helm charts, etc.) backup
So our Recovery Point Objective (RPO) is 1 day, right?
What's the backup location, conservation and rotation policy?
3/2/1?
Grandfather-father-son?
Git backup: we looked into it, but have nothing ready at this point.
Artifacts backup: no plan at the moment.
Our recovery point is 1 day by default, plus we have WAL transfers every 16 MB (I think that's the default). We're using default values; we can be more specific if needed.
The backup is stored in an S3 bucket in the US.
The backup rotation keeps the last 5 backups at the moment. We might want to fine-tune it.
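For reference, the noon-UTC base backup and the keep-last-5 rotation are the kind of settings Spilo reads from environment variables; a hypothetical sketch of how they might be surfaced through the operator's pod environment ConfigMap (variable names should be verified against the Spilo/postgres-operator versions in use):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: postgres-pod-config      # hypothetical name, referenced by the operator's pod_environment_configmap setting
data:
  BACKUP_SCHEDULE: "0 12 * * *"  # daily base backup at 12:00 UTC
  BACKUP_NUM_TO_RETAIN: "5"      # keep the last 5 base backups
```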
S3 looks like a very reliable backend, especially with CRR, and economically sound with Glacier transitioning.
It allows for implementation of the 3/2/1 rule:
- 3 copies: the easy one
- 2 media: S3 + Glacier Instant Retrieval archive
- 1 "offsite": Glacier Flexible Retrieval / Deep Archive
As well as the good old GFS:
- Son: daily to S3, rotated weekly/monthly
- Father: weekly to Glacier Instant Retrieval, rotated monthly/quarterly
- Grandfather: monthly to Glacier Flexible Retrieval, rotated bi-yearly/yearly
The oldest member of each generation could be transitioned to Deep Archive for an extended period of time.
So, what about:
- GitHub backup to S3
- Deployment artifacts (container images): CRR and lifecycle policies
- EBS: one day looks like an eternity to me, could we add a pinch of snapshotting into that pot? Something like hourly.
### Disaster Recovery
The database loads the latest backup automatically when launched with an empty dataset (i.e. during the initialisation phase).
That's a really nice feat!
So what's our Recovery Time Objective (RTO)?
How long does the production env take to build from scratch?
What's the procedure? How is it tested?
The whole stack comes online in about 20 minutes (including infra and pods). As our DB is small, the restore is fast. The data transfer happens within AWS's own network, so the restore should be reasonably fast even with a larger dataset -- though beyond a GB or so I'd do further testing.
Yes, it's tested: we have repeatedly started and stopped the infra, and it comes back online automatically.
All nice and sound!
Further on testing, how would we implement automated testing of our backups?
Some kind of "restore" env alongside [pre]prod, in another region? Or should we simply live-test on the DR site?
This leads me to our DR strategy... Are we going for a cold or a warm site?
Create design section
Move requirements from project to design section
Create sub-section for user stories to split them from requirements
Create architecture and specifications sub-sections
Fixes #10