- Motivation
- Monitoring and Observability
- Execution
- Internal Messaging
- Storage
- External APIs
- Security
- Robustness
- Resources
As an application gains users, it can run into problems that limit its capacity to keep up with demand. Bottlenecks may emerge that degrade performance for users:
- Slow requests to load your social media timeline due to high load on the database
- Verification emails not being sent because the email-send worker is overloaded
- Timeouts being hit for API requests leading to failed checkouts
These and many other problems can drive users away and affect how well you meet your business KPIs.
This guide aims to collate various practices and approaches, some of them used in practice, for solving different types of bottlenecks. Not every one will apply to your system - take what you will.
Knowing that performance has degraded is the first key step towards identifying a solution.
Systems often rely on different services to provide certain functionalities. You may have an authentication server dedicated to managing authentication for users. You may have an API server for delivering data to your mobile app. Services like these are integral to making your application work.
If a service has crashed due to an uncaught software bug or a hardware failure, this can be detrimental.
To check whether a service has crashed, you can call it at regular intervals. If the service does not respond, it is likely to be down.
Approaches | Description |
---|---|
Manual Heartbeats | 🚧 TODO 🚧 |
Native Integration with AWS Cloudwatch | 🚧 TODO 🚧 |
Native Cloud Monitoring with Google Cloud Platform | 🚧 TODO 🚧 |
API Monitoring with Postman | 🚧 TODO 🚧 |
Website Monitoring with Better Uptime | 🚧 TODO 🚧 |
Kubernetes Cluster Monitoring | 🚧 TODO 🚧 |
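As a rough illustration of a manual heartbeat, the sketch below probes a health endpoint and only flags the service as down after several consecutive failures, so that one dropped request does not trigger an alert. The `/health` URL, the function names, and the threshold of three are assumptions made for this example, not part of any particular monitoring product.

```python
import urllib.error
import urllib.request


def is_alive(url: str, timeout: float = 2.0) -> bool:
    """Return True if the (hypothetical) health endpoint answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, or an HTTP error status.
        return False


def looks_down(results: list[bool], threshold: int = 3) -> bool:
    """Return True once the most recent `threshold` probes have all failed."""
    recent = results[-threshold:]
    return len(recent) == threshold and not any(recent)
```

A scheduler (a cron job or a small loop with `time.sleep`) would append each `is_alive(...)` result to a list and raise an alert when `looks_down(...)` becomes true.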
A service may also not be down, but it may be experiencing a degradation in performance as it does not have enough resources to handle a certain intended workload. Monitoring resource usage can help you identify when this happens.
Approaches | Description |
---|---|
Cron Job | 🚧 TODO 🚧 |
Flame Graphs | 🚧 TODO 🚧 |
Infrastructure Monitoring with Datadog | 🚧 TODO 🚧 |
Native Integration with AWS Cloudwatch | 🚧 TODO 🚧 |
Native Cloud Monitoring with Google Cloud Platform | 🚧 TODO 🚧 |
Kubernetes Metrics Server | 🚧 TODO 🚧 |
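A script run from a cron job is one simple way to sample resource usage. The sketch below uses only the Python standard library and assumes a Unix-like host (`os.getloadavg` is not available on Windows). The metric names and the limits in the usage note are illustrative assumptions.

```python
import os
import shutil


def snapshot(path: str = "/") -> dict[str, float]:
    """Sample disk usage for `path` and the 1-minute load average per core."""
    usage = shutil.disk_usage(path)
    load_1m, _, _ = os.getloadavg()  # Unix only
    return {
        "disk_used_percent": 100.0 * usage.used / usage.total,
        "load_per_core": load_1m / (os.cpu_count() or 1),
    }


def breached(sample: dict[str, float], limits: dict[str, float]) -> list[str]:
    """Return the names of the metrics in `sample` that exceed their limit."""
    return [name for name, limit in limits.items() if sample.get(name, 0.0) > limit]
```

For example, `breached(snapshot(), {"disk_used_percent": 90.0, "load_per_core": 1.0})` returns the metrics that need attention, which the cron job could then log or send to an alerting channel.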
Whilst monitoring tells you if a service is suffering an issue, observability aims to provide you with details on why the issue is occurring.
Approaches | Description |
---|---|
Logging to Console | 🚧 TODO 🚧 |
Logging to an API endpoint | 🚧 TODO 🚧 |
Logtail | 🚧 TODO 🚧 |
Splunk | 🚧 TODO 🚧 |
Datadog | 🚧 TODO 🚧 |
New Relic | 🚧 TODO 🚧 |
AppDynamics | 🚧 TODO 🚧 |
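For logging to the console, emitting one JSON object per line makes the output easy for a log collector to parse later. The following is a minimal sketch using Python's standard `logging` module. The `JsonFormatter` class, the `build_logger` helper, and the `context` field are names invented for this example.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Values passed via `extra={"context": ...}` become record attributes.
        context = getattr(record, "context", None)
        if context is not None:
            payload["context"] = context
        return json.dumps(payload, sort_keys=True)


def build_logger(name: str, stream=None) -> logging.Logger:
    """Create a logger that writes JSON lines to `stream` (stdout by default)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler(sys.stdout if stream is None else stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False
    return logger
```

Calling `build_logger("checkout").info("payment failed", extra={"context": {"order_id": 42}})` writes a single line that records both what happened and which order it happened to.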
Increasing the capacity of your application can be a key step to solving bottlenecks. This will generally optimise the performance of your application, but you should compare this to other more-dedicated solutions that may be more cost-effective.
Vertical scaling focuses on allocating more resources to a single instance. It also covers ways to optimise resource usage at the level of a single instance.
Approaches | Description |
---|---|
Caching | 🚧 TODO 🚧 |
Changing Programming Language | 🚧 TODO 🚧 |
Code Optimisation | 🚧 TODO 🚧 |
Increase Server RAM | 🚧 TODO 🚧 |
Change Server Processor Type | 🚧 TODO 🚧 |
Increase Number of Cores | 🚧 TODO 🚧 |
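Caching is often the cheapest of these to try. As a small sketch, `functools.lru_cache` keeps recent results in memory so that repeated calls skip the expensive work. The `timeline_for` function and the `CALLS` counter are hypothetical stand-ins for a slow database query and a way to observe how often it runs.

```python
from functools import lru_cache

# Counts how many times the "slow" work actually runs (for illustration only).
CALLS = {"count": 0}


@lru_cache(maxsize=1024)
def timeline_for(user_id: int) -> tuple[str, ...]:
    """Pretend to load a user's timeline; the body stands in for a slow query."""
    CALLS["count"] += 1
    return tuple(f"post-{user_id}-{n}" for n in range(3))
```

The second call to `timeline_for(7)` is served from memory. The trade-off is staleness: a real system needs a rule for when cached entries are invalidated, which this sketch does not address.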
Horizontal scaling recognises that there is a limit to how many resources you can dedicate to a single instance, and therefore uses the resources of additional instances to meet demand.
Approaches | Description |
---|---|
Moving to a Microservice Architecture | 🚧 TODO 🚧 |
Ansible Provisioning | 🚧 TODO 🚧 |
Cloud Provisioning | 🚧 TODO 🚧 |
Infrastructure-as-Code with Terraform | 🚧 TODO 🚧 |
Kubernetes Pod Scaling | 🚧 TODO 🚧 |
Docker Swarm | 🚧 TODO 🚧 |
Serverless Cloud Functions | 🚧 TODO 🚧 |
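Once there are several instances, something has to decide which one handles each request. The sketch below shows round-robin selection, one of the simplest strategies, purely to illustrate the idea. The `RoundRobinBalancer` class and the instance names are assumptions, and in practice this job is usually done by a load balancer or the orchestration platform rather than application code.

```python
import itertools


class RoundRobinBalancer:
    """Hand out instances in a fixed rotating order."""

    def __init__(self, instances: list[str]) -> None:
        if not instances:
            raise ValueError("at least one instance is required")
        self._cycle = itertools.cycle(list(instances))

    def next_instance(self) -> str:
        """Return the instance that should receive the next request."""
        return next(self._cycle)
```

With instances `["api-1", "api-2", "api-3"]`, successive calls to `next_instance()` return `api-1`, `api-2`, `api-3`, then `api-1` again.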
When a system contains different services that are run as separate processes, they may need to communicate with each other. This can be achieved by using a messaging system. Even if individual services have plenty of storage, the system can still be bottlenecked by the bandwidth available for transferring data between them.
The format of a message can affect efficiency, depending on the use case.
Approaches | Description |
---|---|
HTTP REST | 🚧 TODO 🚧 |
Websockets | 🚧 TODO 🚧 |
Data Streaming | 🚧 TODO 🚧 |
gRPC | 🚧 TODO 🚧 |
On-Trigger Cloud Functions | 🚧 TODO 🚧 |
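To make the point about message formats concrete, the sketch below encodes the same reading as JSON text and as a fixed binary layout using Python's `struct` module. The sensor reading and its field layout are invented for this example. Schema-based binary formats such as the Protocol Buffers used by gRPC rely on the same idea of a layout agreed in advance, although their actual wire encoding differs from this one.

```python
import json
import struct

# Hypothetical layout: little-endian uint32 id, float32 temperature, bool flag.
BINARY_LAYOUT = "<If?"


def encode_json(sensor_id: int, temperature: float, ok: bool) -> bytes:
    """Self-describing text encoding: field names travel with every message."""
    message = {"sensor_id": sensor_id, "temperature": temperature, "ok": ok}
    return json.dumps(message).encode("utf-8")


def encode_binary(sensor_id: int, temperature: float, ok: bool) -> bytes:
    """Compact encoding: both sides must already agree on BINARY_LAYOUT."""
    return struct.pack(BINARY_LAYOUT, sensor_id, temperature, ok)


def decode_binary(payload: bytes) -> tuple[int, float, bool]:
    """Inverse of encode_binary."""
    return struct.unpack(BINARY_LAYOUT, payload)
```

The binary form of this reading is 9 bytes, several times smaller than its JSON equivalent, at the cost of being unreadable without the layout and harder to evolve.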
There are more specialised messaging systems that can be used to deliver messages based on need. These tend to target systems with several services that may be dynamically scaled. Keeping the endpoints to call up to date can itself become a bottleneck on developer time.
Approaches | Description |
---|---|
Inline API Calls | 🚧 TODO 🚧 |
API Gateways | 🚧 TODO 🚧 |
Bi-directional APIs with Pusher | 🚧 TODO 🚧 |
RabbitMQ Messaging Queues | 🚧 TODO 🚧 |
Serverless Job Scheduling with Quirrel | 🚧 TODO 🚧 |
Message Brokers on Apache Kafka | 🚧 TODO 🚧 |
Google Pub/Sub | 🚧 TODO 🚧 |
Istio and Service Meshes | 🚧 TODO 🚧 |
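The core idea behind a message queue is that the producer hands work over and carries on, while a worker drains the queue at its own pace. The sketch below shows this within a single process using Python's `queue` and `threading` modules. It is an in-process stand-in, not a substitute for a broker such as RabbitMQ or Kafka, and `send_all` and the email addresses are hypothetical.

```python
import queue
import threading


def run_worker(jobs: queue.Queue, sent: list[str]) -> None:
    """Consume addresses until the None sentinel arrives."""
    while True:
        address = jobs.get()
        try:
            if address is None:
                return
            # Stand-in for actually sending the email.
            sent.append(f"verification email -> {address}")
        finally:
            jobs.task_done()


def send_all(addresses: list[str]) -> list[str]:
    """Queue every address, then wait for the single worker to finish."""
    jobs: queue.Queue = queue.Queue(maxsize=100)  # bounded: producers block when full
    sent: list[str] = []
    worker = threading.Thread(target=run_worker, args=(jobs, sent))
    worker.start()
    for address in addresses:
        jobs.put(address)
    jobs.put(None)  # sentinel tells the worker to stop
    jobs.join()
    worker.join()
    return sent
```

The bounded queue applies backpressure: if the worker falls behind, producers wait rather than letting the backlog grow without limit.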
Data is fundamental to an application. Being able to store and later retrieve data saves processors from having to recompute it. State management also comes under this.
OLTP is a class of storage that is designed to be used for transactional processing - your general everyday many-reads-and-many-writes workload needed for users. This needs to be handled consistently yet efficiently.
Approaches | Description |
---|---|
In-Memory | 🚧 TODO 🚧 |
Redis Caching | 🚧 TODO 🚧 |
PostgresDB | 🚧 TODO 🚧 |
MongoDB | 🚧 TODO 🚧 |
Cassandra | 🚧 TODO 🚧 |
Search Engine Elasticsearch | 🚧 TODO 🚧 |
PgBouncer for PostgresDB | 🚧 TODO 🚧 |
PgPool | 🚧 TODO 🚧 |
Database Sharding | 🚧 TODO 🚧 |
Cloud Databases | 🚧 TODO 🚧 |
Google Cloud SQL | 🚧 TODO 🚧 |
Amazon DynamoDB | 🚧 TODO 🚧 |
Mission-critical Transactional Consistency with Google Spanner | 🚧 TODO 🚧 |
Large-Scale Low-Latency with Google Cloud Bigtable | 🚧 TODO 🚧 |
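For database sharding, each record needs a stable rule that maps its key to one shard. A minimal sketch of hash-based routing follows. The `shard_for` function and the `user:42` key format are assumptions for illustration.

```python
import hashlib


def shard_for(key: str, shard_count: int) -> int:
    """Map a key to a shard index in [0, shard_count) using a stable hash."""
    if shard_count < 1:
        raise ValueError("shard_count must be at least 1")
    # hashlib is used instead of hash() because hash() of a str varies per process.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count
```

Every service that calls `shard_for("user:42", 4)` gets the same answer. Note that with plain modulo routing, changing `shard_count` remaps most keys, which is why schemes such as consistent hashing are often preferred when shards are added regularly.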
OLAP is a class of storage that is designed to be used for producing business analytics - read queries on the database tend to make up the majority of your workload, normally across large amounts of data.
Approaches | Description |
---|---|
General Databases | 🚧 TODO 🚧 |
Elasticsearch | 🚧 TODO 🚧 |
Apache Hadoop | 🚧 TODO 🚧 |
Data Warehouses | 🚧 TODO 🚧 |
Data Lakes | 🚧 TODO 🚧 |
Some data may be read very rarely, and is not needed for the day-to-day operations of an application. This is where archival comes in.
Approaches | Description |
---|---|
General Databases | 🚧 TODO 🚧 |
Cold Storage | 🚧 TODO 🚧 |
Arweave: Archiving on the Blockchain | 🚧 TODO 🚧 |
Security is a very important part of any application. You need a secure architecture that is easy to maintain and easy to change. This includes being able to scale an authentication and authorisation solution for your application to meet user demands without compromising on security.
Authentication is the ability to verify that a user is who they claim to be.
Approaches | Description |
---|---|
HTTP Basic Authentication | 🚧 TODO 🚧 |
HTTP Digest Authentication | 🚧 TODO 🚧 |
Session Cookies | 🚧 TODO 🚧 |
Self-contained Tokens with JWTs | 🚧 TODO 🚧 |
API Key Authentication | 🚧 TODO 🚧 |
Certificate-bound Access Tokens | 🚧 TODO 🚧 |
Kubernetes Key Management with Hashicorp Vault | 🚧 TODO 🚧 |
One-Key Provisioning | 🚧 TODO 🚧 |
Key Distribution Servers | 🚧 TODO 🚧 |
Authorisation is being able to determine whether a user is allowed to perform a certain action.
Approaches | Description |
---|---|
Resource Owner Password Credentials (ROPC) | 🚧 TODO 🚧 |
OAuth2 | 🚧 TODO 🚧 |
OpenID Connect | 🚧 TODO 🚧 |
Lightweight Directory Access Protocol (LDAP) | 🚧 TODO 🚧 |
Capability URIs and Macaroons | 🚧 TODO 🚧 |
Another part of security is rate-limiting. This is a way of limiting the number of requests a user can make to a particular resource within a given period. This is useful for preventing denial-of-service attacks.
Approaches | Description |
---|---|
In-Memory Store | 🚧 TODO 🚧 |
Redis | 🚧 TODO 🚧 |
Proxy Rate Limiter | 🚧 TODO 🚧 |
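An in-memory rate limiter can be sketched with the token bucket algorithm: each user has a bucket that refills at a steady rate, and a request is allowed only if a token is available. The class names, parameters, and injectable `clock` below are choices made for this example. Because the state lives in one process, this only works for a single instance, which is the usual reason for moving the counters into a shared store such as Redis.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow bursts up to `capacity`, refilled at `refill_per_second`."""

    def __init__(self, capacity: int, refill_per_second: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self) -> bool:
        now = self.clock()
        elapsed = now - self.updated
        self.updated = now
        # Top up for the time that has passed, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class RateLimiter:
    """Keep one bucket per user id."""

    def __init__(self, capacity: int, refill_per_second: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._settings = (capacity, refill_per_second, clock)
        self._buckets: dict[str, TokenBucket] = {}

    def allow(self, user_id: str) -> bool:
        if user_id not in self._buckets:
            self._buckets[user_id] = TokenBucket(*self._settings)
        return self._buckets[user_id].allow()
```

A request handler would call `limiter.allow(user_id)` and respond with HTTP 429 when it returns `False`.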
After implementing new changes, you may find that your application behaves differently, for better or for worse. Adding a scaffold for tests and running them will help you to quickly identify and fix any issues that may arise.
Building an ever-growing list of tests is a good way to check that your application still behaves as expected after every change. Catching unexpected and potentially harmful errors early ensures that they aren't deployed to production.
Approaches | Description |
---|---|
Unit Testing | 🚧 TODO 🚧 |
Component Testing | 🚧 TODO 🚧 |
Integration Testing | 🚧 TODO 🚧 |
End-to-End Load Testing | 🚧 TODO 🚧 |
Web Performance Testing | 🚧 TODO 🚧 |
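At the smallest scale, a unit test pins down the behaviour of one function so that later changes cannot silently break it. Here is a short sketch using Python's built-in `unittest` module. The `apply_discount` function is a hypothetical piece of checkout logic invented for this example.

```python
import unittest


def apply_discount(total_pence: int, percent: int) -> int:
    """Return the total after a whole-number percentage discount."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return total_pence - (total_pence * percent) // 100


class ApplyDiscountTest(unittest.TestCase):
    def test_typical_discount(self):
        self.assertEqual(apply_discount(2000, 25), 1500)

    def test_zero_discount_changes_nothing(self):
        self.assertEqual(apply_discount(2000, 0), 2000)

    def test_rejects_out_of_range_percent(self):
        with self.assertRaises(ValueError):
            apply_discount(2000, 150)
```

Saved in a file, these tests can be run with `python -m unittest`, typically as a step in the pipeline that gates each deployment.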
For mission-critical software, crashes and bugs may be incredibly detrimental. Preemptive testing is a way to find bugs or issues first, working from the assumption that the worst will occur.
Approaches | Description |
---|---|
Stress Testing | 🚧 TODO 🚧 |
Fuzzing | 🚧 TODO 🚧 |
Symbolic Execution | 🚧 TODO 🚧 |
Static Analysis | 🚧 TODO 🚧 |
Formal Verification | 🚧 TODO 🚧 |
Chaos Engineering for Microservices | 🚧 TODO 🚧 |
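Fuzzing can be illustrated with a very small random fuzzer: generate many junk inputs, feed them to a function, and record anything other than the errors you expect. The sketch below is a naive illustration of the idea. Dedicated fuzzers are coverage-guided and far more effective, and `parse_quantity`, `fuzz`, and the input alphabet are assumptions made for this example.

```python
import random
from typing import Callable


def parse_quantity(text: str) -> int:
    """Hypothetical input parser: accept a non-negative integer."""
    value = int(text.strip())
    if value < 0:
        raise ValueError("quantity must not be negative")
    return value


def fuzz(target: Callable[[str], object], runs: int = 1000,
         seed: int = 0) -> list[tuple[str, Exception]]:
    """Call `target` with random strings; collect unexpected exceptions."""
    rng = random.Random(seed)  # seeded so that failures are reproducible
    alphabet = "0123456789-+ .e\x00abc"
    crashes = []
    for _ in range(runs):
        length = rng.randint(0, 8)
        candidate = "".join(rng.choice(alphabet) for _ in range(length))
        try:
            target(candidate)
        except ValueError:
            pass  # rejecting bad input with ValueError is the documented behaviour
        except Exception as error:
            crashes.append((candidate, error))
    return crashes
```

An empty list from `fuzz(parse_quantity)` means no unexpected exception type was seen in those runs. It does not prove the parser is correct.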
These books and articles have been helpful in my development of this guide: