Caching
The cache tier is a temporary data store layer, much faster than the database. The benefits of having a separate cache tier include better system performance, reduced database workloads, and the ability to scale the cache tier independently.
After receiving a request, a web server first checks whether the cache has the available response. If it does, it sends the data back to the client. If not, it queries the database, stores the response in the cache, and sends it back to the client. This caching strategy is called a read-through cache. Other caching strategies are available depending on the data type, size, and access patterns. A previous study explains how different caching strategies work [6].
Interacting with cache servers is simple because most cache servers provide APIs for common programming languages. The following code snippet shows typical Memcached APIs:
SECONDS = 1
cache.set('myKey', 'hi there', 3600 * SECONDS)
cache.get('myKey')
Here are a few considerations for using a cache system:
Consider using a cache when data is read frequently but modified infrequently. Since cached data is stored in volatile memory, a cache server is not ideal for persisting data. For instance, if a cache server restarts, all the data in memory is lost. Thus, important data should be saved in persistent data stores.
It is good practice to implement an expiration policy. Once cached data expires, it is removed from the cache. Without an expiration policy, cached data is stored in memory permanently. It is advisable not to make the expiration period too short, as this causes the system to reload data from the database too frequently; meanwhile, it is advisable not to make it too long, as the data can become stale.
Consistency is another concern: it involves keeping the data store and the cache in sync. Inconsistency can happen because data-modifying operations on the data store and the cache are not in a single transaction. When scaling across multiple regions, maintaining consistency between the data store and the cache is challenging. For further details, refer to the paper “Scaling Memcache at Facebook” published by Facebook [7].
A single cache server represents a potential single point of failure (SPOF), defined in Wikipedia as follows: “A single point of failure (SPOF) is a part of a system that, if it fails, will stop the entire system from working” [8]. As a result, multiple cache servers across different data centers are recommended to avoid a SPOF. Another recommended approach is to overprovision the required memory by a certain percentage. This provides a buffer as memory usage increases.
Once the cache is full, any request to add items to the cache might cause existing items to be removed. This is called cache eviction. Least recently used (LRU) is the most popular cache eviction policy; a minimal sketch follows. Other eviction policies, such as least frequently used (LFU) or first in, first out (FIFO), can be adopted to satisfy different use cases.
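To make LRU eviction concrete, here is a minimal sketch in Python using an OrderedDict; the class name and capacity parameter are illustrative, not from any particular cache library.

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU cache: evicts the least-recently-used key when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()  # keys ordered from least to most recently used

    def get(self, key):
        if key not in self.items:
            return None                  # cache miss
        self.items.move_to_end(key)      # mark as most recently used
        return self.items[key]

    def set(self, key, value):
        if key in self.items:
            self.items.move_to_end(key)
        self.items[key] = value
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict the least recently used item
```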
Cache-aside is perhaps the most commonly used caching approach, at least in the projects that I worked on. The cache sits on the side and the application talks directly to both the cache and the database. There is no connection between the cache and the primary database; all operations on the cache and the database are handled by the application.
- The application first checks the cache.
- If the data is found in the cache, it’s a cache hit. The data is read and returned to the client.
- If the data is not found in the cache, it’s a cache miss. The application has to do some extra work: it queries the database to read the data, returns it to the client, and stores the data in the cache so that subsequent reads for the same data result in a cache hit (see the sketch after this list).
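A minimal cache-aside read path might look like the following sketch; `cache`, `db`, and the `get_user` helper are hypothetical stand-ins for a Memcached/Redis client and a database layer, and the TTL is an assumed value.

```python
CACHE_TTL = 3600  # seconds; an assumed expiration period

def get_user(user_id, cache, db):
    """Cache-aside read: check the cache first, fall back to the database."""
    key = f"user:{user_id}"
    user = cache.get(key)              # 1. check the cache
    if user is not None:
        return user                    # 2. cache hit
    user = db.query_user(user_id)      # 3. cache miss: read from the database
    cache.set(key, user, CACHE_TTL)    # 4. populate the cache for later reads
    return user
```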
Cache-aside caches are usually general purpose and work best for read-heavy workloads. Memcached and Redis are widely used. Systems using cache-aside are resilient to cache failures: if the cache cluster goes down, the system can still operate by going directly to the database. (Although this doesn’t help much if the cache goes down during peak load; response times can become terrible, and in the worst case, the database can stop working.)
Another benefit is that the data model in the cache can be different from the data model in the database. For example, the response generated as a result of multiple queries can be stored against some request ID.
When cache-aside is used, the most common write strategy is to write data to the database directly. When this happens, the cache may become inconsistent with the database. To deal with this, developers generally use a time to live (TTL) and continue serving stale data until the TTL expires. If data freshness must be guaranteed, developers either invalidate the cache entry or use an appropriate write strategy, as we’ll explore later.
A read-through cache sits in line with the database. When there is a cache miss, it loads the missing data from the database, populates the cache, and returns it to the application.
Both cache-aside and read-through strategies load data lazily, that is, only when it is first read.
While read-through and cache-aside are very similar, there are at least two key differences:
- In cache-aside, the application is responsible for fetching data from the database and populating the cache. In read-through, this logic is usually supported by the library or stand-alone cache provider.
- Unlike cache-aside, the data model in a read-through cache cannot be different from that of the database.
Read-through caches work best for read-heavy workloads where the same data is requested many times; a news story, for example. The disadvantage is that when the data is requested for the first time, it always results in a cache miss and incurs the extra penalty of loading the data into the cache. Developers deal with this by ‘warming’ or ‘pre-heating’ the cache by issuing queries manually. Just like cache-aside, it is also possible for data to become inconsistent between the cache and the database, and the solution lies in the write strategy, as we’ll see next.
In a write-through cache:
- The application writes the data directly to the cache.
- The cache updates the data in the main database. When the write is complete, both the cache and the database have the same value and the cache always remains consistent.
On its own, a write-through cache doesn’t seem to do much; in fact, it introduces extra write latency because data is written to the cache first and then to the main database (two write operations). But when paired with a read-through cache, we get all the benefits of read-through and we also get a data consistency guarantee, freeing us from cache invalidation (assuming ALL writes to the database go through the cache).
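A minimal write-through sketch under the same hypothetical `cache` and `db` handles; the key point is that the database write happens synchronously before the write is acknowledged.

```python
def write_through(key, value, cache, db):
    """Write-through: apply the write to the cache and, synchronously,
    to the main database (two write operations)."""
    cache.set(key, value)  # 1. write to the cache
    db.write(key, value)   # 2. synchronously write to the database
    # Only now is the write acknowledged, so the cache and the database
    # never diverge (at the cost of extra write latency).
```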
DynamoDB Accelerator (DAX) is a good example of read-through / write-through cache. It sits inline with DynamoDB and your application. Reads and writes to DynamoDB can be done through DAX. (Side note: If you are planning to use DAX, please make sure you familiarize yourself with its data consistency model and how it interplays with DynamoDB.)
In write-around, data is written directly to the database and only the data that is read makes its way into the cache.
Write-around can be combined with read-through and provides good performance in situations where data is written once and read less frequently or never; for example, real-time logs or chatroom messages. Likewise, this pattern can be combined with cache-aside as well (a sketch follows).
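A minimal write-around sketch, again with hypothetical `cache` and `db` handles; deleting the cached key on write is one common way to avoid serving stale data when write-around is paired with cache-aside or read-through.

```python
def write_around(key, value, cache, db):
    """Write-around: write directly to the database, bypassing the cache."""
    db.write(key, value)  # the write never touches the cache
    cache.delete(key)     # optional: invalidate any stale cached copy
    # The value enters the cache only later, if and when it is read
    # (via the cache-aside or read-through path).
```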
In write-back (also called write-behind), the application writes data to the cache, which stores the data and acknowledges to the application immediately. Then, later, the cache writes the data back to the database.
This is very similar to write-through, but there’s one crucial difference: in write-through, the data written to the cache is synchronously updated in the main database; in write-back, the data written to the cache is asynchronously updated in the main database. From the application’s perspective, writes to write-back caches are faster because only the cache needs to be updated before returning a response.
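A minimal write-back sketch, once more with hypothetical `cache` and `db` handles; a background thread flushes queued writes to the database asynchronously.

```python
import queue
import threading

write_queue = queue.Queue()

def write_back(key, value, cache):
    """Write-back: update the cache and acknowledge immediately;
    the database update is deferred."""
    cache.set(key, value)
    write_queue.put((key, value))

def flusher(db):
    """Background worker that asynchronously persists queued writes.
    Batching/coalescing could be added here to reduce database load."""
    while True:
        key, value = write_queue.get()
        db.write(key, value)
        write_queue.task_done()

# Start the flusher once at startup, e.g.:
# threading.Thread(target=flusher, args=(db,), daemon=True).start()
```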
Write-back caches improve write performance and are good for write-heavy workloads. When combined with read-through, this works well for mixed workloads, where the most recently updated and accessed data is always available in the cache.
Write-back is resilient to database failures and can tolerate some database downtime. If batching or coalescing is supported, it can reduce overall writes to the database, which decreases the load and reduces costs if the database provider charges by the number of requests, e.g. DynamoDB. Keep in mind that DAX is write-through, so you won’t see any reduction in costs if your application is write-heavy. (When I first heard of DAX, this was my first question - DynamoDB can be very expensive, but damn you Amazon.)
Some developers use Redis for both cache-aside and write-back to better absorb spikes during peak load. The main disadvantage is that if there’s a cache failure, the data may be permanently lost.
Most relational database storage engines (e.g. InnoDB) have a write-back cache enabled by default in their internals: queries are first written to memory and eventually flushed to the disk.
Redis use cases
- Calculating whose friends are online using sets.
- Memcached on steroids.
- Distributed lock manager for process coordination.
- Full text inverted index lookups.
- Tag clouds.
- Leaderboards. Sorted sets for maintaining high score tables.
- Circular log buffers.
- Database for university course availability information. If the set contains the course ID, it has an open seat. Data is scraped and processed continuously and there are ~7200 courses.
- Server-backed sessions (a random cookie value which is then associated with a larger chunk of serialized data on the server) are a very poor fit for relational databases. They are often created for every visitor, even those who stumble in from Google and then leave, never to return again. They then hang around for weeks taking up valuable database space. They are never queried by anything other than their primary key.
- Fast, atomically incremented counters are a great fit for offering real-time statistics.
- Polling the database every few seconds is cheap in a key-value store. If you're sharding your data you'll need a central lookup service for quickly determining which shard is being used for a specific user's data. A replicated Redis cluster is a great solution here - GitHub uses exactly that to manage sharding their many repositories between different backend file servers.
- Transient data. Any transient data used by your application is also a good fit for Redis. CSRF tokens (to prove a POST submission came from a form you served up, and not a form on a malicious third-party site) need to be stored for a short while, as does handshake data for various security protocols.
- Incredibly easy to set up and ridiculously fast (30,000 reads or writes a second on a laptop with the default configuration).
- Share state between processes. Run a long-running batch job in one Python interpreter (say loading a few million lines of CSV into a Redis key/value lookup table) and run another interpreter to play with the data that’s already been collected, even as the first process is streaming data in. You can quit and restart your interpreters without losing any data.
- Create heat maps of the BNP’s membership list for the Guardian
- Redis semantics map closely to Python native data types, so you don’t have to think for more than a few seconds about how to represent data.
- A simple capped log implementation (similar to a MongoDB capped collection): push items onto the tail of a ’log’ key and use LTRIM to retain only the last X items. You could use this to keep track of what a system is doing right now without having to worry about storing ever-increasing amounts of logging information.
- An interesting example of an application built on Redis is Hurl, a tool for debugging HTTP requests built in 48 hours by Leah Culver and Chris Wanstrath.
- It’s common to use MySQL as the backend for storing and retrieving what are essentially key/value pairs. I’ve seen this over-and-over when someone needs to maintain a bit of state, session data, counters, small lists, and so on. When MySQL isn’t able to keep up with the volume, we often turn to memcached as a write-thru cache. But there’s a bit of a mis-match at work here.
- With sets, we can also keep track of ALL of the IDs that have been used for records in the system.
- Quickly pick a random item from a set.
- API limiting. This is a great fit for Redis, as a rate-limiting check needs to be made for every single API hit, which involves both reading and writing short-lived data (see the sketch after this list).
- A/B testing is another perfect task for Redis - it involves tracking user behaviour in real-time, making writes for every navigation action a user takes, storing short-lived persistent state and picking random items.
- Implementing the inbox method with Redis is simple: each user gets a queue (a capped queue if you're worried about memory running out) to work as their inbox and a set to keep track of the other users who are following them. Ashton Kutcher has over 5,000,000 followers on Twitter - at 100,000 writes a second it would take less than a minute to fan a message out to all of those inboxes.
- Publish/subscribe is perfect for broadcasting updates (such as election results) to hundreds of thousands of simultaneously connected users. Blocking queue primitives mean message queues without polling.
- Have workers periodically report their load average into a sorted set.
- Redistribute load. When you want to issue a job, grab the three least-loaded workers from the sorted set and pick one of them at random (to avoid the thundering herd problem); a sketch follows after this list.
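To make the API-limiting use case above concrete, here is a minimal fixed-window rate limiter sketched with the redis-py client; the key format, limit, and window are illustrative assumptions.

```python
import time
import redis  # assumes the redis-py client

r = redis.Redis()

def allow_request(user_id, limit=100, window=60):
    """Fixed-window rate limiter: at most `limit` hits per `window` seconds."""
    key = f"rate:{user_id}:{int(time.time() // window)}"
    count = r.incr(key)        # atomic increment; creates the key at 1
    if count == 1:
        r.expire(key, window)  # first hit in this window sets the TTL
    return count <= limit
```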
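And a sketch of the load-reporting and redistribution items, using a Redis sorted set via redis-py; the `worker_loads` key name is an assumption.

```python
import random
import redis

r = redis.Redis()

def report_load(worker_id, load_average):
    """Each worker periodically reports its load average into a sorted set."""
    r.zadd("worker_loads", {worker_id: load_average})

def pick_worker():
    """Grab the three least-loaded workers and pick one at random
    (to avoid the thundering herd problem)."""
    candidates = r.zrange("worker_loads", 0, 2)  # lowest scores first
    return random.choice(candidates) if candidates else None
```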
Graph database (Neo4j) use cases
- Multiple GIS indexes.
- Recommendation engine based on relationships.
- Web-of-things data flows.
- Social graph representation.
- Dynamic schemas so schemas don't have to be designed up-front. Building the data model in code, on the fly by adding properties and relationships, dramatically simplifies code.
- Reducing the impedance mismatch because the data model in the database can more closely match the data model in the application.
Tutorial
https://static.simonwillison.net/static/2010/redis-tutorial/
- http://highscalability.com/blog/2010/12/6/what-the-heck-are-you-actually-using-nosql-for.html
- Hacker News and Reddit threads on this post. More use case suggestions there.
- List of NoSQL Systems
- Pig - infrastructure to support ad-hoc analysis of very large data sets.
- Digging Deeper Into Data With Hadoop By Gary Orenstein
- NoSQL East 2009 - Summary of Day 1 by Eivind Uggedal
- The “NoSQL” approach: struggling to see the benefits by Neil Saunders
- Design Patterns for Distributed Non-Relational Databases by Todd Lipcon.
- The Future Is Big Data in the Cloud By Ping Li
- One size does not fit all: “document stores”, “NoSQL databases”, ODBMSs by Roberto V. Zicari
- NoSQL is for niches By Dana Blankenhorn.
- MongoDB Use Cases
- MongoDB and Ecommerce by Kyle Banker
- Archiving - a good MongoDB use case?
- Five Reasons to Use NoSQL by Jeremiah Peschka
- The Business Case for NoSQL, NoETL and NoProblems by Loraine Lawson
- Is It Time For NoETL? by Seth Grimes
- Holy Large Hadron Collider, Batman!
- I Can't Wait for NoSQL to Die by Ted Dziuba.
- NoSQL Basics, Benefits and Best-Fit Scenarios by Curt Monash
- Redis Tutorial by Simon Willison
- Why I Like Redis by Simon Willison
- Redis: Lightweight key/value Store That Goes the Extra Mile by Jeremy Zawodny
- Remote Dictionary Server
- Is NoSQL for me? I’m just a small fish by Hadi Hariri
- Why Big Enterprises are Interested in NoSQL by Jon Moore
- Visual Guide to NoSQL Systems by Nathan Hurst
- You Can't Sacrifice Partition Tolerance by Coda Hale
- NoSQL Netflix Use Case Comparison by Adrian Cockcroft
- Comparison guide of horizontally scalable datastores by Rick Cattell
- Weak Consistency and CAP Implications by Ilya Grigorik
- Schema-Free MySQL vs NoSQL by Ilya Grigorik
- Neo4j - 5 Cool Graph Examples
- Has anyone used Graph-based Databases?
- Neo Technology Use Cases
- CouchOne Ataxo Case Study
- NOSQL: scaling to size and scaling to complexity by Emil Eifrem
- NoSQL, NoProblem (Not Really… but it’s still awesome) by Jeremy Pinkham
- An Expert's Guide to Oracle Technology by Lewis Cunningham
- NoSQL vs SQL, Why Not Both? By Alaric Snell-Pym
- Databases: relational vs object vs graph vs document by On Target
- Use cases are driving the divergence, and the convergence, of NoSQL solutions by James Phillips
- The beginning of the end of NoSQL by Matthew Aslett
- Going NoSQL with MongoDB by Ted Neward
- NoSQL Ecosystem by Jonathan Ellis
- The New Dimension of NoSQL Scalability: Complexity by Alex Popescu
- Quick Reference to Alternative data storages by Alex Popescu
- 6 Reasons Why Relational Database Will Be Superseded by Robin Bloor
- NoSQL Misconceptions by Ben Scofield
- SQL Databases Don't Scale by Adam Wiggins
- “One Size Fits All”: An Idea Whose Time Has Come and Gone by Michael Stonebraker and Uğur Çetintemel
- Future of RDBMS is RAM Clouds & SSD by Ilya Grigorik
- To scale or not to scale: Key/Value, Document, SQL, JPA by Uri Cohen