-
Notifications
You must be signed in to change notification settings - Fork 65
Add a new cache implementation using DynamoDB #1470
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
alexaryn
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really like the modularity of this. We probably need a follow-up to make the whole file use the "bytes payload" semantics and to report hit rates.
lib/sycamore/sycamore/utils/cache.py
Outdated
| super().__init__() | ||
| region_name, table_name, hash_key_name = self.parse_path(path) | ||
| self.hash_key_name = hash_key_name if hash_key_name is not None else "hash_key" | ||
| self.dynamodb = boto3.resource("dynamodb", region_name=region_name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why use the resource API instead of the non-deprecated client API?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't know. I definitely don't want to use it if it's deprecated! I will change it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps "deprecated" isn't precisely the word:
boto/boto3#3563
lib/sycamore/sycamore/utils/cache.py
Outdated
|
|
||
| parts = path[6:].split("/") | ||
| if len(parts) < 2: | ||
| raise ValueError("DynamoDB cache paths must have 'region_name' (us-east-1, e.g.) and 'table_name'") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just use the current region? Would this even work if the DDB region is different than the current region?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can make cross-region calls. It is good to make the region explicit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But would we ever want cross-region caching? Certainly not across GDPR boundaries. And even within the US, the network latency may be bad. I feel like this is a detail that's easy to get wrong and we should instead default it to the 99% right answer and encourage use of the default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is useful and we do it all the time. I run this code in us-west-1, but the table I use is typically in us-east-1.
Also, when you run this on a laptop, how do you get the current region? Via an environment variable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my opinion, we should design for deployment. What you're describing is development and I think we can use an environment variable for laptops. Obviously, development in EC2 should be able to use InstanceMetadataRegionFetcher. On my laptop, I have region set in .aws/config but we don't seem to have code to get that (yet).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an open source project and we can't assume deployment/use on EC2 (alone).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So you're saying we should require region here and do the defaulting in the server? I can buy that. And environment variables will work for laptop server development.
lib/sycamore/sycamore/utils/cache.py
Outdated
| from botocore.exceptions import ClientError | ||
|
|
||
| BLOCK_SIZE = 1048576 # 1 MiB | ||
| PAGE_CACHE_TTL = timedelta(days=10) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems too specific for a file implementing general cache facilities. Can it move, or be renamed DDB_CACHE_TTL? Does it need a mechanism to remain in sync with the 10 * 24 * 3600 below?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, wrong name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I replaced 10243600 with DDB_CACHE_TTL.
lib/sycamore/sycamore/utils/cache.py
Outdated
| ddb://<region_name>/<table_name>[/<hash_key_name>] | ||
| where 'hash_key_name' defaults to 'hash_key' if left unspecified. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's distinguish between the DDB concept "hash key" and the caching concept "cache key". Unless we're specifically dealing with DDB, I'd avoid "hash key". There's no actual requirement that the "cache key" be any sort of hash; it could be a JSON string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is DDB-specific. I don't think this introduces any confusion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was confused. I'm pretty sure in this case that the hash_key_name should be cache_key. After all, in a class named DynamoDbCache and a table named partitioner_page_cache, a column named cache_key would seem more expected. If it weren't a reserved word, I'd even suggest the concise key.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
lib/sycamore/sycamore/utils/cache.py
Outdated
| def parse_path(path: str) -> tuple[str, str, Optional[str]]: | ||
| assert path.startswith("ddb://"), "DynamoDB cache paths must start with ddb://" | ||
|
|
||
| parts = path[6:].split("/") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems safer to do split("/", 2). Then, below, it can just be return tuple(parts). Or, you could decide that returning a list is OK.
lib/sycamore/sycamore/utils/cache.py
Outdated
| if len(parts) < 2: | ||
| raise ValueError("DynamoDB cache paths must have 'region_name' (us-east-1, e.g.) and 'table_name'") | ||
| if len(parts) == 2: | ||
| return parts[0], parts[1], None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return (*parts, None)
lib/sycamore/sycamore/utils/cache.py
Outdated
|
|
||
| return parts[0], parts[1], parts[2] | ||
|
|
||
| def get(self, hash_key: str): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we hint the payload as bytes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
lib/sycamore/sycamore/utils/cache.py
Outdated
| except ClientError as error: | ||
| logging.error(f"Error calling get_item({key}) on {self.table_name} : {error}") | ||
|
|
||
| self.total_accesses += 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Often caches will account separately for hits and misses and only total them up when reporting the hit rate. This can be superior as it's fewer increments at probe-time and uses the integer range more frugally. If multiple threads will use the cache, these counters either need a mutex or thread-local approach.
All the cache implementations should have a get_hit_rate() -> float method. We should log this after every document we process, or every N pages, or whatever, in the partitioner.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
get_hit_rate is in the parent class. It does rely on each implementation keeping track of hits and total.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Upon reflection, we should rip out the get_hit_rate() stuff entirely (while we still can) and replace with get_hit_info() that returns a tuple. Dawn-of-time hit rates are dumb, and can be easily calculated from hits and misses. A better option is get_metrics() which returns a metrics object which we can extend later.
lib/sycamore/sycamore/utils/cache.py
Outdated
| logging.error(f"Error calling get_item({key}) on {self.table_name} : {error}") | ||
|
|
||
| self.total_accesses += 1 | ||
| if res is not None and "Item" in res and "payload" in res["Item"]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is doing redundant lookups. Why not...
if res:
if item := res.get("Item"):
if payload := item.get("payload"):
with self.mutex:
self.hits += 1
return payload.value
with self.mutex:
self.misses += 1
return None
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
lib/sycamore/sycamore/utils/cache.py
Outdated
| from botocore.exceptions import ClientError | ||
|
|
||
| BLOCK_SIZE = 1048576 # 1 MiB | ||
| DDB_CACHE_TTL: int = int(timedelta(days=10).total_seconds()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the only use of timedelta and it's rather obfuscated. Why not have one of these?
DDB_CACHE_TTL = 10 * 86400 # 10 days in seconds
DDB_CACHE_TTL = 10 * 24 * 60 * 60 # 10 days in seconds
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
lib/sycamore/sycamore/utils/cache.py
Outdated
| self.cache_hits = 0 | ||
| self.total_accesses = 0 | ||
| self.mutex = threading.Lock() | ||
| self.cache_hits: int = 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Inside the Cache class, prefixing with "cache_" is probably not helpful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
lib/sycamore/sycamore/utils/cache.py
Outdated
| with self.mutex: | ||
| self.cache_misses += 1 | ||
|
|
||
| def get_hit_rate(self) -> float: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should just call get_hit_info() and then do the arithmetic outside the lock.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd just nuke this method in order to force people to avoid problematic dawn-of-time averages.
lib/sycamore/sycamore/utils/cache.py
Outdated
| self.cache_hits += 1 | ||
| self.total_accesses += 1 | ||
| self.inc_hits() | ||
| return v |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could use an else so there's one return.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
got it.
lib/sycamore/sycamore/utils/cache.py
Outdated
| super().__init__() | ||
| region_name, table_name, hash_key_name = self.parse_path(path) | ||
| self.hash_key_name = hash_key_name if hash_key_name is not None else "hash_key" | ||
| self.dynamodb = boto3.resource("dynamodb", region_name=region_name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps "deprecated" isn't precisely the word:
boto/boto3#3563
lib/sycamore/sycamore/utils/cache.py
Outdated
| return s3_cache_deserializer, (kwargs,) | ||
|
|
||
|
|
||
| class DynamoDBCache(Cache): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Quibble about camel case. Should be DynamoDbCache to tokenize properly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure
lib/sycamore/sycamore/utils/cache.py
Outdated
|
|
||
| super().__init__() | ||
| scheme, _, region_name, table_name, hash_key_name = self.parse_path(path) | ||
| self.hash_key_name = hash_key_name if hash_key_name is not None else "hash_key" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The more I think about it, the less I like the idea of a default here. Make them specify it. Especially if we're not going to default region, table name, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
lib/sycamore/sycamore/utils/cache.py
Outdated
| def parse_path(path: str): | ||
| assert path.startswith("ddb://"), "DynamoDB cache paths must start with ddb://" | ||
|
|
||
| parts = path.split("/", 5) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How many parts do we expect. I thought 4 or 5. If so, we should set max splits to 4, right? 4 splits yields 5 elements.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
5
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If 5 parts, I'd expect to pass 4 to split() and not care if the last component (cache_key) has slashes in it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to 4.
lib/sycamore/sycamore/utils/cache.py
Outdated
|
|
||
| return tuple(parts) | ||
|
|
||
| def get(self, hash_key: str) -> Optional[bytes]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be cache_key. Somebody using the cache should not care about DDB data architecture. The same API should work for on-disk, S3, DDB, in-memory, or whatever cache variant they get from the factory. And there's no requirement that the key be a hash.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i will change it to 'key'.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
key is good, except with DDB where it's a reserved word and causes problems.
lib/sycamore/sycamore/utils/cache.py
Outdated
|
|
||
| def get(self, hash_key: str) -> Optional[bytes]: | ||
| key = {self.hash_key_name: hash_key} | ||
| res: dict[Any, Any] = {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we narrow it down more, like dict[str, Any]? Also, it looks like it can also be None. In fact, it's probably better to initialize it to None.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, str.
alexaryn
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we want to allow people to cache None? If so, we could create a special CacheMiss class/object we could return. Under the covers, I don't think all storage layers can distinguish None from a zero-byte payload.
Can you also add NullCache here? It should be trivial. Having it in the factory as null:// will help with test setups.
This is going to be great.
lib/sycamore/sycamore/utils/cache.py
Outdated
| self.misses += 1 | ||
|
|
||
| def get_hit_rate(self) -> float: | ||
| with self.mutex: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a redundant level of locking. With some types of locks, it will result in a deadlock.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll remove it from the interface.
lib/sycamore/sycamore/utils/cache.py
Outdated
| with self.mutex: | ||
| self.cache_misses += 1 | ||
|
|
||
| def get_hit_rate(self) -> float: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd just nuke this method in order to force people to avoid problematic dawn-of-time averages.
lib/sycamore/sycamore/utils/cache.py
Outdated
| def parse_path(path: str): | ||
| assert path.startswith("ddb://"), "DynamoDB cache paths must start with ddb://" | ||
|
|
||
| parts = path.split("/", 5) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If 5 parts, I'd expect to pass 4 to split() and not care if the last component (cache_key) has slashes in it.
lib/sycamore/sycamore/utils/cache.py
Outdated
|
|
||
| def get(self, key: str) -> Optional[bytes]: | ||
| key = {self.cache_key: key} | ||
| res: dict[str, Any] = {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This must be Optional[dict[str, any]] based on later code. if so, I'd avoid creating a useless dict and initialize to None.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
lib/sycamore/sycamore/utils/cache.py
Outdated
| return None | ||
|
|
||
| def set(self, key: str, value: bytes): | ||
| ttl = int(time.time()) + self.ttl |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding "now" to TTL, doesn't result in "time to live"; it results in "expiration". Change variable name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok.
alexaryn
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In light of recent discussion, if you want to get rid of the region-defaulting stuff and make it the responsibility of whoever sets up the cache, that would be fine, too. Less AWS creep.
| assert cache.total_accesses == 0 | ||
| assert cache.hits == 0 | ||
| hits, misses = cache.get_hit_info() | ||
| assert hits + misses == 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should just have:
hits, misses = cache.get_hit_info()
assert hits == 0
assert misses == 0
or
assert cache.get_hit_info() == (0, 0)
which doesn't make clear which number is which.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
| assert cache.total_accesses == 1 | ||
| assert cache.hits == 0 | ||
| hits, misses = cache.get_hit_info() | ||
| assert hits + misses == 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be:
hits, misses = cache.get_hit_info()
assert hits == 0
assert misses == 1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
| assert cache.total_accesses == 2 | ||
| assert cache.hits == 1 | ||
| hits, misses = cache.get_hit_info() | ||
| assert hits + misses == 2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hits, misses = cache.get_hit_info()
assert hits == 1
assert misses == 1
| if cached_layout: | ||
| logger.info(f"Cache Hit for ImageToJson. Cache hit-rate is {self.cache.get_hit_rate()}") | ||
| hits, misses = self.cache.get_hit_info() | ||
| hit_rate = hits / (hits + misses) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this comes up a lot, we probably want to provide a safediv() function somewhere, maybe even in cache.py.
def safediv(n, d):
return n / d if d else 0
Alternately, we just change these outputs to print the two counts and not the rate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure.
| if use_cache and (cached_result := ocr_cache.get(hash_key)): | ||
| logger.info(f"Cache Hit for OCR. Cache hit-rate is {ocr_cache.get_hit_rate()}") | ||
| hits, misses = ocr_cache.get_hit_info() | ||
| hit_rate = hits / (hits + misses) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
div by zero
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will use safediv here.
lib/sycamore/sycamore/utils/cache.py
Outdated
| if not cache_key: | ||
| raise ValueError("Missing cache key !!") | ||
| self.cache_key = cache_key | ||
| region_name = get_region_name() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if not region_name:
region_name = get_region_name()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
lib/sycamore/sycamore/utils/cache.py
Outdated
| def parse_path(path: str): | ||
| assert path.startswith("ddb://"), "DynamoDB cache paths must start with ddb://" | ||
|
|
||
| parts = path.split("/", 4) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the return value is cleaner if we do:
parts = path[6:].split("/", 2)
Note also that the 2 is one less than the (max) number of resulting elements.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
>>> "ddb://my-table"[6:].split("/", 2)
['my-table']
>>> "ddb://my-table//"[6:].split("/", 2)
['my-table', '', '']
>>> "ddb://my-table//cache-key"[6:].split("/", 2)
['my-table', '', 'cache-key']
>>> "ddb://my-table/us-east-1/cache-key"[6:].split("/", 2)
['my-table', 'us-east-1', 'cache-key']
| return region | ||
|
|
||
| # EC2 | ||
| with detected_region_lock: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs an import from botocore before the lock.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
lib/sycamore/sycamore/utils/cache.py
Outdated
|
|
||
| import diskcache | ||
| from botocore.exceptions import ClientError | ||
| from botocore.utils import InstanceMetadataRegionFetcher |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All this boto stuff needs to be lazy-imported so we don't take a hard dependency on AWS code. The S3 code was sloppy, but the fix is easy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
lib/sycamore/sycamore/utils/cache.py
Outdated
| import boto3 | ||
|
|
||
| super().__init__() | ||
| scheme, _, table_name, cache_key = self.parse_path(path) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should just return table_name, region_name, cache_key. The rest is dirty laundry.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
|
Found 3 test failures on Blacksmith runners: Failures
|
No description provided.