Skip to content

Conversation

@austin-aryn-ai
Copy link
Contributor

No description provided.

Copy link
Collaborator

@alexaryn alexaryn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like the modularity of this. We probably need a follow-up to make the whole file use the "bytes payload" semantics and to report hit rates.

super().__init__()
region_name, table_name, hash_key_name = self.parse_path(path)
self.hash_key_name = hash_key_name if hash_key_name is not None else "hash_key"
self.dynamodb = boto3.resource("dynamodb", region_name=region_name)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why use the resource API instead of the non-deprecated client API?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't know. I definitely don't want to use it if it's deprecated! I will change it.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps "deprecated" isn't precisely the word:
boto/boto3#3563


parts = path[6:].split("/")
if len(parts) < 2:
raise ValueError("DynamoDB cache paths must have 'region_name' (us-east-1, e.g.) and 'table_name'")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just use the current region? Would this even work if the DDB region is different than the current region?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can make cross-region calls. It is good to make the region explicit.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But would we ever want cross-region caching? Certainly not across GDPR boundaries. And even within the US, the network latency may be bad. I feel like this is a detail that's easy to get wrong and we should instead default it to the 99% right answer and encourage use of the default.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is useful and we do it all the time. I run this code in us-west-1, but the table I use is typically in us-east-1.

Also, when you run this on a laptop, how do you get the current region? Via an environment variable?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my opinion, we should design for deployment. What you're describing is development and I think we can use an environment variable for laptops. Obviously, development in EC2 should be able to use InstanceMetadataRegionFetcher. On my laptop, I have region set in .aws/config but we don't seem to have code to get that (yet).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an open source project and we can't assume deployment/use on EC2 (alone).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So you're saying we should require region here and do the defaulting in the server? I can buy that. And environment variables will work for laptop server development.

from botocore.exceptions import ClientError

BLOCK_SIZE = 1048576 # 1 MiB
PAGE_CACHE_TTL = timedelta(days=10)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems too specific for a file implementing general cache facilities. Can it move, or be renamed DDB_CACHE_TTL? Does it need a mechanism to remain in sync with the 10 * 24 * 3600 below?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, wrong name.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I replaced 10243600 with DDB_CACHE_TTL.

ddb://<region_name>/<table_name>[/<hash_key_name>]
where 'hash_key_name' defaults to 'hash_key' if left unspecified.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's distinguish between the DDB concept "hash key" and the caching concept "cache key". Unless we're specifically dealing with DDB, I'd avoid "hash key". There's no actual requirement that the "cache key" be any sort of hash; it could be a JSON string.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is DDB-specific. I don't think this introduces any confusion.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was confused. I'm pretty sure in this case that the hash_key_name should be cache_key. After all, in a class named DynamoDbCache and a table named partitioner_page_cache, a column named cache_key would seem more expected. If it weren't a reserved word, I'd even suggest the concise key.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

def parse_path(path: str) -> tuple[str, str, Optional[str]]:
assert path.startswith("ddb://"), "DynamoDB cache paths must start with ddb://"

parts = path[6:].split("/")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems safer to do split("/", 2). Then, below, it can just be return tuple(parts). Or, you could decide that returning a list is OK.

if len(parts) < 2:
raise ValueError("DynamoDB cache paths must have 'region_name' (us-east-1, e.g.) and 'table_name'")
if len(parts) == 2:
return parts[0], parts[1], None
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return (*parts, None)


return parts[0], parts[1], parts[2]

def get(self, hash_key: str):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we hint the payload as bytes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

except ClientError as error:
logging.error(f"Error calling get_item({key}) on {self.table_name} : {error}")

self.total_accesses += 1
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Often caches will account separately for hits and misses and only total them up when reporting the hit rate. This can be superior as it's fewer increments at probe-time and uses the integer range more frugally. If multiple threads will use the cache, these counters either need a mutex or thread-local approach.

All the cache implementations should have a get_hit_rate() -> float method. We should log this after every document we process, or every N pages, or whatever, in the partitioner.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get_hit_rate is in the parent class. It does rely on each implementation keeping track of hits and total.

Copy link
Collaborator

@alexaryn alexaryn Oct 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Upon reflection, we should rip out the get_hit_rate() stuff entirely (while we still can) and replace with get_hit_info() that returns a tuple. Dawn-of-time hit rates are dumb, and can be easily calculated from hits and misses. A better option is get_metrics() which returns a metrics object which we can extend later.

logging.error(f"Error calling get_item({key}) on {self.table_name} : {error}")

self.total_accesses += 1
if res is not None and "Item" in res and "payload" in res["Item"]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is doing redundant lookups. Why not...

if res:
    if item := res.get("Item"):
        if payload := item.get("payload"):
            with self.mutex:
                self.hits += 1
            return payload.value
with self.mutex:
    self.misses += 1
return None

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

from botocore.exceptions import ClientError

BLOCK_SIZE = 1048576 # 1 MiB
DDB_CACHE_TTL: int = int(timedelta(days=10).total_seconds())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the only use of timedelta and it's rather obfuscated. Why not have one of these?

DDB_CACHE_TTL = 10 * 86400  # 10 days in seconds
DDB_CACHE_TTL = 10 * 24 * 60 * 60  # 10 days in seconds

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

self.cache_hits = 0
self.total_accesses = 0
self.mutex = threading.Lock()
self.cache_hits: int = 0
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inside the Cache class, prefixing with "cache_" is probably not helpful.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

with self.mutex:
self.cache_misses += 1

def get_hit_rate(self) -> float:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should just call get_hit_info() and then do the arithmetic outside the lock.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd just nuke this method in order to force people to avoid problematic dawn-of-time averages.

self.cache_hits += 1
self.total_accesses += 1
self.inc_hits()
return v
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could use an else so there's one return.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got it.

super().__init__()
region_name, table_name, hash_key_name = self.parse_path(path)
self.hash_key_name = hash_key_name if hash_key_name is not None else "hash_key"
self.dynamodb = boto3.resource("dynamodb", region_name=region_name)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps "deprecated" isn't precisely the word:
boto/boto3#3563

return s3_cache_deserializer, (kwargs,)


class DynamoDBCache(Cache):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quibble about camel case. Should be DynamoDbCache to tokenize properly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure


super().__init__()
scheme, _, region_name, table_name, hash_key_name = self.parse_path(path)
self.hash_key_name = hash_key_name if hash_key_name is not None else "hash_key"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The more I think about it, the less I like the idea of a default here. Make them specify it. Especially if we're not going to default region, table name, etc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

def parse_path(path: str):
assert path.startswith("ddb://"), "DynamoDB cache paths must start with ddb://"

parts = path.split("/", 5)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How many parts do we expect. I thought 4 or 5. If so, we should set max splits to 4, right? 4 splits yields 5 elements.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

5

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If 5 parts, I'd expect to pass 4 to split() and not care if the last component (cache_key) has slashes in it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to 4.


return tuple(parts)

def get(self, hash_key: str) -> Optional[bytes]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be cache_key. Somebody using the cache should not care about DDB data architecture. The same API should work for on-disk, S3, DDB, in-memory, or whatever cache variant they get from the factory. And there's no requirement that the key be a hash.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i will change it to 'key'.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

key is good, except with DDB where it's a reserved word and causes problems.


def get(self, hash_key: str) -> Optional[bytes]:
key = {self.hash_key_name: hash_key}
res: dict[Any, Any] = {}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we narrow it down more, like dict[str, Any]? Also, it looks like it can also be None. In fact, it's probably better to initialize it to None.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, str.

Copy link
Collaborator

@alexaryn alexaryn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we want to allow people to cache None? If so, we could create a special CacheMiss class/object we could return. Under the covers, I don't think all storage layers can distinguish None from a zero-byte payload.

Can you also add NullCache here? It should be trivial. Having it in the factory as null:// will help with test setups.

This is going to be great.

self.misses += 1

def get_hit_rate(self) -> float:
with self.mutex:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a redundant level of locking. With some types of locks, it will result in a deadlock.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll remove it from the interface.

with self.mutex:
self.cache_misses += 1

def get_hit_rate(self) -> float:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd just nuke this method in order to force people to avoid problematic dawn-of-time averages.

def parse_path(path: str):
assert path.startswith("ddb://"), "DynamoDB cache paths must start with ddb://"

parts = path.split("/", 5)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If 5 parts, I'd expect to pass 4 to split() and not care if the last component (cache_key) has slashes in it.


def get(self, key: str) -> Optional[bytes]:
key = {self.cache_key: key}
res: dict[str, Any] = {}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This must be Optional[dict[str, any]] based on later code. if so, I'd avoid creating a useless dict and initialize to None.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

return None

def set(self, key: str, value: bytes):
ttl = int(time.time()) + self.ttl
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding "now" to TTL, doesn't result in "time to live"; it results in "expiration". Change variable name?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok.

Copy link
Collaborator

@alexaryn alexaryn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In light of recent discussion, if you want to get rid of the region-defaulting stuff and make it the responsibility of whoever sets up the cache, that would be fine, too. Less AWS creep.

assert cache.total_accesses == 0
assert cache.hits == 0
hits, misses = cache.get_hit_info()
assert hits + misses == 0
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should just have:

hits, misses = cache.get_hit_info()
assert hits == 0
assert misses == 0

or

assert cache.get_hit_info() == (0, 0)

which doesn't make clear which number is which.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

assert cache.total_accesses == 1
assert cache.hits == 0
hits, misses = cache.get_hit_info()
assert hits + misses == 1
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be:

hits, misses = cache.get_hit_info()
assert hits == 0
assert misses == 1

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

assert cache.total_accesses == 2
assert cache.hits == 1
hits, misses = cache.get_hit_info()
assert hits + misses == 2
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hits, misses = cache.get_hit_info()
assert hits == 1
assert misses == 1

if cached_layout:
logger.info(f"Cache Hit for ImageToJson. Cache hit-rate is {self.cache.get_hit_rate()}")
hits, misses = self.cache.get_hit_info()
hit_rate = hits / (hits + misses)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this comes up a lot, we probably want to provide a safediv() function somewhere, maybe even in cache.py.

def safediv(n, d):
    return n / d if d else 0

Alternately, we just change these outputs to print the two counts and not the rate.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure.

if use_cache and (cached_result := ocr_cache.get(hash_key)):
logger.info(f"Cache Hit for OCR. Cache hit-rate is {ocr_cache.get_hit_rate()}")
hits, misses = ocr_cache.get_hit_info()
hit_rate = hits / (hits + misses)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

div by zero

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will use safediv here.

if not cache_key:
raise ValueError("Missing cache key !!")
self.cache_key = cache_key
region_name = get_region_name()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if not region_name:
    region_name = get_region_name()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

def parse_path(path: str):
assert path.startswith("ddb://"), "DynamoDB cache paths must start with ddb://"

parts = path.split("/", 4)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the return value is cleaner if we do:

parts = path[6:].split("/", 2)

Note also that the 2 is one less than the (max) number of resulting elements.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

>>> "ddb://my-table"[6:].split("/", 2)
['my-table']
>>> "ddb://my-table//"[6:].split("/", 2)
['my-table', '', '']
>>> "ddb://my-table//cache-key"[6:].split("/", 2)
['my-table', '', 'cache-key']
>>> "ddb://my-table/us-east-1/cache-key"[6:].split("/", 2)
['my-table', 'us-east-1', 'cache-key']

return region

# EC2
with detected_region_lock:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs an import from botocore before the lock.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok


import diskcache
from botocore.exceptions import ClientError
from botocore.utils import InstanceMetadataRegionFetcher
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All this boto stuff needs to be lazy-imported so we don't take a hard dependency on AWS code. The S3 code was sloppy, but the fix is easy.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

import boto3

super().__init__()
scheme, _, table_name, cache_key = self.parse_path(path)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should just return table_name, region_name, cache_key. The rest is dirty laundry.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

@blacksmith-sh
Copy link
Contributor

blacksmith-sh bot commented Nov 12, 2025

Found 3 test failures on Blacksmith runners:

Failures

Test View Logs
test_data_extraction/
test_extract_properties_from_dict_schema[llm1] - anthropic.NotFoundError:
View Logs
test_data_extraction/
test_extract_properties_from_schema[llm1] - anthropic.NotFoundError:
View Logs
test_summarize_images/test_summarize_images_anthropic_claude - anthropic.NotFoundError: View Logs


Fix in Cursor

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants