Add a new cache implementation using DynamoDB #1470

austin-aryn-ai · 2025-09-19T16:02:42Z

No description provided.

alexaryn

I really like the modularity of this. We probably need a follow-up to make the whole file use the "bytes payload" semantics and to report hit rates.

alexaryn · 2025-09-23T21:00:45Z

lib/sycamore/sycamore/utils/cache.py

+        super().__init__()
+        region_name, table_name, hash_key_name = self.parse_path(path)
+        self.hash_key_name = hash_key_name if hash_key_name is not None else "hash_key"
+        self.dynamodb = boto3.resource("dynamodb", region_name=region_name)


Why use the resource API instead of the non-deprecated client API?

I didn't know. I definitely don't want to use it if it's deprecated! I will change it.

Perhaps "deprecated" isn't precisely the word:
boto/boto3#3563
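
For comparison, here is a minimal sketch of the same lookup through the low-level client API. With the client, attribute values are typed maps (`{"S": ...}`, `{"B": ...}`) rather than native Python values. The function name `get_payload`, the `payload` attribute, and the `StubClient` class are illustrative assumptions, not code from this PR; the client is passed in so the sketch runs without boto3 installed.

```python
from typing import Any, Optional


def get_payload(client: Any, table_name: str, key_name: str, key: str) -> Optional[bytes]:
    """Fetch the binary 'payload' attribute for `key`, or None on a miss.

    `client` is expected to behave like boto3.client("dynamodb").
    """
    res = client.get_item(TableName=table_name, Key={key_name: {"S": key}})
    item = res.get("Item")
    if not item:
        return None
    attr = item.get("payload")
    if not attr:
        return None
    # Binary attributes come back under the "B" type tag.
    return attr.get("B")


class StubClient:
    """Hypothetical stand-in for a DynamoDB client, for local experiments."""

    def __init__(self, rows: dict[str, bytes]):
        self.rows = rows

    def get_item(self, TableName: str, Key: dict[str, dict[str, str]]) -> dict[str, Any]:
        (typed_key,) = Key.values()
        value = self.rows.get(typed_key["S"])
        # A real client omits "Item" entirely when the key is absent.
        return {"Item": {"payload": {"B": value}}} if value is not None else {}
```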

alexaryn · 2025-09-23T21:02:06Z

lib/sycamore/sycamore/utils/cache.py

+
+        parts = path[6:].split("/")
+        if len(parts) < 2:
+            raise ValueError("DynamoDB cache paths must have 'region_name' (us-east-1, e.g.) and 'table_name'")


Why not just use the current region? Would this even work if the DDB region is different than the current region?

You can make cross-region calls. It is good to make the region explicit.

But would we ever want cross-region caching? Certainly not across GDPR boundaries. And even within the US, the network latency may be bad. I feel like this is a detail that's easy to get wrong and we should instead default it to the 99% right answer and encourage use of the default.

This is useful and we do it all the time. I run this code in us-west-1, but the table I use is typically in us-east-1.

Also, when you run this on a laptop, how do you get the current region? Via an environment variable?

In my opinion, we should design for deployment. What you're describing is development and I think we can use an environment variable for laptops. Obviously, development in EC2 should be able to use InstanceMetadataRegionFetcher. On my laptop, I have region set in .aws/config but we don't seem to have code to get that (yet).

This is an open source project and we can't assume deployment/use on EC2 (alone).

So you're saying we should require region here and do the defaulting in the server? I can buy that. And environment variables will work for laptop server development.
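
A rough sketch of the defaulting discussed above: an explicit region wins, and otherwise the caller falls back to environment variables. The helper name `resolve_region` is hypothetical, and checking `AWS_REGION` before `AWS_DEFAULT_REGION` is an assumption about the desired order, not something settled in this thread.

```python
import os
from typing import Mapping, Optional


def resolve_region(explicit: Optional[str] = None, env: Optional[Mapping[str, str]] = None) -> str:
    """Return the explicit region if given, else fall back to the environment."""
    if explicit:
        return explicit
    env = os.environ if env is None else env
    for name in ("AWS_REGION", "AWS_DEFAULT_REGION"):
        if value := env.get(name):
            return value
    raise ValueError("No region given and none found in AWS_REGION / AWS_DEFAULT_REGION")
```

Taking `env` as a parameter keeps the function testable on a laptop without touching the real environment.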

alexaryn · 2025-09-23T21:03:30Z

lib/sycamore/sycamore/utils/cache.py

 from botocore.exceptions import ClientError

 BLOCK_SIZE = 1048576  # 1 MiB
+PAGE_CACHE_TTL = timedelta(days=10)


This seems too specific for a file implementing general cache facilities. Can it move, or be renamed DDB_CACHE_TTL? Does it need a mechanism to remain in sync with the 10 * 24 * 3600 below?

Yes, wrong name.

Yes, I replaced 10 * 24 * 3600 with DDB_CACHE_TTL.

alexaryn · 2025-09-23T21:06:52Z

lib/sycamore/sycamore/utils/cache.py

+
+    ddb://<region_name>/<table_name>[/<hash_key_name>]
+
+    where 'hash_key_name' defaults to 'hash_key' if left unspecified.


Let's distinguish between the DDB concept "hash key" and the caching concept "cache key". Unless we're specifically dealing with DDB, I'd avoid "hash key". There's no actual requirement that the "cache key" be any sort of hash; it could be a JSON string.

This is DDB-specific. I don't think this introduces any confusion.

I was confused. I'm pretty sure in this case that the hash_key_name should be cache_key. After all, in a class named DynamoDbCache and a table named partitioner_page_cache, a column named cache_key would seem more expected. If it weren't a reserved word, I'd even suggest the concise key.

alexaryn · 2025-09-23T21:09:55Z

lib/sycamore/sycamore/utils/cache.py

+    def parse_path(path: str) -> tuple[str, str, Optional[str]]:
+        assert path.startswith("ddb://"), "DynamoDB cache paths must start with ddb://"
+
+        parts = path[6:].split("/")


Seems safer to do split("/", 2). Then, below, it can just be return tuple(parts). Or, you could decide that returning a list is OK.

alexaryn · 2025-09-23T21:10:59Z

lib/sycamore/sycamore/utils/cache.py

+        if len(parts) < 2:
+            raise ValueError("DynamoDB cache paths must have 'region_name' (us-east-1, e.g.) and 'table_name'")
+        if len(parts) == 2:
+            return parts[0], parts[1], None


return (*parts, None)

alexaryn · 2025-09-23T21:12:27Z

lib/sycamore/sycamore/utils/cache.py

+
+        return parts[0], parts[1], parts[2]
+
+    def get(self, hash_key: str):


Can we hint the payload as bytes?

alexaryn · 2025-09-23T21:21:58Z

lib/sycamore/sycamore/utils/cache.py

+        except ClientError as error:
+            logging.error(f"Error calling get_item({key}) on {self.table_name} : {error}")
+
+        self.total_accesses += 1


Often caches will account separately for hits and misses and only total them up when reporting the hit rate. This can be superior as it's fewer increments at probe-time and uses the integer range more frugally. If multiple threads will use the cache, these counters either need a mutex or thread-local approach.

All the cache implementations should have a get_hit_rate() -> float method. We should log this after every document we process, or every N pages, or whatever, in the partitioner.

get_hit_rate is in the parent class. It does rely on each implementation keeping track of hits and total.

Upon reflection, we should rip out the get_hit_rate() stuff entirely (while we still can) and replace with get_hit_info() that returns a tuple. Dawn-of-time hit rates are dumb, and can be easily calculated from hits and misses. A better option is get_metrics() which returns a metrics object which we can extend later.
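
One possible shape for the `get_metrics()` idea, as a sketch only. The names `CacheMetrics` and `CountingCache` are hypothetical; the point is that hits and misses are counted separately under a lock, and any rate is derived from a snapshot outside the lock.

```python
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheMetrics:
    """Immutable snapshot of the counters; can grow more fields later."""

    hits: int = 0
    misses: int = 0

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


class CountingCache:
    """Keeps only the bookkeeping a cache implementation would inherit."""

    def __init__(self) -> None:
        self._mutex = threading.Lock()
        self._hits = 0
        self._misses = 0

    def inc_hits(self) -> None:
        with self._mutex:
            self._hits += 1

    def inc_misses(self) -> None:
        with self._mutex:
            self._misses += 1

    def get_metrics(self) -> CacheMetrics:
        # Copy both counters under one lock so the snapshot is consistent.
        with self._mutex:
            return CacheMetrics(self._hits, self._misses)
```

Callers that want a rate for a recent window rather than a dawn-of-time average can subtract two snapshots.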

alexaryn · 2025-09-23T21:25:45Z

lib/sycamore/sycamore/utils/cache.py

+            logging.error(f"Error calling get_item({key}) on {self.table_name} : {error}")
+
+        self.total_accesses += 1
+        if res is not None and "Item" in res and "payload" in res["Item"]:


This is doing redundant lookups. Why not...

    if res:
        if item := res.get("Item"):
            if payload := item.get("payload"):
                with self.mutex:
                    self.hits += 1
                return payload.value
    with self.mutex:
        self.misses += 1
    return None

alexaryn · 2025-09-26T22:48:25Z

lib/sycamore/sycamore/utils/cache.py

 from botocore.exceptions import ClientError

 BLOCK_SIZE = 1048576  # 1 MiB
+DDB_CACHE_TTL: int = int(timedelta(days=10).total_seconds())


This is the only use of timedelta and it's rather obfuscated. Why not have one of these?

    DDB_CACHE_TTL = 10 * 86400  # 10 days in seconds
    DDB_CACHE_TTL = 10 * 24 * 60 * 60  # 10 days in seconds
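
A quick check that the spellings are interchangeable, so the choice is purely about readability. The constant names below are made up for the comparison.

```python
from datetime import timedelta

TTL_FROM_TIMEDELTA = int(timedelta(days=10).total_seconds())
TTL_FROM_ARITHMETIC = 10 * 24 * 60 * 60  # 10 days in seconds

# Both forms produce the same number of seconds.
assert TTL_FROM_TIMEDELTA == TTL_FROM_ARITHMETIC == 864_000
```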

alexaryn · 2025-09-26T22:49:35Z

lib/sycamore/sycamore/utils/cache.py

-        self.cache_hits = 0
-        self.total_accesses = 0
+        self.mutex = threading.Lock()
+        self.cache_hits: int = 0


Inside the Cache class, prefixing with "cache_" is probably not helpful.

alexaryn · 2025-09-26T23:13:40Z

lib/sycamore/sycamore/utils/cache.py

+        with self.mutex:
+            self.cache_misses += 1
+
+    def get_hit_rate(self) -> float:


This should just call get_hit_info() and then do the arithmetic outside the lock.

I'd just nuke this method in order to force people to avoid problematic dawn-of-time averages.

alexaryn · 2025-09-26T23:14:39Z

lib/sycamore/sycamore/utils/cache.py

-            self.cache_hits += 1
-        self.total_accesses += 1
+            self.inc_hits()
+            return v


Could use an else so there's one return.

alexaryn · 2025-09-29T21:35:43Z

lib/sycamore/sycamore/utils/cache.py

        return s3_cache_deserializer, (kwargs,)


+class DynamoDBCache(Cache):


Quibble about camel case. Should be DynamoDbCache to tokenize properly.

alexaryn · 2025-09-29T21:37:23Z

lib/sycamore/sycamore/utils/cache.py

+
+        super().__init__()
+        scheme, _, region_name, table_name, hash_key_name = self.parse_path(path)
+        self.hash_key_name = hash_key_name if hash_key_name is not None else "hash_key"


The more I think about it, the less I like the idea of a default here. Make them specify it. Especially if we're not going to default region, table name, etc.

alexaryn · 2025-09-29T21:44:18Z

lib/sycamore/sycamore/utils/cache.py

+    def parse_path(path: str):
+        assert path.startswith("ddb://"), "DynamoDB cache paths must start with ddb://"
+
+        parts = path.split("/", 5)


How many parts do we expect? I thought 4 or 5. If so, we should set max splits to 4, right? 4 splits yield 5 elements.

If 5 parts, I'd expect to pass 4 to split() and not care if the last component (cache_key) has slashes in it.

Changed to 4.

alexaryn · 2025-09-29T21:47:49Z

lib/sycamore/sycamore/utils/cache.py

+
+        return tuple(parts)
+
+    def get(self, hash_key: str) -> Optional[bytes]:


This should be cache_key. Somebody using the cache should not care about DDB data architecture. The same API should work for on-disk, S3, DDB, in-memory, or whatever cache variant they get from the factory. And there's no requirement that the key be a hash.

I will change it to 'key'.

key is good, except with DDB where it's a reserved word and causes problems.

alexaryn · 2025-09-29T21:50:07Z

lib/sycamore/sycamore/utils/cache.py

+
+    def get(self, hash_key: str) -> Optional[bytes]:
+        key = {self.hash_key_name: hash_key}
+        res: dict[Any, Any] = {}


Can we narrow it down more, like dict[str, Any]? Also, it looks like it can also be None. In fact, it's probably better to initialize it to None.

alexaryn

I wonder if we want to allow people to cache None? If so, we could create a special CacheMiss class/object we could return. Under the covers, I don't think all storage layers can distinguish None from a zero-byte payload.

Can you also add NullCache here? It should be trivial. Having it in the factory as null:// will help with test setups.

This is going to be great.
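
To make the two suggestions concrete, here is a sketch of a miss sentinel and a do-nothing cache. The names `CacheMiss`, `CACHE_MISS`, and this `NullCache` body are illustrative assumptions; the version that eventually lands may look different.

```python
from typing import Optional, Union


class CacheMiss:
    """Hypothetical sentinel type: 'not cached', as opposed to a cached None."""

    def __repr__(self) -> str:
        return "CACHE_MISS"


CACHE_MISS = CacheMiss()


class NullCache:
    """A cache that stores nothing, so every lookup is a miss."""

    def __init__(self) -> None:
        self.hits = 0
        self.misses = 0

    def get(self, key: str) -> Union[bytes, None, CacheMiss]:
        self.misses += 1
        return CACHE_MISS

    def set(self, key: str, value: Optional[bytes]) -> None:
        pass  # intentionally discard the value

    def get_hit_info(self) -> tuple[int, int]:
        return self.hits, self.misses
```

Callers would compare with `is CACHE_MISS`, which leaves `None` free to be a legitimate cached value where the storage layer can represent it.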

alexaryn · 2025-10-02T21:48:42Z

lib/sycamore/sycamore/utils/cache.py

+            self.misses += 1
+
+    def get_hit_rate(self) -> float:
+        with self.mutex:


This is a redundant level of locking. With some types of locks, it will result in a deadlock.

I'll remove it from the interface.
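
A small demonstration of the deadlock risk, using non-blocking acquires so it cannot actually hang. `threading.Lock` is not reentrant, so a second blocking acquire from the same thread would wait forever; `threading.RLock` allows it.

```python
import threading

plain = threading.Lock()
with plain:
    # A blocking acquire here would hang forever; a non-blocking one just fails.
    reacquired = plain.acquire(blocking=False)

reentrant = threading.RLock()
with reentrant:
    # The owning thread may acquire an RLock again.
    nested = reentrant.acquire(blocking=False)
    if nested:
        reentrant.release()

print(reacquired, nested)
```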

alexaryn · 2025-10-02T22:33:16Z

lib/sycamore/sycamore/utils/cache.py

+
+    def get(self, key: str) -> Optional[bytes]:
+        key = {self.cache_key: key}
+        res: dict[str, Any] = {}


This must be Optional[dict[str, Any]] based on later code. If so, I'd avoid creating a useless dict and initialize to None.

alexaryn · 2025-10-02T22:36:30Z

lib/sycamore/sycamore/utils/cache.py

+        return None
+
+    def set(self, key: str, value: bytes):
+        ttl = int(time.time()) + self.ttl


Adding "now" to TTL doesn't result in "time to live"; it results in "expiration". Change variable name?
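
A sketch of the naming being asked for: the TTL stays a duration, and the value written to the table is an absolute expiration time in epoch seconds. The helper `build_item` and the attribute name `expires_at` are hypothetical; the `now` parameter exists only to make the function testable.

```python
import time
from typing import Any, Optional

DDB_CACHE_TTL = 10 * 24 * 60 * 60  # time to live: a duration, in seconds


def build_item(
    key_name: str,
    key: str,
    value: bytes,
    ttl: int = DDB_CACHE_TTL,
    now: Optional[float] = None,
) -> dict[str, Any]:
    """Build the item to store, with an absolute expiration timestamp."""
    now = time.time() if now is None else now
    expires_at = int(now) + ttl  # absolute epoch seconds, not a duration
    return {key_name: key, "payload": value, "expires_at": expires_at}
```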

alexaryn

In light of recent discussion, if you want to get rid of the region-defaulting stuff and make it the responsibility of whoever sets up the cache, that would be fine, too. Less AWS creep.

alexaryn · 2025-10-03T03:37:43Z

lib/sycamore/sycamore/tests/unit/llms/test_bedrock.py

-        assert cache.total_accesses == 0
+        assert cache.hits == 0
+        hits, misses = cache.get_hit_info()
+        assert hits + misses == 0


We should just have:

    hits, misses = cache.get_hit_info()
    assert hits == 0
    assert misses == 0

or

assert cache.get_hit_info() == (0, 0)

which doesn't make clear which number is which.

alexaryn · 2025-10-03T03:38:36Z

lib/sycamore/sycamore/tests/unit/llms/test_bedrock.py

-        assert cache.total_accesses == 1
+        assert cache.hits == 0
+        hits, misses = cache.get_hit_info()
+        assert hits + misses == 1


Should be:

    hits, misses = cache.get_hit_info()
    assert hits == 0
    assert misses == 1

alexaryn · 2025-10-03T03:40:42Z

lib/sycamore/sycamore/tests/unit/llms/test_bedrock.py

-        assert cache.total_accesses == 2
+        assert cache.hits == 1
+        hits, misses = cache.get_hit_info()
+        assert hits + misses == 2


    hits, misses = cache.get_hit_info()
    assert hits == 1
    assert misses == 1

alexaryn · 2025-10-03T03:46:04Z

lib/sycamore/sycamore/transforms/detr_partitioner.py

            if cached_layout:
-                logger.info(f"Cache Hit for ImageToJson. Cache hit-rate is {self.cache.get_hit_rate()}")
+                hits, misses = self.cache.get_hit_info()
+                hit_rate = hits / (hits + misses)


Since this comes up a lot, we probably want to provide a safediv() function somewhere, maybe even in cache.py.

    def safediv(n, d):
        return n / d if d else 0

Alternately, we just change these outputs to print the two counts and not the rate.
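
Both options can be combined: log the raw counts and a guarded rate. This is a sketch; `format_hit_info` is a hypothetical helper, not something in cache.py.

```python
def safediv(n: float, d: float) -> float:
    """Return n / d, or 0 when d is zero."""
    return n / d if d else 0


def format_hit_info(hits: int, misses: int) -> str:
    """Render the two counts plus a rate that cannot raise ZeroDivisionError."""
    rate = safediv(hits, hits + misses)
    return f"hits={hits} misses={misses} hit_rate={rate:.2f}"
```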

alexaryn · 2025-10-03T03:48:05Z

lib/sycamore/sycamore/transforms/text_extraction/ocr_models.py

        if use_cache and (cached_result := ocr_cache.get(hash_key)):
-            logger.info(f"Cache Hit for OCR. Cache hit-rate is {ocr_cache.get_hit_rate()}")
+            hits, misses = ocr_cache.get_hit_info()
+            hit_rate = hits / (hits + misses)


div by zero

will use safediv here.

alexaryn · 2025-10-03T04:08:22Z

lib/sycamore/sycamore/utils/cache.py

+        if not cache_key:
+            raise ValueError("Missing cache key !!")
+        self.cache_key = cache_key
+        region_name = get_region_name()


    if not region_name:
        region_name = get_region_name()

alexaryn · 2025-10-03T04:11:12Z

lib/sycamore/sycamore/utils/cache.py

+    def parse_path(path: str):
+        assert path.startswith("ddb://"), "DynamoDB cache paths must start with ddb://"
+
+        parts = path.split("/", 4)


I think the return value is cleaner if we do:

parts = path[6:].split("/", 2)

Note also that the 2 is one less than the (max) number of resulting elements.

    >>> "ddb://my-table"[6:].split("/", 2)
    ['my-table']
    >>> "ddb://my-table//"[6:].split("/", 2)
    ['my-table', '', '']
    >>> "ddb://my-table//cache-key"[6:].split("/", 2)
    ['my-table', '', 'cache-key']
    >>> "ddb://my-table/us-east-1/cache-key"[6:].split("/", 2)
    ['my-table', 'us-east-1', 'cache-key']
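
Building on the examples above, a sketch of what `parse_path` could look like. The `ddb://<table_name>[/<region_name>[/<cache_key>]]` ordering is inferred from those examples, and padding missing components, mapping empty ones to `None`, and raising `ValueError` instead of asserting are assumptions, not the PR's final code.

```python
from typing import Optional

SCHEME = "ddb://"


def parse_path(path: str) -> tuple[str, Optional[str], Optional[str]]:
    """Split a ddb:// path into (table_name, region_name, cache_key)."""
    if not path.startswith(SCHEME):
        raise ValueError("DynamoDB cache paths must start with ddb://")
    parts = path[len(SCHEME):].split("/", 2)  # maxsplit=2 gives at most 3 elements
    parts += [""] * (3 - len(parts))  # pad missing trailing components
    table_name, region_name, cache_key = parts
    if not table_name:
        raise ValueError("DynamoDB cache paths must include a table name")
    # Empty components become None so callers can apply their own defaults.
    return table_name, region_name or None, cache_key or None
```

Because the split is capped, any extra slashes stay in the last component.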

alexaryn · 2025-10-03T04:13:01Z

lib/sycamore/sycamore/utils/cache.py

+        return region
+
+    # EC2
+    with detected_region_lock:


Needs an import from botocore before the lock.

alexaryn · 2025-10-03T04:14:12Z

lib/sycamore/sycamore/utils/cache.py


 import diskcache
 from botocore.exceptions import ClientError
+from botocore.utils import InstanceMetadataRegionFetcher


All this boto stuff needs to be lazy-imported so we don't take a hard dependency on AWS code. The S3 code was sloppy, but the fix is easy.
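
A sketch of the lazy-import pattern being requested: the optional dependency is imported when a cache is constructed, not when the module is loaded. `require_optional` and `LazyBackedCache` are hypothetical names, and the module name is a parameter only so the example can run without boto3 installed.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, feature: str) -> ModuleType:
    """Import an optional dependency, with a clearer error if it is missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(f"{feature} needs the optional dependency '{module_name}'") from exc


class LazyBackedCache:
    def __init__(self, module_name: str = "boto3") -> None:
        # Imported on construction, so importing this file never requires boto3.
        self._sdk = require_optional(module_name, "LazyBackedCache")
```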

alexaryn · 2025-10-03T04:17:28Z

lib/sycamore/sycamore/utils/cache.py

+        import boto3
+
+        super().__init__()
+        scheme, _, table_name, cache_key = self.parse_path(path)


This should just return table_name, region_name, cache_key. The rest is dirty laundry.

blacksmith-sh · 2025-11-12T21:47:01Z

Found 3 test failures on Blacksmith runners:

Failures

`test_data_extraction/` `test_extract_properties_from_dict_schema[llm1] - anthropic.NotFoundError:`
`test_data_extraction/` `test_extract_properties_from_schema[llm1] - anthropic.NotFoundError:`
`test_summarize_images/test_summarize_images_anthropic_claude - anthropic.NotFoundError:`

Add a new cache implementation using DynamoDB

bcf161a

austin-aryn-ai requested a review from alexaryn September 19, 2025 16:02

austin-aryn-ai added 3 commits September 19, 2025 11:40

Fix mypy

1271133

Fix mypy

bcb2d9d

Handle binary return type

840eb0c

alexaryn reviewed Sep 23, 2025

austin-aryn-ai added 3 commits September 23, 2025 16:26

Address comments

56d02ad

Fix bugs

47a34b6

Allow cache to return hits and misses

1e3fa5b

alexaryn reviewed Sep 29, 2025

austin-aryn-ai added 2 commits September 29, 2025 21:46

Address more comments

9812ff9

Remove 'region_name' from ddb cache path

9ac50df

alexaryn reviewed Oct 2, 2025

austin-aryn-ai added 2 commits October 2, 2025 16:29

Address comments; fix lint; fix ut

84403b9

More bug fixes

975f90b

alexaryn reviewed Oct 3, 2025

Address comments

2349d35

alexaryn mentioned this pull request Oct 10, 2025

Sycamore: disable caching via NullCache. #1493

Merged

austin-aryn-ai added 2 commits November 12, 2025 09:29

Merge branch 'main' into austin/ddb-cache

cd45b94

Fix lint

50abfa4

