-
Notifications
You must be signed in to change notification settings - Fork 512
refactor: set storage_object on open of storages #993
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One more comment to the whole change. Seems to me that same thing is refereed under different names in different places. It would be nice to unify it and use only one name variant. For example one of following storage_object or storage_info or maybe storage_metadata since it is instance of _StorageMetadata.
src/crawlee/storages/_base.py
Outdated
|
|
||
| @property | ||
| @abstractmethod | ||
| def storage_object(self) -> _StorageMetadata: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is public property, that returns private type. I guess we should expose _StorageMetadata if we use it as type hint for public property now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add 3 simple test cases(one for each storage class) that will just call from_storage_object constructor and assert that the created object is as expected.
|
|
||
| @property | ||
| @abstractmethod | ||
| def storage_object(self) -> StorageMetadata: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't this be more precise?
| def storage_object(self) -> StorageMetadata: | |
| def storage_object(self) -> DatasetMetadata | KeyValueStoreMetadata | RequestQueueMetadata: |
and _StorageMetadata could remain private.
I am also ok with having StorageMetadata as a type alias for the union mentioned above. And the _StorageMetadata could be renamed to _CommonStorageMetadata, _BaseStorageMetadata, ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No issues here, just make sure that the new SDK works with this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On second thought, I'm OK with the common StorageMetadata model for every storage. LGTM.
### Description This PR follows up on #993. While aligning changes with the Apify SDK, I noticed that Pydantic models only select explicitly defined fields. However, we also need access to `urlSigningSecretKey`, so I’ve updated the `StorageMetadata` config **to allow extra fields.** ### Testing Added tests for all storage types (datasets, key-value stores, and request queues) to ensure extra attributes are correctly handled. ### Checklist - [x] CI passed
This PR is part of [Issue #19363](apify/apify-core#19363), which updates `KeyValueStore.getPublicUrl(recordKey)` to generate signed links using HMAC. **This PR**: - creates signature and appends it to the public URL of the record in KV store P.S. Before merging, we need to wait for the release of Crawlee, so we can update version of Crawlee. P.P.S It's hard to test the changes, since master branch of SDK and master branch of Crawlee are out of sync. Crawlee has some breaking changes in version 6.0, which are not yet addressed in master branch of SDK. Previous PR that adds storageObject to Crawlee Python is [here](apify/crawlee-python#993) Same PR in SDK JS is [here](apify/apify-sdk-js#358)

Description
Currently, when opening a KV store, dataset or request queue we only store the id and name from the get() response.
This PR updates the logic to store the entire storage_object, allowing access to additional attributes without requiring extra requests.
More context in slack
P.S. Python Crawlee storage constructors are a bit different than JS storage consturctors, that's why I've added storage_object to all of the classes - KV store, dataset & request queue
Similar PR in Crawlee JS
Testing
I've added one test, which checks that KV store class has storage_object after open.