-
-
Notifications
You must be signed in to change notification settings - Fork 19
Significantly enhance the safety of metadata manipulation #217
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## data_src #217 +/- ##
==========================================
Coverage 100.00% 100.00%
==========================================
Files 38 38
Lines 2224 2278 +54
Branches 426 435 +9
==========================================
+ Hits 2224 2278 +54 ☔ View full report in Codecov by Sentry. |
- move constants and validation logic from creator.py and constants.py to zim/metadata.py - add way more typing around metadata manipulation - add new class for standard metadata, and alias for value differentiating raw and clean metadata - make config_metadata much stricter around standard metadata - add config_any_metadata for those need to add custom ones or wanting to tinker with standard ones
b718c06
to
0e764ed
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this ; and thank you for the extensive tests.
I like the approach but I'm not a fan of extending the API with config_any_metadata()
. I dont have much argument beside it being another piece of API to maintain and the relative confusion it introduces: one has to dig into docstring to understand what to use.
I also wonder if we need all this flexibility: Metadata, list of metadata, dict of key/value pairs ?
Overall it's great
src/zimscraperlib/zim/metadata.py
Outdated
r"^Illustration_(?P<height>\d+)x(?P<width>\d+)@(?P<scale>\d+)$" | ||
) | ||
|
||
RawMetadataValue = ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Naming is misleading. It's not a raw metadata value, it's a list of allowed types for metadata values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What would you recommend? The fact that it is a type and not a value itself should not be reflected in the name (we do not name all types ...Type
). Would it be better to use ScraperMetadataValue
here instead of RawMetadataValue
, and LibzimMetadataValue
(or just bytes
) instead of CleanMetadataValue
? I think it is important to indicate this type is only for the metadata value, and not the whole metadata (name + value). We could as well use Content
instead of Value
if it helps.
src/zimscraperlib/zim/metadata.py
Outdated
None | float | int | bytes | str | datetime.datetime | datetime.date | Iterable[str] | ||
) | ||
|
||
CleanMetadataValue = bytes | str |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
idem
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't we drop the str
here? Has this is supposed to be the output of the auto convert, we could always provide the native type which is bytes
. Saves one conversion on pylibzim.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using only bytes
is going to simplify stuff at some places, but be more complex at other places. The difficulty I see is that when we validate again metadata at creator start (but also at other places), we check that all "text" metadata are indeed str
. If they have already been converted to bytes
, we cannot check this anymore. But we can check it is possible to decode them in UTF-8, which in fact is more correct. One could want to add all "text" metadata already as encoded UTF-8 bytes ... and currently the scraperlib does not allows it despite this being correct from specification standpoint. It means a bit more decoding at some places (all validate_xxx
methods basically), but it is not like these methods are intensively called, so we probably do not mind the extra CPU cycles.
I proposed to move forward for clarity and exactitude. OK for you?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes
src/zimscraperlib/zim/creator.py
Outdated
@@ -145,26 +144,42 @@ def __init__( | |||
self.ignore_duplicates = ignore_duplicates | |||
self.disable_metadata_checks = disable_metadata_checks | |||
|
|||
T = TypeVar("T", str, bytes) | |||
|
|||
def get_metadata( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the use for this? That seems superflous from a Creator user perspective. Could be an out-of-API function for tests though.
src/zimscraperlib/zim/creator.py
Outdated
if isinstance(metadata, Metadata): | ||
metadata = [metadata] | ||
if isinstance(metadata, dict): | ||
if {type(name) for name in metadata.keys()} == {str}: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find the following a lot easier to understand
if {type(name) for name in metadata.keys()} == {str}: | |
if all(type(name) == str for name in metadata.keys()): |
src/zimscraperlib/zim/creator.py
Outdated
|
||
language = self._metadata.get("Language", "").split(",") | ||
language = self.get_metadata("Language", str, "").split(",") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be handled earlier?
src/zimscraperlib/zim/creator.py
Outdated
|
||
# probably not needed if only our methods have been used, but who knows |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well it's still possible to manipulate the .metadata
dict directly so it's good to keep
@@ -20,39 +18,3 @@ | |||
|
|||
# list of mimetypes we consider articles using it should default to FRONT_ARTICLE | |||
FRONT_ARTICLE_MIMETYPES = ["text/html"] | |||
|
|||
# list of mandatory meta tags of the zim file. | |||
MANDATORY_ZIM_METADATA_KEYS = [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is wise. I understand you want to keep metadata-related stuff together but the point of the constants module is for the developer to act as a config file. If some are spread into different places, then the existence of this file is in jeopardy.
Either we think it makes sense as it previously was or we harmonize and move other constants out as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it
Indeed, it is "a shame" to have two APIs where one with two arguments would suffice. I already simplified a lot compared to original code I wrote but I have to admit I could go even one step further.
I'm a bit uncertain on this. dict of key/value pairs is nice for its brevity ... but it's poorly typed, and the past has shown that typed objects might help. At the same time, here we really have no idea of what these metadata can be, so a dict is probably sufficient and the Metadata object does not bring much clarity. |
Yes, I want to say it's easier to extend later when there's a need than to maintain potential use cases but it's really up to you. It's not not a big deal either way |
1ed91c7
to
9c611e2
Compare
Superseeded by #221 |
Fix #205
Changes
config_metadata
much stricter around standard metadataStandardMetadata
class for standard metadataRawMetadataValue
,CleanMetadataValue
andMetadata
config_any_metadata
for those who need to add custom onesadd_metadata
does same cleanup / checks asconfig_metadata
andconfig_any_metadata
creator.py
:MANDATORY_ZIM_METADATA_KEYS
,DEFAULT_DEV_ZIM_METADATA
,RECOMMENDED_MAX_TITLE_LENGTH
,MAXIMUM_DESCRIPTION_METADATA_LENGTH
,MAXIMUM_LONG_DESCRIPTION_METADATA_LENGTH
,ILLUSTRATIONS_METADATA_RE
creator.py
:convert_and_check_metadata