Skip to content

Significantly enhance the safety of metadata manipulation #217

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from

Conversation

benoit74
Copy link
Collaborator

@benoit74 benoit74 commented Nov 15, 2024

Fix #205

Changes

  • make config_metadata much stricter around standard metadata
    • add new StandardMetadata class for standard metadata
    • add way more typing around metadata manipulation with RawMetadataValue, CleanMetadataValue and Metadata
  • add config_any_metadata for those who need to add custom ones
  • systematically cleanup / convert / check all metadata values for our constraints and conventions
  • ensure that add_metadata does same cleanup / checks as config_metadata and config_any_metadata
  • move constants and validation logic for metadata to zim/metadata.py
    • from creator.py: MANDATORY_ZIM_METADATA_KEYS, DEFAULT_DEV_ZIM_METADATA, RECOMMENDED_MAX_TITLE_LENGTH, MAXIMUM_DESCRIPTION_METADATA_LENGTH, MAXIMUM_LONG_DESCRIPTION_METADATA_LENGTH, ILLUSTRATIONS_METADATA_RE
    • from creator.py: convert_and_check_metadata

@benoit74 benoit74 self-assigned this Nov 15, 2024
Copy link

codecov bot commented Nov 15, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 100.00%. Comparing base (7d1b958) to head (0e764ed).

Additional details and impacted files
@@            Coverage Diff             @@
##           data_src      #217   +/-   ##
==========================================
  Coverage    100.00%   100.00%           
==========================================
  Files            38        38           
  Lines          2224      2278   +54     
  Branches        426       435    +9     
==========================================
+ Hits           2224      2278   +54     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@benoit74 benoit74 requested a review from rgaudin November 15, 2024 15:42
- move constants and validation logic from creator.py and constants.py
  to zim/metadata.py
- add way more typing around metadata manipulation
- add new class for standard metadata, and alias for value
  differentiating raw and clean metadata
- make config_metadata much stricter around standard metadata
- add config_any_metadata for those need to add custom ones or wanting
  to tinker with standard ones
@benoit74 benoit74 marked this pull request as ready for review November 15, 2024 15:44
Copy link
Member

@rgaudin rgaudin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this ; and thank you for the extensive tests.

I like the approach but I'm not a fan of extending the API with config_any_metadata(). I dont have much argument beside it being another piece of API to maintain and the relative confusion it introduces: one has to dig into docstring to understand what to use.

I also wonder if we need all this flexibility: Metadata, list of metadata, dict of key/value pairs ?

Overall it's great

r"^Illustration_(?P<height>\d+)x(?P<width>\d+)@(?P<scale>\d+)$"
)

RawMetadataValue = (
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Naming is misleading. It's not a raw metadata value, it's a list of allowed types for metadata values.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would you recommend? The fact that it is a type and not a value itself should not be reflected in the name (we do not name all types ...Type). Would it be better to use ScraperMetadataValue here instead of RawMetadataValue, and LibzimMetadataValue (or just bytes) instead of CleanMetadataValue? I think it is important to indicate this type is only for the metadata value, and not the whole metadata (name + value). We could as well use Content instead of Value if it helps.

None | float | int | bytes | str | datetime.datetime | datetime.date | Iterable[str]
)

CleanMetadataValue = bytes | str
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

idem

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we drop the str here? Has this is supposed to be the output of the auto convert, we could always provide the native type which is bytes. Saves one conversion on pylibzim.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using only bytes is going to simplify stuff at some places, but be more complex at other places. The difficulty I see is that when we validate again metadata at creator start (but also at other places), we check that all "text" metadata are indeed str. If they have already been converted to bytes, we cannot check this anymore. But we can check it is possible to decode them in UTF-8, which in fact is more correct. One could want to add all "text" metadata already as encoded UTF-8 bytes ... and currently the scraperlib does not allows it despite this being correct from specification standpoint. It means a bit more decoding at some places (all validate_xxx methods basically), but it is not like these methods are intensively called, so we probably do not mind the extra CPU cycles.

I proposed to move forward for clarity and exactitude. OK for you?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

@@ -145,26 +144,42 @@ def __init__(
self.ignore_duplicates = ignore_duplicates
self.disable_metadata_checks = disable_metadata_checks

T = TypeVar("T", str, bytes)

def get_metadata(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the use for this? That seems superflous from a Creator user perspective. Could be an out-of-API function for tests though.

if isinstance(metadata, Metadata):
metadata = [metadata]
if isinstance(metadata, dict):
if {type(name) for name in metadata.keys()} == {str}:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find the following a lot easier to understand

Suggested change
if {type(name) for name in metadata.keys()} == {str}:
if all(type(name) == str for name in metadata.keys()):


language = self._metadata.get("Language", "").split(",")
language = self.get_metadata("Language", str, "").split(",")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be handled earlier?


# probably not needed if only our methods have been used, but who knows
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well it's still possible to manipulate the .metadata dict directly so it's good to keep

@@ -20,39 +18,3 @@

# list of mimetypes we consider articles using it should default to FRONT_ARTICLE
FRONT_ARTICLE_MIMETYPES = ["text/html"]

# list of mandatory meta tags of the zim file.
MANDATORY_ZIM_METADATA_KEYS = [
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is wise. I understand you want to keep metadata-related stuff together but the point of the constants module is for the developer to act as a config file. If some are spread into different places, then the existence of this file is in jeopardy.
Either we think it makes sense as it previously was or we harmonize and move other constants out as well.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it

@benoit74
Copy link
Collaborator Author

I like the approach but I'm not a fan of extending the API with config_any_metadata(). I dont have much argument beside it being another piece of API to maintain and the relative confusion it introduces: one has to dig into docstring to understand what to use.

Indeed, it is "a shame" to have two APIs where one with two arguments would suffice. I already simplified a lot compared to original code I wrote but I have to admit I could go even one step further.

I also wonder if we need all this flexibility: Metadata, list of metadata, dict of key/value pairs ?

I'm a bit uncertain on this. dict of key/value pairs is nice for its brevity ... but it's poorly typed, and the past has shown that typed objects might help. At the same time, here we really have no idea of what these metadata can be, so a dict is probably sufficient and the Metadata object does not bring much clarity.

@rgaudin
Copy link
Member

rgaudin commented Nov 19, 2024

I'm a bit uncertain on this. dict of key/value pairs is nice for its brevity ... but it's poorly typed, and the past has shown that typed objects might help. At the same time, here we really have no idea of what these metadata can be, so a dict is probably sufficient and the Metadata object does not bring much clarity.

Yes, I want to say it's easier to extend later when there's a need than to maintain potential use cases but it's really up to you. It's not not a big deal either way

Base automatically changed from data_src to main November 19, 2024 08:49
@benoit74
Copy link
Collaborator Author

Superseeded by #221

@benoit74 benoit74 closed this Nov 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[next major] remove **extra from Creator.config_metadata
2 participants