
Plan to libzim2 #325

@mgautierfr

Yes, the title is a lie.
libzim is already at version 6, we won't rename libzim to libzim2, and the zim format may not change.
But we are discussing important changes to the libzim API and to how we interpret the zim format.


This plan comes from several needs :

The plan is to pave the way for these changes.
Not everything will be done at once, but we should discuss it all at the same time.

Removing namespace

Totally removing the namespace from the zim format is a pretty complex thing:
it means a complete change of the zim file format and rewriting a big part of the library.
On top of that, removing namespaces raises new questions (how do we store the metadata?).

So the idea is to keep the namespaces but use them differently :

  • Make the namespace totally internal to the implementation.
  • Change the API to hide the namespace and provide higher-level functions.

Creator

The creator will have different methods :

  • addMetadata (instead of adding an article in the 'M' namespace)
  • addRedirect (instead of adding a redirect article)
  • addEntry (instead of adding a "classical" article)

This will simplify user code, as it no longer needs to create an article for
metadata and redirects.
A "classical" article will not need a getNamespace method, and its
path will not contain the namespace.

The namespaces will still be used :

'M' and 'X' for metadata and indexes (as before)

All other content will go in the "default" namespace : 'E' (for entry)
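
To make the idea concrete, here is a minimal sketch of a creator that hides the namespace. SketchCreator and its inspection helpers (contentAt, redirectTargetAt) are hypothetical names used only for illustration, not the libzim API. The sketch assumes metadata goes to 'M' and everything else to the default 'E' namespace, as described above.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// Hypothetical sketch, NOT the libzim API: the caller never names a namespace.
class SketchCreator {
public:
    // Metadata is stored internally in the 'M' namespace.
    void addMetadata(const std::string& name, const std::string& content) {
        contents_[Key('M', name)] = content;
    }
    // Content and redirects both live in the default 'E' namespace.
    void addEntry(const std::string& path, const std::string& content) {
        contents_[Key('E', path)] = content;
    }
    void addRedirect(const std::string& path, const std::string& targetPath) {
        redirects_[Key('E', path)] = targetPath;
    }

    // Inspection helpers, present only so the sketch can be checked.
    const std::string& contentAt(char ns, const std::string& path) const {
        return contents_.at(Key(ns, path));  // throws std::out_of_range if absent
    }
    const std::string& redirectTargetAt(const std::string& path) const {
        return redirects_.at(Key('E', path));
    }

private:
    using Key = std::pair<char, std::string>;
    std::map<Key, std::string> contents_;
    std::map<Key, std::string> redirects_;
};
```

User code then only deals with paths and names, for example addEntry("index.html", ...) followed by addRedirect("home", "index.html").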

Article will lose these methods:

  • isRedirect, getRedirectUrl => no longer used, we have the addRedirect method
  • isLinkTarget => Never used
  • isDeleted => Never used

Reader

The reader API has several flaws.
The first one is about articles.

The File methods that get an article always return an Article instance,
whether or not the article exists, and whether or not it is a redirect.
The user has to call the good() method to check that the article has been found, and isRedirect to check whether it is a redirect.
This is error-prone, as the user will forget a check at some point.

On top of that, Article objects are lightweight. While this is a nice feature (we can copy the article),
it also means that we cannot store things in the article (such as the url, title, namespace or blob).

The File object will have several new methods:

  • string getMetadata(string name) throw NotFound to get the content associated with the metadata named name.
    If the metadata is not found, an exception is raised.
  • Entry getHandleBy*() throw NotFound to get a handle on an entry (article or redirect) based on the url or title.
    If the handle is not found, an exception is raised.

The handle itself will not contain much, but will have these methods:

  • bool isRedirect() to know whether the handle is a redirect.
  • string getPath() and string getTitle(), as both articles and redirects have a path and a title.
  • shared_ptr<Entry> getEntry(bool follow) throw IsRedirect to get an entry if the handle is not a redirect.
    If the handle is a redirect and follow is false, an exception is thrown.
    If the handle is a redirect and follow is true, this is identical to getRedirect.
  • shared_ptr<Entry> getRedirect() throw NotFound to get the entry the redirect points to.
    If the redirect points to a non-existent article, NotFound is raised.

So it would be impossible to get an instance of an invalid Handle or Entry.
If the user gets an Entry, they can use it without further checks.
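
The handle/entry split described above could look roughly like the following sketch. It is an in-memory toy, not the libzim implementation: SketchFile, its addEntry/addRedirect helpers and the Record struct are hypothetical. Two behaviours are my assumptions rather than part of the proposal: getRedirect on a non-redirect handle throws NotFound, and chains of redirects are followed with a hop limit.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>

struct NotFound : std::runtime_error { using std::runtime_error::runtime_error; };
struct IsRedirect : std::runtime_error { using std::runtime_error::runtime_error; };

struct Entry { std::string path, title, content; };

class SketchFile;

// A handle only knows its path, title and (for redirects) its target.
class Handle {
public:
    bool isRedirect() const { return !target_.empty(); }
    const std::string& getPath() const { return path_; }
    const std::string& getTitle() const { return title_; }
    std::shared_ptr<Entry> getEntry(bool follow) const;
    std::shared_ptr<Entry> getRedirect() const;

private:
    friend class SketchFile;
    Handle(const SketchFile* file, std::string path, std::string title, std::string target)
        : file_(file), path_(std::move(path)), title_(std::move(title)),
          target_(std::move(target)) {}
    const SketchFile* file_;
    std::string path_, title_, target_;
};

// Hypothetical in-memory stand-in for the reading File.
class SketchFile {
public:
    void addEntry(const std::string& path, const std::string& title,
                  const std::string& content) {
        records_[path] = Record{title, content, ""};
    }
    void addRedirect(const std::string& path, const std::string& title,
                     const std::string& target) {
        records_[path] = Record{title, "", target};
    }
    Handle getHandleByPath(const std::string& path) const {
        auto it = records_.find(path);
        if (it == records_.end()) throw NotFound("no entry at path: " + path);
        return Handle(this, path, it->second.title, it->second.target);
    }

private:
    friend class Handle;
    struct Record { std::string title, content, target; };
    std::shared_ptr<Entry> load(const std::string& path) const {
        const Record& r = records_.at(path);
        return std::make_shared<Entry>(Entry{path, r.title, r.content});
    }
    std::map<std::string, Record> records_;
};

std::shared_ptr<Entry> Handle::getEntry(bool follow) const {
    if (!isRedirect()) return file_->load(path_);
    if (!follow) throw IsRedirect(path_ + " is a redirect");
    return getRedirect();
}

std::shared_ptr<Entry> Handle::getRedirect() const {
    // Assumption: calling getRedirect on a non-redirect handle is an error.
    if (!isRedirect()) throw NotFound(path_ + " is not a redirect");
    Handle next = file_->getHandleByPath(target_);  // NotFound if the target is missing
    // Assumption: follow redirect chains, with a hop limit to guard against cycles.
    for (int hops = 0; next.isRedirect(); ++hops) {
        if (hops >= 16) throw NotFound("redirect chain too long from: " + path_);
        next = file_->getHandleByPath(next.target_);
    }
    return next.getEntry(false);
}
```

With this shape, any Entry the caller holds is valid by construction: every failure mode surfaces as an exception at lookup time instead of as a good() check that can be forgotten.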

Removed methods

  • Article File::getArticle(char ns, const std::string& url) const
  • Article File::getArticleByTitle(char ns, const std::string& title) const
  • article_index_type File::getNamespaceBeginOffset(char ch) const
  • article_index_type File::getNamespaceEndOffset(char ch) const
  • article_index_type File::getNamespaceCount(char ns) const
  • std::string File::getNamespaces() const
  • bool File::hasNamespace(char ch) const
  • const_iterator File::findByTitle(char ns, const std::string& title) const
  • const_iterator File::find(char ns, const std::string& url) const
  • char Article::getNamespace() const

It is worth mentioning that, everywhere:

  • Article will be renamed to Entry: images, js or css are not articles.
  • Url will be renamed to Path.

Compatibility

Backward compatibility (New lib reading old zim)

Old zim files can be detected by the existence of the A namespace.

The path exposed to the user (as the article's path) will contain the path (previously longUrl).
This is needed for tools like zimdump, which need to write articles in the correct subdirectories to preserve relative links.

When accessing an article from a path :

  • Path without namespace on old zim => search article in A namespace
  • Path without namespace on new zim => No change
  • Path with namespace on old zim => use the namespace in the path (to be able to access image in I)
  • Path with namespace on new zim (old bookmark) => remove namespace and search in C namespace.
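
The four rules above can be sketched as a single lookup function. resolvePath is a hypothetical helper, not a libzim function. It assumes that a "path with namespace" has the form "X/rest" (one character followed by '/'), which is my assumption and not something the proposal specifies. The content namespace is passed as a parameter because this proposal names it 'E' in the Creator section and 'C' in the rules above.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical helper: returns {namespace to search, path inside that namespace}.
std::pair<char, std::string> resolvePath(const std::string& path, bool isOldZim,
                                         char contentNamespace) {
    // Assumption: a namespaced path looks like "X/rest".
    const bool hasNamespace = path.size() >= 2 && path[1] == '/';
    if (isOldZim) {
        // Old zim: honour an explicit namespace (e.g. images in I), else default to A.
        if (hasNamespace) return {path[0], path.substr(2)};
        return {'A', path};
    }
    // New zim: an old bookmark may still carry a namespace, so strip it.
    if (hasNamespace) return {contentNamespace, path.substr(2)};
    return {contentNamespace, path};
}
```

One limitation of this heuristic: on a new zim, a genuine path whose first directory is a single character (for example "a/page.html") is indistinguishable from an old namespaced bookmark, so a real implementation would probably need a fallback lookup.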

Forward compatibility (Old lib reading new zim)

The namespace of the article will be presented to the "user".
If the application does not give any meaning to the namespace, there should be no problem.
Links stored in articles are relative and will stay valid.
The user will see the C namespace instead of the A/I namespaces.
The main issue would be bookmarks, when a user updates a zim file without updating the software.
The software would not be able to search for the article in the C namespace instead of A.

Small Writer Changes

While it will not be used internally for now, I plan to add some new methods to the (writing) entry to provide more useful information :

  • getEntryHint, which will return a hint to the creator about the entry, for example but not limited to :
    • Whether it is a main article.
    • Whether the article is a "chrome" entry (css, js, ...).
      We could then group those entries in the same cluster, as they will probably be used together.
    • Whether the entry is part of the first page (html but also css, js).
      Same here, we could group these entries in the same (uncompressed) cluster so that nothing has to be decompressed to display the main page.
  • getIndexingContent. For now, we index the same content as the displayed content (minus the html tags).
    This is not always efficient. (We may not want to index the sources/references/external links of a wikipedia article, or the "related" questions in stackoverflow...)
    But we cannot do this on the libzim side. The "selection" of the text must be done by the scraper, which knows the context.
    This would potentially allow us to index content other than html (such as images or videos), as we dissociate the content from what is indexed.

shouldCompress and shouldIndex may be removed in favor of getEntryHint and getIndexingContent.
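
As an illustration of how the creator could use these two methods, here is a small sketch. The EntryHint values, the WriterEntry struct and groupByHint are all hypothetical; in particular, falling back to the displayed content when no indexing content is provided is my assumption.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical hint values; the real list is left open by the proposal.
enum class EntryHint { None, MainArticle, Chrome, FrontPage };

struct WriterEntry {
    std::string path;
    std::string content;          // what is displayed
    std::string indexingContent;  // what the scraper wants indexed (may be empty)
    EntryHint hint;

    EntryHint getEntryHint() const { return hint; }
    // Assumption: fall back to the displayed content when nothing specific is given.
    const std::string& getIndexingContent() const {
        return indexingContent.empty() ? content : indexingContent;
    }
};

// Groups entry paths by hint, so that each group could be written to its own cluster.
std::map<EntryHint, std::vector<std::string>> groupByHint(
        const std::vector<WriterEntry>& entries) {
    std::map<EntryHint, std::vector<std::string>> clusters;
    for (const auto& entry : entries)
        clusters[entry.getEntryHint()].push_back(entry.path);
    return clusters;
}
```

The creator stays free to ignore the hints: they only influence cluster layout, not the content that is stored.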

Category handling

With the change to namespaces, there is an important change to mention : namespaces now have a meaning internal to libzim.
Before, namespaces already had a meaning, but it was defined at the user code level (kiwix-lib).
So we can add specific features inside libzim itself.

The first one is category handling.

Categories would be stored in the C namespace. The path of a category would represent its name, and its content would be a binary blob listing the indexes of the articles in the category.
It is possible to add extra parameters to a dirent in libzim. We don't use them for now, but now is the time.
The entry's dirent would have an extra parameter: the index of its category.

The (writing) entry would have one more method to implement : getCategory(), returning the name (the path) of the article's category.
The (reading) file would have new methods to get the list of categories, or an iterator over the entries in a category.

The entry would have a new method getCategory that returns the category name of the entry.
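
The binary layout of a category's content is not fixed by this proposal. As one possible encoding, purely as an assumption for discussion, the sketch below stores each article index as a 32-bit little-endian integer. encodeCategory and decodeCategory are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Assumed layout: a flat sequence of 32-bit little-endian article indexes.
std::string encodeCategory(const std::vector<std::uint32_t>& indexes) {
    std::string blob;
    blob.reserve(indexes.size() * 4);
    for (std::uint32_t index : indexes)
        for (int shift = 0; shift < 32; shift += 8)
            blob.push_back(static_cast<char>((index >> shift) & 0xFF));
    return blob;
}

std::vector<std::uint32_t> decodeCategory(const std::string& blob) {
    if (blob.size() % 4 != 0) throw std::invalid_argument("truncated category blob");
    std::vector<std::uint32_t> indexes;
    indexes.reserve(blob.size() / 4);
    for (std::size_t i = 0; i < blob.size(); i += 4) {
        std::uint32_t index = 0;
        for (int b = 0; b < 4; ++b) {
            const auto byte = static_cast<unsigned char>(blob[i + b]);
            index |= static_cast<std::uint32_t>(byte) << (8 * b);
        }
        indexes.push_back(index);
    }
    return indexes;
}
```

A fixed-width layout like this makes the number of entries in a category equal to the blob size divided by four, which keeps an iterator over the category trivial to implement.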

[Question] Should we allow category on redirect ?
[Question] Should we allow an entry to be in several categories ?

Entry template

In the same way that we store categories in the C namespace, we can store templates in the T namespace.
Each entry would store the index of its layout as an extra parameter (as for the category).

Templates would be mustache templates and could access the entry's information (path, title, content, category and extradata) and the zim's metadata (name, "host", ...).

The getContent method of the entry would return the rendered content. (If no template is set, it behaves like getRawContent.)
The getRawContent method would return the content stored in the entry, without template rendering.
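
The relation between getContent and getRawContent can be illustrated with a toy renderer. This is not a mustache implementation: renderTemplate is a hypothetical helper that only substitutes plain "{{name}}" tags, does no HTML escaping, and renders unknown names as empty. TemplatedEntry is equally hypothetical, and a real version would also expose the category, extradata and zim metadata.

```cpp
#include <cassert>
#include <map>
#include <string>

// Toy substitution of "{{name}}" tags; NOT a full mustache renderer.
std::string renderTemplate(const std::string& tpl,
                           const std::map<std::string, std::string>& vars) {
    std::string out;
    std::size_t pos = 0;
    while (true) {
        const std::size_t open = tpl.find("{{", pos);
        if (open == std::string::npos) break;
        const std::size_t close = tpl.find("}}", open + 2);
        if (close == std::string::npos) break;
        out.append(tpl, pos, open - pos);  // literal text before the tag
        const auto it = vars.find(tpl.substr(open + 2, close - open - 2));
        if (it != vars.end()) out += it->second;  // unknown names render as empty
        pos = close + 2;
    }
    out.append(tpl, pos, std::string::npos);  // trailing literal text
    return out;
}

struct TemplatedEntry {
    std::string path, title, rawContent;
    const std::string* layout;  // nullptr when the entry has no template

    const std::string& getRawContent() const { return rawContent; }
    std::string getContent() const {
        if (!layout) return rawContent;  // no template: same as getRawContent
        return renderTemplate(*layout,
                              {{"path", path}, {"title", title}, {"content", rawContent}});
    }
};
```

For example, an entry with title "Hello", raw content "<p>body</p>" and the layout "<h1>{{title}}</h1><main>{{content}}</main>" would return the raw content wrapped in the layout from getContent, and the unwrapped content from getRawContent.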

Entry extra data

While it would be possible to add extra information to an entry using extra parameters, I prefer to keep extra parameters for things that are pretty short and "technical".
It would also be possible to use another category to store the extra data for an entry, but it would double the number of dirents (and copy the path).

The solution I propose is to store the extra data in a blob and to add, as an extra parameter in the dirent, the cluster/blob number of the extradata.
So there will be two blobs per entry (at least for entries with extradata).

The extradata itself would use the MessagePack format (https://msgpack.org/) and the top level object would be a map.

The (writing) entry will have a new method getExtraDatas returning a std::map for the extradata.
The (reading) entry will have a new method getExtraDataAs*(std::string name) to get the value of a particular extradata.
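
To give an idea of what the stored blob would look like, here is a minimal hand-written encoder for the simplest case. It assumes that the std::map is a map of strings to strings, which is my assumption, and it uses only MessagePack's fixmap (at most 15 pairs) and fixstr (at most 31 bytes) types. encodeExtraData is a hypothetical name, and a real implementation would more likely use an existing MessagePack library.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>

// Encodes a small string-to-string map as a MessagePack fixmap of fixstr values.
std::string encodeExtraData(const std::map<std::string, std::string>& data) {
    if (data.size() > 15) throw std::length_error("fixmap holds at most 15 pairs");
    std::string out;
    out.push_back(static_cast<char>(0x80 | data.size()));  // fixmap header: 1000xxxx
    auto putString = [&out](const std::string& s) {
        if (s.size() > 31) throw std::length_error("fixstr holds at most 31 bytes");
        out.push_back(static_cast<char>(0xa0 | s.size()));  // fixstr header: 101xxxxx
        out += s;
    };
    for (const auto& pair : data) {
        putString(pair.first);
        putString(pair.second);
    }
    return out;
}
```

A single pair "a" => "b" encodes to the five bytes 0x81 0xa1 'a' 0xa1 'b', so the overhead per short extradata value stays at one byte per string plus one byte for the map.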

[Question] Should we allow extradata on redirect ?

Conclusion

These are huge changes.
However, the namespace change is mostly an API change, without real internal code changes.
But it clearly introduces an API break, and it should be discussed with the users of libzim.
It somehow introduces an API break if some reader/implementation assumes that articles are always in the A namespace.
Projects using libzim will have to be adapted (kiwix-lib, zim-tools, zimwriterfs, node and python wrappers).

All other improvements are a bit more complex to do, but can be done separately later. They are based on the new way of handling namespaces, and I think we should agree on them, at least in outline.

Please give feedback on this proposal.
