Yes, the title is a lie.
libzim is already at version 6, we won't rename libzim to libzim2, and the zim format may not change.
But we are discussing important changes in the libzim api and how we interpret the zim format.
This plan comes from several needs:
- Remove the namespaces (Getting rid of the namespaces #15)
- Add category handling for the articles (Native handling of categories inside the ZIM file/format #75, Feature request: Add keyword/tag support #317)
- Use templates to render the articles, and so only store "interesting" content in the article (Useful template mgmt functions in the libzim #80)
- Add a way to associate custom metadata (keywords, ...) with an article (Feature request: Add keyword/tag support #317, Feature request: Add metadata support #316)
The plan is to pave the way for these changes.
Not everything will be done at once, but we should discuss all of it at the same time.
Removing namespace
Totally removing the namespaces from the zim format is a pretty complex thing:
it means a total zim file format change, and rewriting a big part of the library.
On top of that, removing them leads to new questions (How to store the metadata ?)
So the idea is to keep the namespaces but use them differently :
- Make the namespace totally internal in the implementation.
- Change the API to hide the namespace and provide higher-level functions.
Creator
The creator will have different methods :
- addMetadata (instead of adding article in 'M' namespace)
- addRedirect (instead of adding a redirect article)
- addEntry (instead of adding "classical" article)
It will simplify the user code, as it no longer needs to create an article for
metadata and redirects.
The "classical" article will not need to have a getNamespace method, and the
path will not contain the namespace.
The namespaces will still be used :
'M' and 'X' for metadata and indexes (as before)
All other content will go in the "default" namespace : 'E' (for entry)
Article will lose these methods:
- isRedirect, getRedirectUrl => no longer used, we have the addRedirect method
- isLinkTarget => Never used
- isDeleted => Never used
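To make the creator changes above concrete, here is a minimal sketch of what such an interface could look like. Only the method names addMetadata, addRedirect and addEntry and the namespace letters 'M' and 'E' come from this proposal; the class name SketchCreator, the signatures, the in-memory storage and the inspection helpers are assumptions made for illustration, not the libzim API.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// Hypothetical sketch: none of these signatures are the real libzim API.
class SketchCreator {
public:
    // Stored internally in the 'M' namespace.
    void addMetadata(const std::string& name, const std::string& content) {
        store('M', name, content);
    }

    // Stored in the default 'E' namespace; the caller never passes a namespace.
    void addEntry(const std::string& path, const std::string& content) {
        store('E', path, content);
    }

    // A redirect only needs its own path and the path of its target.
    void addRedirect(const std::string& path, const std::string& targetPath) {
        redirects_[path] = targetPath;
    }

    // Inspection helpers, present only so the example can be checked.
    bool hasStored(char ns, const std::string& path) const {
        return items_.count(std::make_pair(ns, path)) != 0;
    }
    std::string redirectTarget(const std::string& path) const {
        auto it = redirects_.find(path);
        return it == redirects_.end() ? std::string() : it->second;
    }

private:
    void store(char ns, const std::string& path, const std::string& content) {
        items_[std::make_pair(ns, path)] = content;
    }
    std::map<std::pair<char, std::string>, std::string> items_;
    std::map<std::string, std::string> redirects_;
};

// Usage: the user code never mentions a namespace.
inline SketchCreator buildExample() {
    SketchCreator creator;
    creator.addMetadata("Title", "My ZIM");
    creator.addEntry("index.html", "<html>...</html>");
    creator.addRedirect("home", "index.html");
    assert(creator.hasStored('E', "index.html"));
    return creator;
}
```

The point of the sketch is that the namespace becomes an implementation detail: it is chosen by the method being called, not by the user.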
Reader
The reader api has several flaws.
The first one is about the articles.
The File methods to get an article always return an Article instance,
whether the article exists or not, and whether or not it is a redirect.
The user has to call the good() method to check that the article has been found, and isRedirect to check whether it is a redirect.
This is error-prone, as the user will forget a check at some point.
On top of that, Articles are lightweight objects. While it is a nice feature (we can copy the article),
it also means that we cannot store things in the article (such as url, title, namespace or blob).
The File object will have several new methods:
- string getMetadata(string name) throw NotFound, to get the content associated with the metadata named name. If the metadata is not found, an exception is raised.
- Entry getHandleBy*() throw NotFound, to get a handle on an entry (article or redirect) based on the url or title. If the handle is not found, an exception is raised.
The handle itself will not contain a lot but will have methods:
- bool isRedirect(), to know if the handle is a redirect or not.
- string getPath() and string getTitle(), as both article and redirect have a path and a title.
- shared_ptr<Entry> getEntry(bool follow) throw IsRedirect, to get an entry if the handle is not a redirect. If the handle is a redirect and follow is false, throw an exception. If the handle is a redirect and follow is true, identical to getRedirect.
- shared_ptr<Entry> getRedirect() throw NotFound, to get the entry to which the redirect points. If the redirect points to a non-existing article, NotFound is raised.
So, it would be impossible to get an instance of an invalid Handle or Entry.
If the user gets an Entry, they can use it without further checks.
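The handle-based reader API described above could be sketched as follows. This is an illustration under assumptions, not the libzim implementation: the in-memory store, the demo* helpers, getHandleByPath as one concrete getHandleBy* variant, and the choice to throw NotFound when getRedirect is called on a non-redirect are all hypothetical. Exceptions are plain C++ exceptions rather than throw specifications.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <stdexcept>
#include <string>

// Hypothetical sketch of the proposed reader API; not the real libzim classes.
struct NotFound : std::runtime_error {
    explicit NotFound(const std::string& what) : std::runtime_error(what) {}
};
struct IsRedirect : std::runtime_error {
    explicit IsRedirect(const std::string& what) : std::runtime_error(what) {}
};

struct Entry {
    std::string path;
    std::string title;
    std::string content;
};

class File;

class Handle {
public:
    bool isRedirect() const;
    std::string getPath() const { return path_; }
    std::string getTitle() const;
    std::shared_ptr<Entry> getEntry(bool follow) const;  // may throw IsRedirect
    std::shared_ptr<Entry> getRedirect() const;          // may throw NotFound
private:
    friend class File;
    Handle(const File* file, const std::string& path) : file_(file), path_(path) {}
    const File* file_;
    std::string path_;
};

class File {
public:
    // demo* methods only fill the in-memory store used by this sketch.
    void demoAddMetadata(const std::string& name, const std::string& value) {
        metadata_[name] = value;
    }
    void demoAddEntry(const std::string& path, const std::string& title,
                      const std::string& content) {
        records_[path] = Record{title, content, std::string(), false};
    }
    void demoAddRedirect(const std::string& path, const std::string& title,
                         const std::string& target) {
        records_[path] = Record{title, std::string(), target, true};
    }
    std::string getMetadata(const std::string& name) const {
        auto it = metadata_.find(name);
        if (it == metadata_.end()) throw NotFound("metadata: " + name);
        return it->second;
    }
    Handle getHandleByPath(const std::string& path) const {
        if (records_.count(path) == 0) throw NotFound("path: " + path);
        return Handle(this, path);
    }
private:
    friend class Handle;
    struct Record { std::string title, content, target; bool redirect; };
    std::map<std::string, Record> records_;
    std::map<std::string, std::string> metadata_;
};

inline bool Handle::isRedirect() const { return file_->records_.at(path_).redirect; }
inline std::string Handle::getTitle() const { return file_->records_.at(path_).title; }

inline std::shared_ptr<Entry> Handle::getRedirect() const {
    // Assumption: asking a non-redirect handle for its redirect is a NotFound.
    if (!isRedirect()) throw NotFound(path_ + " is not a redirect");
    // getHandleByPath throws NotFound when the target does not exist.
    // A real implementation would also need to guard against redirect cycles.
    return file_->getHandleByPath(file_->records_.at(path_).target).getEntry(true);
}

inline std::shared_ptr<Entry> Handle::getEntry(bool follow) const {
    if (isRedirect()) {
        if (!follow) throw IsRedirect(path_);
        return getRedirect();
    }
    const auto& rec = file_->records_.at(path_);
    return std::make_shared<Entry>(Entry{path_, rec.title, rec.content});
}

inline void usageExample() {
    File f;
    f.demoAddEntry("index.html", "Home", "<p>hi</p>");
    f.demoAddRedirect("home", "Home", "index.html");
    Handle h = f.getHandleByPath("home");
    assert(h.isRedirect());
    assert(h.getEntry(true)->path == "index.html");
}
```

With this shape, a caller either holds a valid Handle or Entry or has received an exception; there is no "not found" object whose state must be checked before use.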
Removed methods
- Article File::getArticle(char ns, const std::string& url) const
- Article File::getArticleByTitle(char ns, const std::string& title) const
- article_index_type File::getNamespaceBeginOffset(char ch) const
- article_index_type File::getNamespaceEndOffset(char ch) const
- article_index_type File::getNamespaceCount(char ns) const
- std::string File::getNamespaces() const
- bool File::hasNamespace(char ch) const
- const_iterator File::findByTitle(char ns, const std::string& title) const
- const_iterator File::find(char ns, const std::string& url) const
- char Article::getNamespace() const
It is worth mentioning that, everywhere:
- Article will be renamed to Entry: images, js or css are not articles.
- Url will be renamed to Path.
Compatibility
Backward compatibility (New lib reading old zim)
Old zim files can be detected by the existence of the A namespace.
The path exposed to the user (as the article's path) will contain the path (previously longUrl).
This is needed for tools like zimdump, which need to write the articles in the correct subdirectories to preserve relative links.
When accessing an article from a path:
- Path without namespace on old zim => search article in A namespace
- Path without namespace on new zim => No change
- Path with namespace on old zim => use the namespace in the path (to be able to access image in I)
- Path with namespace on new zim (old bookmark) => remove namespace and search in C namespace.
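The four lookup cases above can be written as a small resolution function. This is a sketch only: resolvePath and hasNamespacePrefix are hypothetical names, and the test for "path with namespace" (one character followed by '/') is a simplifying assumption that a real implementation would need to make stricter, since a new-style path such as "a/b.html" would otherwise be misread.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical helpers, not part of libzim.

// Assumption: a namespaced path looks like "A/Foo" (one character, then '/').
inline bool hasNamespacePrefix(const std::string& path) {
    return path.size() >= 2 && path[1] == '/';
}

// Returns the (namespace, path) pair to search for.
// isOldZim would come from detecting the A namespace in the file.
inline std::pair<char, std::string> resolvePath(const std::string& path,
                                                bool isOldZim) {
    const bool prefixed = hasNamespacePrefix(path);
    if (isOldZim) {
        // Old file: honour an explicit namespace (e.g. images in 'I'),
        // otherwise search in 'A'.
        if (prefixed) return std::make_pair(path[0], path.substr(2));
        return std::make_pair('A', path);
    }
    // New file: a leftover namespace from an old bookmark is dropped,
    // and the search always happens in the content namespace.
    if (prefixed) return std::make_pair('C', path.substr(2));
    return std::make_pair('C', path);
}
```

For example, an old bookmark "A/Foo" opened against a new file would be searched as "Foo" in the C namespace, while "I/logo.png" against an old file keeps its I namespace.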
Forward compatibility (Old lib reading new zim)
The namespace of the article will be presented to the "user".
If the application does not give any meaning to the namespace, there should be no problem.
Links stored in the article are relative and will stay valid.
The user will see the C namespace instead of the A/I namespaces.
The main issue would be with bookmarks, when a user updates a zim file without updating the software.
The software would not know to search for the article in the C namespace instead of A.
Small Writer Changes
While they will not be used internally for now, I plan to add some new methods on the (writing) entry to provide more useful information:
- getEntryHint, which will return a hint to the creator about the entry, for example (but not limited to):
  - If it is a main article.
  - If the article is a "chrome" entry (css, js, ...). Then we could group those entries in the same cluster, as they will probably be used together.
  - If the entry is part of the first page (html but also css, js). Here too, we could group these entries in the same (uncompressed) cluster, so that nothing needs to be decompressed to display the main page.
- getIndexingContent. For now, we are indexing the same content as the displayed content (minus the html tags).
  This is not always efficient. (We may not want to index the sources/references/external links of a wikipedia article, or the "related" questions in stackoverflow...)
  But we cannot do this on the libzim side. The "selection" of the text must be done by the scraper, which knows the context.
  This would potentially allow us to index content other than html (such as images or videos), as we dissociate the content from what is indexed.
It is possible that shouldCompress and shouldIndex will be removed in favor of getEntryHint and getIndexingContent.
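As an illustration of the two writer-side methods, here is a possible shape for them. Only the names getEntryHint and getIndexingContent come from this proposal; the EntryHint values, the WriterEntry base class, the example entries and the cluster-selection function are assumptions chosen to show how a scraper and the creator could cooperate.

```cpp
#include <cassert>
#include <string>

// All names except getEntryHint and getIndexingContent are hypothetical.
enum class EntryHint { None, MainArticle, Chrome, FrontPage };

class WriterEntry {
public:
    virtual ~WriterEntry() {}
    virtual std::string getContent() const = 0;
    virtual EntryHint getEntryHint() const { return EntryHint::None; }
    // Default: index exactly what is displayed, as is done today.
    virtual std::string getIndexingContent() const { return getContent(); }
};

// A scraper-side entry that knows which part of the page is worth indexing.
class WikiArticle : public WriterEntry {
public:
    WikiArticle(const std::string& body, const std::string& references)
        : body_(body), references_(references) {}
    std::string getContent() const override { return body_ + references_; }
    EntryHint getEntryHint() const override { return EntryHint::MainArticle; }
    // The references are displayed but deliberately left out of the index.
    std::string getIndexingContent() const override { return body_; }
private:
    std::string body_;
    std::string references_;
};

class Stylesheet : public WriterEntry {
public:
    explicit Stylesheet(const std::string& css) : css_(css) {}
    std::string getContent() const override { return css_; }
    EntryHint getEntryHint() const override { return EntryHint::Chrome; }
    std::string getIndexingContent() const override { return std::string(); }
private:
    std::string css_;
};

// One possible use of the hint on the creator side.
enum class ClusterGroup { Default, Chrome, FrontPageUncompressed };

inline ClusterGroup chooseCluster(const WriterEntry& entry) {
    switch (entry.getEntryHint()) {
        case EntryHint::Chrome:    return ClusterGroup::Chrome;
        case EntryHint::FrontPage: return ClusterGroup::FrontPageUncompressed;
        default:                   return ClusterGroup::Default;
    }
}
```

An empty indexing content plays the role that shouldIndex returning false plays today, which is one way the older methods could be subsumed.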
Category handling
With the namespace change, there is an important point to mention: the namespaces now have a meaning internal to libzim.
Before that, namespaces already had a meaning, but it was given at the user code level (kiwix-lib).
So we can add specific features inside libzim itself.
The first one is category handling.
Categories would be stored in the C namespace. The path of the category would represent its name, and the content would be binary content listing the indexes of the articles in the category.
It is possible to add extra parameters to a dirent in libzim. We don't use this for now, but now is the time.
The entry's dirent would have an extra parameter: the index of its category.
The (writing) entry would have one more method to implement: getCategory(), returning the name (the path) of the category of the article.
The (reading) file would have new methods to get the list of categories, or an iterator over the entries in a category.
The entry would have a new method getCategory that returns the category name of the entry.
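The binary layout of a category's content is not specified in this proposal. As one possible reading, the sketch below stores the entry indexes as consecutive 32-bit little-endian integers. The function names, the index width and the byte order are all assumptions for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical encoding: each entry index is a 32-bit little-endian integer.
inline std::string encodeCategory(const std::vector<std::uint32_t>& indexes) {
    std::string blob;
    blob.reserve(indexes.size() * 4);
    for (std::uint32_t idx : indexes) {
        for (int shift = 0; shift < 32; shift += 8) {
            blob.push_back(static_cast<char>((idx >> shift) & 0xFF));
        }
    }
    return blob;
}

inline std::vector<std::uint32_t> decodeCategory(const std::string& blob) {
    if (blob.size() % 4 != 0) {
        throw std::invalid_argument("truncated category blob");
    }
    std::vector<std::uint32_t> indexes;
    indexes.reserve(blob.size() / 4);
    for (std::size_t i = 0; i < blob.size(); i += 4) {
        std::uint32_t idx = 0;
        for (int b = 0; b < 4; ++b) {
            const unsigned char byte = static_cast<unsigned char>(blob[i + b]);
            idx |= static_cast<std::uint32_t>(byte) << (8 * b);
        }
        indexes.push_back(idx);
    }
    return indexes;
}
```

A reader could then expose the category as an iterator over the decoded indexes, each of which identifies an entry in the file.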
[Question] Should we allow category on redirect ?
[Question] Should we allow an entry to be in several categories ?
Entry template
In the same way that we store categories in the C namespace, we can store templates in the T namespace.
Each entry would store the index of its layout as an extra parameter (as for the category).
Templates would be mustache templates and could access the entry's information (path, title, content, category and extradata) and the zim's metadata (name, "host", ...).
The method getContent of the entry would return the rendered content. (If no template is set, it behaves like getRawContent.)
The method getRawContent would return the content stored in the entry without doing template rendering.
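The relation between getContent and getRawContent could be sketched like this. The renderer below is a toy that only replaces {{key}} placeholders; it is not a mustache implementation (no sections, no partials, no HTML escaping). The TemplatedEntry struct and the idea of holding the layout text directly in the entry, rather than an index into the T namespace, are simplifications made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Toy renderer: replaces each {{key}} with its value. Unknown keys render
// as empty text. This is a small stand-in, not a mustache implementation.
inline std::string renderTemplate(const std::string& tpl,
                                  const std::map<std::string, std::string>& vars) {
    std::string out;
    std::size_t pos = 0;
    while (true) {
        const std::size_t open = tpl.find("{{", pos);
        const std::size_t close =
            open == std::string::npos ? std::string::npos : tpl.find("}}", open + 2);
        if (close == std::string::npos) {
            out += tpl.substr(pos);  // no complete placeholder left
            break;
        }
        out += tpl.substr(pos, open - pos);
        const std::string key = tpl.substr(open + 2, close - open - 2);
        auto it = vars.find(key);
        if (it != vars.end()) out += it->second;
        pos = close + 2;
    }
    return out;
}

// Hypothetical entry: 'layout' holds the template text, empty if none is set.
struct TemplatedEntry {
    std::string path;
    std::string title;
    std::string rawContent;
    std::string layout;

    std::string getRawContent() const { return rawContent; }

    std::string getContent() const {
        if (layout.empty()) return getRawContent();
        std::map<std::string, std::string> vars;
        vars["path"] = path;
        vars["title"] = title;
        vars["content"] = rawContent;
        return renderTemplate(layout, vars);
    }
};
```

The entry stores only the "interesting" content, and the shared layout is applied at read time.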
Entry extra data
While it would be possible to add extra information to an entry using extra parameters, I prefer to keep the extra parameters for things that are pretty short and "technical".
It would also be possible to use another category to store the extra data for an entry, but it would double the number of dirents (and copy the path).
The solution I propose is to store the extra data in a blob and, as an extra parameter, add the cluster/blob number of the extradata to the dirent.
So, there will be two blobs per entry (at least for entries with extradata).
The extradata itself would use the MessagePack format (https://msgpack.org/) and the top level object would be a map.
The (writing) entry will have a new method getExtraDatas returning a std::map for the extradata.
The (reading) entry will have a new method getExtraDataAs*(std::string name) to get the value of a particular extradata.
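To give an idea of what the extradata blob could contain, here is a minimal hand-written MessagePack encoder for a small map of strings. It only covers the fixmap (up to 15 pairs), fixstr (up to 31 bytes) and str 8 (up to 255 bytes) encodings; restricting the values to strings, and the function names, are assumptions for this sketch. A real implementation would more likely rely on an existing MessagePack library.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>

// Appends one MessagePack string: fixstr (0xa0 | length) for up to 31 bytes,
// str 8 (0xd9, then a one-byte length) for up to 255 bytes.
inline void packString(std::string& out, const std::string& s) {
    if (s.size() <= 31) {
        out.push_back(static_cast<char>(0xa0 | s.size()));
    } else if (s.size() <= 255) {
        out.push_back(static_cast<char>(0xd9));
        out.push_back(static_cast<char>(s.size()));
    } else {
        throw std::length_error("string too long for this sketch");
    }
    out += s;
}

// Encodes the top-level map as a fixmap (0x80 | number of pairs).
// Hypothetical helper: values are limited to strings here.
inline std::string packExtraData(const std::map<std::string, std::string>& data) {
    if (data.size() > 15) {
        throw std::length_error("too many keys for a fixmap");
    }
    std::string out;
    out.push_back(static_cast<char>(0x80 | data.size()));
    for (const auto& kv : data) {
        packString(out, kv.first);
        packString(out, kv.second);
    }
    return out;
}
```

For a map containing only "lang" mapped to "en", the resulting blob is nine bytes long: one byte for the map header, five for the key and three for the value.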
[Question] Should we allow extradata on redirect ?
Conclusion
This is a huge change.
However, the namespace change is mostly an API change without real internal code change.
But it clearly introduces an API break, and it should be discussed with the users of libzim.
It somehow also introduces a break if some reader/implementation assumes that articles are always in the A namespace.
Projects using libzim should be adapted (kiwix-lib, zim-tools, zimwriterfs, node and python wrappers).
All other improvements are a bit more complex to do but can be done separately later. They are based on the new way to handle namespaces, and I think we should agree on them, at least on the outlines.
Please give feedback on this proposal.