Description
One of the foundational pieces of this project, the block-based language of Gutenberg has so far been implemented in the system by a set of parsers generated from a PEG grammar describing the language.
Two factors motivating this choice emerged early in the development cycle: i) a PEG grammar is easy to understand, review and maintain; ii) the parsers generated from it, one in JS and one in PHP, worked out of the box and, with caveats, proved adequate even as the project grew: for instance, the parsing stack easily accommodated block nesting.
This intelligibility and quasi-self-documenting character of the PEG grammar are what led to establishing it as the official specification of the language of Gutenberg in the form of its own package. This is an important step for the long-term success of Gutenberg as not just an editor, but a multi-environment ecosystem bound by the same language.
Opportunities
As Gutenberg braces for merging into core, all the tooling surrounding blocks needs to be as robust and versatile as any other part of WordPress. Among other things, this means that webmasters, plugin authors and platform engineers should have the power to hot-swap parsers to optimize for given requirements — whether requirements concern block analysis, hardware constraints, runtime, etc. Past experiments have confirmed that the PEG grammar is easily tweaked to output specific block information and can be optimized if told to ignore other aspects.
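To make the idea of hot-swapping concrete, here is a minimal sketch of a parser registry. Everything in it is hypothetical: `createParserRegistry`, `register`, `use` and the two toy parsers are illustrative names, not an existing Gutenberg API. It only assumes that a parser is a function from a post's content string to an array of blocks.

```javascript
// Hypothetical registry: none of these names come from Gutenberg itself.
function createParserRegistry() {
  const parsers = new Map();
  let active = null;

  return {
    // A parser is assumed to be any function (string) => array of blocks.
    register(name, parse) {
      if (typeof parse !== 'function') {
        throw new TypeError(`Parser "${name}" must be a function`);
      }
      parsers.set(name, parse);
      if (active === null) active = name; // first registered parser is the default
    },
    // Swap the active parser at runtime.
    use(name) {
      if (!parsers.has(name)) {
        throw new Error(`Unknown parser "${name}"`);
      }
      active = name;
    },
    activeName() {
      return active;
    },
    parse(content) {
      if (active === null) throw new Error('No parser registered');
      return parsers.get(active)(content);
    },
  };
}

// Two toy parsers standing in for real implementations.
// "full" returns the whole document as a single freeform block;
// "names-only" extracts only block names and ignores attributes and content.
const registry = createParserRegistry();
registry.register('full', (doc) => [{ blockName: null, innerHTML: doc }]);
registry.register('names-only', (doc) =>
  Array.from(doc.matchAll(/<!-- wp:([a-z][a-z0-9\/-]*)/g), (m) => ({ blockName: m[1] }))
);

const post = '<!-- wp:paragraph --><p>Hi</p><!-- /wp:paragraph -->';
const fullResult = registry.parse(post); // uses the default, "full"
registry.use('names-only');
const namesResult = registry.parse(post); // same call, different parser
```

Callers only ever see `registry.parse`, so a site could switch to a parser tuned for block analysis or for constrained hardware without touching the calling code.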
Furthermore, we have a duty to perform as well as possible. Our auto-generated parsers are known not to perform as well as other systems, a trade-off we made for clarity. Client-side parser performance has so far been satisfactory, but the server-side parser is a known bottleneck and may not be suitable for demanding posts. The server-side issue was mitigated by bypassing the PHP parser altogether when rendering front-end pages, in favor of a simpler parse, but the fact remains that obtaining a full parse tree of a post on the server should be performant and robust.
Finally, Gutenberg blocks are to reach well beyond WP-Admin and base WordPress installations. Namely, any third-party clients should be able to work with Gutenberg blocks, from the mobile apps to any community tool. This requirement implies that the language of Gutenberg will need to be parsed in diverse environments and runtimes, well beyond the scope of the base Gutenberg project.
Thus, there is the opportunity to establish the language of Gutenberg as a de facto standard for semantic content formatting. Perhaps choosing parsers will be similar to choosing database systems: most site owners will go with the default, such as MySQL, but competing software abounds for those with different needs.
Parser tooling
Closes #6994.
Before we look at parsing itself, we need to be able to develop, validate and compare implementations.
Develop. Any party should be able to understand the specified language. For this, it should be clear that the PEG grammar is intended to be used as a reference. Tasks:
- Offer PEG grammar and its parser as its own spec-parser package. Packages: Create new spec-parser package #7664
- Auto-generate human-readable version of block grammar. Docs: Auto-generate human-readable version of Gutenberg block grammar #6116
- Increase visibility of this reference, perhaps via a standalone documentation page.
Validate. Developers should be provided official tools that can, as best as possible, test for correctness of parsers. Tasks:
- Develop a large and diverse corpus, comprising posts long and short, with many and few blocks, different kinds of content, written in different locales, and with documents full of various kinds of invalid blocks.
  - @dmsnell has started this in his repository gutenberg-document-library.
- Provide easy-to-use validators that test against the corpus.
- Add a Jest test to build and compare language-agnostic parsers. #6030
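As a rough illustration of what a corpus-based validator could look like, the sketch below runs a candidate parser over a small corpus and compares its output with that of a reference parser. All names (`validateParser`, `deepEqual`, the corpus entries) are hypothetical, and the two toy parsers only extract block names; a real validator would compare full parse trees produced by the spec parser.

```javascript
// Structural equality for JSON-like parse output (objects, arrays, primitives).
function deepEqual(a, b) {
  if (a === b) return true;
  if (typeof a !== 'object' || typeof b !== 'object' || a === null || b === null) return false;
  if (Array.isArray(a) !== Array.isArray(b)) return false;
  const keysA = Object.keys(a);
  const keysB = Object.keys(b);
  if (keysA.length !== keysB.length) return false;
  return keysA.every(
    (key) => Object.prototype.hasOwnProperty.call(b, key) && deepEqual(a[key], b[key])
  );
}

// Run the candidate over every corpus entry and compare with the reference.
function validateParser(candidate, reference, corpus) {
  const failures = [];
  for (const { name, content } of corpus) {
    let actual;
    try {
      actual = candidate(content);
    } catch (error) {
      failures.push({ name, reason: `threw: ${error.message}` });
      continue;
    }
    if (!deepEqual(actual, reference(content))) {
      failures.push({ name, reason: 'output differs from reference' });
    }
  }
  return { total: corpus.length, passed: corpus.length - failures.length, failures };
}

// Hypothetical corpus: a real one would be far larger and more diverse.
const corpus = [
  { name: 'empty', content: '' },
  { name: 'single-paragraph', content: '<!-- wp:paragraph --><p>Hi</p><!-- /wp:paragraph -->' },
  { name: 'namespaced-block', content: '<!-- wp:my-plugin/card /-->' },
];

// Toy reference parser: returns block names, including namespaced ones.
const referenceParser = (doc) =>
  Array.from(doc.matchAll(/<!-- wp:([a-z][a-z0-9\/-]*)/g), (m) => m[1]);

// Toy candidate with a deliberate bug: it stops at the first non-letter.
const candidateParser = (doc) =>
  Array.from(doc.matchAll(/<!-- wp:([a-z]+)/g), (m) => m[1]);

const report = validateParser(candidateParser, referenceParser, corpus);
```

Here the candidate passes the first two documents and fails on the namespaced block, which is the kind of edge case a diverse corpus is meant to surface.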
Compare. A parser doesn't just need to work, it likely needs to perform well. To help inform development (and instigate a healthy sense of competition?), it should be possible to compare parser performance side-by-side or multilaterally. Here, performance encompasses both space and time. Tasks:
- Parser: Build system to compare alternative parser implementations #6831
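A side-by-side comparison could start from something as small as the following Node.js sketch, which times each parser over the same documents and samples heap usage as a crude proxy for space. The function names and the two toy parsers are hypothetical, and heap sampling is only approximate because garbage collection can run at any point; a serious harness would isolate runs and repeat them many more times.

```javascript
// Time a parser over a set of documents and sample heap usage (Node.js only).
function benchmark(parse, documents, iterations = 20) {
  for (const doc of documents) parse(doc); // warm-up pass, not measured

  const heapBefore = process.memoryUsage().heapUsed;
  let peakHeap = heapBefore;
  const start = process.hrtime.bigint();
  for (let i = 0; i < iterations; i++) {
    for (const doc of documents) parse(doc);
    peakHeap = Math.max(peakHeap, process.memoryUsage().heapUsed);
  }
  const elapsedNs = process.hrtime.bigint() - start;

  const runs = iterations * documents.length;
  return {
    runs,
    meanMs: Number(elapsedNs) / 1e6 / runs,
    heapGrowthBytes: peakHeap - heapBefore, // rough: GC may hide or inflate growth
  };
}

// Benchmark several parsers on the same input, fastest first.
function compareParsers(parsers, documents, iterations) {
  return Object.entries(parsers)
    .map(([name, parse]) => ({ name, ...benchmark(parse, documents, iterations) }))
    .sort((a, b) => a.meanMs - b.meanMs);
}

// Toy parser 1: regular expression over block openers.
const regexBlockNames = (doc) =>
  Array.from(doc.matchAll(/<!-- wp:([a-z][a-z0-9\/-]*)/g), (m) => m[1]);

// Toy parser 2: manual scan using indexOf.
function scanBlockNames(doc) {
  const opener = '<!-- wp:';
  const names = [];
  let at = doc.indexOf(opener);
  while (at !== -1) {
    const nameStart = at + opener.length;
    let nameEnd = nameStart;
    while (nameEnd < doc.length && doc[nameEnd] !== ' ') nameEnd++;
    names.push(doc.slice(nameStart, nameEnd));
    at = doc.indexOf(opener, nameEnd);
  }
  return names;
}

// Hypothetical workload: one post made of 200 paragraph blocks.
const documents = ['<!-- wp:paragraph --><p>Hi</p><!-- /wp:paragraph -->'.repeat(200)];
const results = compareParsers({ regex: regexBlockNames, scan: scanBlockNames }, documents, 10);
console.table(results);
```

Which parser comes out ahead depends on the runtime and the documents, so the sketch makes no claim about the ordering; the point is that every implementation is measured on identical input.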
Ultimately, all tasks for the above paragraphs should converge towards providing:
- A standalone reference Web page providing spec documentation and validation-comparison tools. Tentatively: https://wordpress.github.io/gutenberg/
Related open issues:
- Check support for /u flag in installed PCRE library #4852
- Verify parser doesn't cause PCRE fault with Subresource Integrity (SRI) Manager plugin and PHP 5.6.36. PCRE infinite recursion segmentation fault after activating gutenberg plugin #8671
Exploring new parsers
The past months have seen fascinating and encouraging progress on competing parsers.
- @Hywan developed a Rust parser that compiles to a number of different targets, and notably to those that core Gutenberg cares about: the server, in the form of a PHP extension; the client, via WebAssembly binary and an ASM.js fallback. Through use of the Hoa compiler, native PHP bindings are also available, including an experimental PCRE parser.
- @dmsnell wrote a hand-coded parser (Parser: Propose new hand-coded parser #8083), working with the PHP interpreter to seek optimal performance.
- @pento asked the question of "if we start with the spec grammar, how much can we gain from fine-tuning it?" in Improve Parser Performance #8044.
These experiments have been benchmarked using gutenberg-document-library, and results have been extremely encouraging.
to do: insert benchmark tables for CPU and memory