Overview of Short-term Parsing Enhancements #8244

Closed
@mcsf

Description

One of the foundational pieces of this project, the block-based language of Gutenberg has so far been implemented in the system by a set of parsers generated from a PEG grammar describing the language.

Two motivating factors to back this emerged early on in the development cycle: i) a PEG grammar is easy to understand, review and maintain; ii) the parsers generated from it, one in JS and one in PHP, worked out of the box, and — with caveats — this proved adequate even as the project grew: for instance, the parsing stack easily accommodated block nesting.
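To make the nesting point concrete, here is a minimal TypeScript sketch of a parser for the comment-delimited block syntax. It is not the PEG-generated parser or any official implementation. The `Block` shape, the `DELIMITER` pattern and `parseBlocks` are illustrative names, and the pattern is a simplification that assumes well-formed JSON attributes and discards freeform content outside blocks.

```typescript
// Illustrative output shape; field names are assumptions, not the official format.
interface Block {
  name: string;                    // e.g. "core/paragraph"
  attrs: Record<string, unknown>;  // JSON attributes from the opening delimiter
  innerBlocks: Block[];            // nested blocks
  innerHTML: string;               // raw content with nested blocks removed
}

// Simplified delimiter pattern (an assumption, looser than the real grammar).
// Groups: 1 = "/" for a closer, 2 = block name, 3 = JSON attrs, 4 = "/" for a void block.
const DELIMITER =
  /<!--\s+(\/)?wp:([a-z][a-z0-9_-]*(?:\/[a-z][a-z0-9_-]*)?)\s+(\{[\s\S]*?\}\s+)?(\/)?-->/g;

function parseBlocks(doc: string): Block[] {
  // A synthetic root collects top-level blocks; the stack tracks open blocks.
  const root: Block = { name: "", attrs: {}, innerBlocks: [], innerHTML: "" };
  const stack: Block[] = [root];
  let cursor = 0;
  let m: RegExpExecArray | null;

  DELIMITER.lastIndex = 0; // the regex is global, so reset its state on every call
  while ((m = DELIMITER.exec(doc)) !== null) {
    const [token, closer, rawName, rawAttrs, isVoid] = m;
    const top = stack[stack.length - 1];

    // Text since the previous delimiter belongs to the innermost open block.
    top.innerHTML += doc.slice(cursor, m.index);
    cursor = m.index + token.length;

    // Names without a namespace default to "core/".
    const name = rawName.indexOf("/") >= 0 ? rawName : "core/" + rawName;

    if (closer) {
      // Only pop when the closer matches the innermost open block.
      if (stack.length > 1 && top.name === name) stack.pop();
      continue;
    }

    const block: Block = {
      name,
      attrs: rawAttrs ? JSON.parse(rawAttrs) : {}, // throws on malformed JSON
      innerBlocks: [],
      innerHTML: "",
    };
    top.innerBlocks.push(block);
    if (!isVoid) stack.push(block); // void blocks have no body to descend into
  }

  stack[stack.length - 1].innerHTML += doc.slice(cursor);
  return root.innerBlocks;
}
```

The explicit stack is what lets a single pass handle nesting: every opener pushes, every matching closer pops, and content is always attributed to the block on top.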

This intelligibility and quasi-self-documenting character of the PEG grammar are what led to establishing it as the official specification of the language of Gutenberg in the form of its own package. This is an important step for the long-term success of Gutenberg as not just an editor, but a multi-environment ecosystem bound by the same language.

Opportunities

As Gutenberg braces for merging into core, all the tooling surrounding blocks needs to be as robust and versatile as any other part of WordPress. Among other things, this means that webmasters, plugin authors and platform engineers should have the power to hot-swap parsers to optimize for given requirements — whether requirements concern block analysis, hardware constraints, runtime, etc. Past experiments have confirmed that the PEG grammar is easily tweaked to output specific block information and can be optimized if told to ignore other aspects.
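One way to picture hot-swappable parsers is a small registry that holds several implementations behind one function signature. The TypeScript sketch below is hypothetical: `ParserRegistry`, `BlockParser`, `ParsedBlock` and `namesOnlyParser` are names invented for illustration and do not correspond to any existing Gutenberg or WordPress API.

```typescript
// Hypothetical parse-tree node; field names are assumptions.
interface ParsedBlock {
  name: string;
  attrs: Record<string, unknown>;
  innerBlocks: ParsedBlock[];
  innerHTML: string;
}

// Every parser, whatever its implementation, exposes the same signature.
type BlockParser = (doc: string) => ParsedBlock[];

class ParserRegistry {
  private parsers: Record<string, BlockParser> = {};
  private active: string;

  constructor(defaultName: string, defaultParser: BlockParser) {
    this.parsers[defaultName] = defaultParser;
    this.active = defaultName;
  }

  // Make an alternative implementation available under a name.
  register(name: string, parser: BlockParser): void {
    this.parsers[name] = parser;
  }

  // Switch the implementation used by subsequent parse() calls.
  use(name: string): void {
    if (!this.parsers[name]) throw new Error("Unknown parser: " + name);
    this.active = name;
  }

  parse(doc: string): ParsedBlock[] {
    return this.parsers[this.active](doc);
  }
}

// A deliberately narrow parser: it lists the names of opening delimiters in
// document order and ignores attributes, content and nesting.
const namesOnlyParser: BlockParser = (doc) => {
  const opener = /<!--\s+wp:([a-z][a-z0-9_\/-]*)\s/g;
  const blocks: ParsedBlock[] = [];
  let m: RegExpExecArray | null;
  while ((m = opener.exec(doc)) !== null) {
    blocks.push({ name: m[1], attrs: {}, innerBlocks: [], innerHTML: "" });
  }
  return blocks;
};
```

A site that only needs to know which block types a post contains could register `namesOnlyParser` and call `use("names-only")`, trading completeness of the parse tree for less work per document.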

Furthermore, we have a duty to perform as well as possible. Our auto-generated parsers are known not to perform as well as other systems, a trade-off we made for clarity. Client-side parser performance has so far been satisfactory, but the server-side parser is a known bottleneck and may not be suitable for demanding posts. The server-side issue was mitigated by bypassing the PHP parser altogether when rendering front-end pages, in favor of a simpler parse, but the fact remains that obtaining a full parse tree of a post on the server should be performant and robust.

Finally, Gutenberg blocks are to reach well beyond WP-Admin and base WordPress installations. Namely, any third-party clients should be able to work with Gutenberg blocks, from the mobile apps to any community tool. This requirement implies that the language of Gutenberg will need to be parsed in diverse environments and runtimes, well beyond the scope of the base Gutenberg project.

Thus, there is the opportunity to establish the language of Gutenberg as a de facto standard for semantic content formatting. Perhaps choosing parsers will be similar to choosing database systems: most site owners will go with the default, such as MySQL, but competing software abounds for those with different needs.

Parser tooling

Closes #6994.

Before we look at parsing itself, we need to be able to develop, validate and compare implementations.

Develop. Any party should be able to understand the specified language. For this, it should be clear that the PEG grammar is intended to be used as a reference. Tasks:

Validate. Developers should be provided official tools that can, as best as possible, test for correctness of parsers. Tasks:
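As an illustration of the validation idea (not one of the tasks tracked here), a correctness check could run a candidate parser and a reference parser over the same fixture documents and report where their output differs. The TypeScript sketch below is hypothetical: `checkConformance`, `canonical` and the `Mismatch` shape are invented names, and treating the reference parser's output as ground truth is an assumption.

```typescript
// Hypothetical parse-tree node; field names are assumptions.
interface ParsedBlock {
  name: string;
  attrs: Record<string, unknown>;
  innerBlocks: ParsedBlock[];
  innerHTML: string;
}

type BlockParser = (doc: string) => ParsedBlock[];

interface Mismatch {
  fixture: string;   // fixture name
  expected: string;  // canonical output of the reference parser
  actual: string;    // canonical output of the candidate, or the error it threw
}

// Serialize with sorted object keys so that key order alone never
// counts as a difference between two parse trees.
function canonical(value: unknown): string {
  if (Array.isArray(value)) {
    return "[" + value.map(canonical).join(",") + "]";
  }
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const keys = Object.keys(obj).sort();
    return "{" + keys.map((k) => JSON.stringify(k) + ":" + canonical(obj[k])).join(",") + "}";
  }
  return String(JSON.stringify(value));
}

// Run both parsers over every fixture and collect the disagreements.
function checkConformance(
  reference: BlockParser,
  candidate: BlockParser,
  fixtures: Record<string, string>
): Mismatch[] {
  const mismatches: Mismatch[] = [];
  for (const name of Object.keys(fixtures)) {
    const expected = canonical(reference(fixtures[name]));
    let actual: string;
    try {
      actual = canonical(candidate(fixtures[name]));
    } catch (e) {
      actual = "threw: " + String(e); // a crash is reported as a mismatch
    }
    if (actual !== expected) mismatches.push({ fixture: name, expected, actual });
  }
  return mismatches;
}
```

An empty result means the candidate agreed with the reference on every fixture supplied; it says nothing about documents outside that set.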

Compare. A parser doesn't just need to work, it likely needs to perform well. To help inform development (and instigate a healthy sense of competition?), it should be possible to compare parser performance side-by-side or multilaterally. Here, performance encompasses both space and time. Tasks:

Ultimately, all tasks for the above paragraphs should converge towards providing:

Related open issues:

Exploring new parsers

The past months have seen fascinating and encouraging progress on competing parsers.

  • @Hywan developed a Rust parser that compiles to a number of different targets, notably those that core Gutenberg cares about: the server, in the form of a PHP extension; and the client, via a WebAssembly binary with an ASM.js fallback. Through use of the Hoa compiler, native PHP bindings are also available, including an experimental PCRE parser.
  • @dmsnell wrote a hand-coded parser (Parser: Propose new hand-coded parser #8083), working with the PHP interpreter to seek optimal performance.
  • @pento asked "if we start with the spec grammar, how much can we gain from fine-tuning it?" in Improve Parser Performance #8044.

These experiments have been benchmarked using gutenberg-document-library, and results have been extremely encouraging.
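For readers who want to reproduce this kind of side-by-side comparison, the TypeScript sketch below times several parsers over the same set of documents. It is a hypothetical harness, not the tooling used for the benchmarks mentioned above: `benchmark` and `BenchResult` are invented names, the documents are supplied by the caller, `Date.now()` gives only millisecond resolution, and memory use is not measured at all.

```typescript
// Any function that takes a document string; the return value is ignored here.
type BlockParser = (doc: string) => unknown;

interface BenchResult {
  parser: string;    // name of the parser under test
  document: string;  // name of the input document
  runs: number;      // number of timed iterations
  meanMs: number;    // mean wall-clock time per parse, in milliseconds
}

function benchmark(
  parsers: Record<string, BlockParser>,
  documents: Record<string, string>,
  runs: number = 20
): BenchResult[] {
  const results: BenchResult[] = [];
  for (const parserName of Object.keys(parsers)) {
    const parse = parsers[parserName];
    for (const docName of Object.keys(documents)) {
      const doc = documents[docName];

      parse(doc); // one untimed warm-up run

      const start = Date.now();
      for (let i = 0; i < runs; i++) parse(doc);
      const elapsed = Date.now() - start;

      results.push({
        parser: parserName,
        document: docName,
        runs,
        meanMs: elapsed / runs,
      });
    }
  }
  return results;
}
```

Each parser sees every document, so the result has one row per (parser, document) pair and can be laid out as a table with parsers as columns.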

to do: insert benchmark tables for CPU and memory

Labels

  • Framework: Issues related to broader framework topics, especially as it relates to javascript
  • [Feature] Parsing: Related to efforts to improving the parsing of a string of data and converting it into a different f
  • [Type] Overview: Comprehensive, high level view of an area of focus often with multiple tracking issues
  • [Type] Task: Issues or PRs that have been broken down into an individual action to take
