Big architecture rework #994

oliverklee · 2021-04-10T09:45:55Z

In addition to the small architectural problems (CssInliner is too big and does too much, arrays and stdClass instead of dedicated classes), I currently see two big problems I'd like to solve:

The HtmlProcessor family of classes (including CssInliner) currently mix two responsibilities:
1.1. changing the HTML (which they should keep doing)
1.2. parsing and storing the HTML (which I'd like to move to a dedicated class (or family of classes)
There currently is no way Emogrifier classes can be dependency-injected (e.g., by the Symfony container), which also means that there is no way to mock them in unit tests. We probably need a class that can emogrify multiple HTML documents per instance to make this usable in real-world scenarios.

The first set of problems is solvable without any (major) breaking changes, I think: We need to move stuff out to other classes. This also makes solving the second big problem easier to solve as the code then is easier understand and modify (and the risk of breaking things is reduced if we have smaller classes with clear contracts).

Maybe we can have a (new) Emogrifier (naming still TBD) class that wraps CssInliner etc. using the Facade design pattern. We still need to determine a new, clean interface on how the new class should handle multiple processors while still being able to process multiple HTML documents. Maybe we can have another intermediate class returns by the new Emogrifier methods that holds a reference to the artifacts for a single HTML document.

@JakeQZ, @jjriv, @zoliszabo What do you think?

The text was updated successfully, but these errors were encountered:

JakeQZ · 2021-04-10T22:57:00Z

I'll look into the Facade design pattern and have a think...

On the topic of CssInliner being too big, this was partly touched upon in #912. The suggestion there is that we can move the CSS parsing to a separate class, which will ultimately become just a wrapper for the third-party library (and may then be removed or 'unwrapped'). I am thinking of having a class named Css (or ParsedCss, i.e. based on what its objects represent, rather than what it does, so not CssParser), whose constructor takes a string of CSS, and which has methods to retrieve the parsed content. Initially, the main method would be (say) getRuleBlocks returning arrays in the form used internally by CssInliner (i.e. elements with media, selector and declarationsBlock, etc.). There would probably need to be one or two other methods to retrieve things like @charset and @import which are passed on by CssInliner but have no property declarations to inline.

JakeQZ · 2021-04-13T13:36:48Z

I think it certainly makes sense as a first step to move the DOMDocument stuff out of AbstractHtmlProcessor and into a new class (say HtmlDocument), an instance of which would be a property of AbstractHtmlProcessor (at least for the time being). The relevant methods of AbstractHtmlProcessor (including the fromXyz creation methods) would then chain on to their equivalents in HtmlDocument.

Ultimately I think AbstractHtmlProcessor shouldn't be storing an HtmlDocument (permanently) as a property, but rather have a method processHtmlDocument which takes the HtmlDocument to work on as a parameter (perhaps also defined as an interface HtmlProcessor). This could perhaps be stored temporarily (e.g. as $currentHtmlDocument) if there are several methods that need access to it, to avoid having to pass it around internally, though the existence of such a property might be specific to the implementing class.

For backwards-compatibility, there could be a 'default' HtmlDocument that could be set (i.e. the property from the initial first step above is retained for backwards-compatibility, but with the 'new' way of doing things, this would always be null).

In the case of CssInliner, for example, its properties should be the CSS, and any other parameters that control the way the CSS is inlined. Likewise, the other classes would have properties as required that control the way their processHtmlDocument behaves.

Thus, for CssInliner we might, for backwards compatibility, have inlineCss implemented as follows:

    public function inlineCss(string $css = ''): self
    {
        if (!isset($this->defaultHtmlDocument)) {
            throw new \BadMethodCallException('A default HTML document has not been set');
        }
        $this->setCss($css);
        $this->processHtmlDocument($this->defaultHtmlDocument);
        return $this;
    }

...where processHtmlDocument now does what inlineCss used to do, using the CSS that has been set as a property.

I haven't thought through how this would work for the classes that currently have multiple processing methods. An advantage of a single processing method following a defined interface is allowing HTML processors to be chained into an 'array' - which could have a class, also implementing the same interface, to manage the chaining.

A likely use case would be to set up a single HTML processor (e.g. either a CssInliner or an array of HTML processors including a CssInliner) to process multiple HTML documents. E.g. some kind mail merge sending multiple emails with different content but using the same styling. It will be more efficient for CssInliner to have processed the CSS just once, ready to apply it to each HTML document that is thrown at it.

Some kind of wrapper (or facade?) might involve the creation of the/each HtmlDocument from the raw HTML, running it through the processor, then obtaining the raw processed HTML with render or renderBodyContent.

JakeQZ · 2021-04-13T15:15:51Z

Some kind of wrapper (or facade?) might involve the creation of the/each HtmlDocument from the raw HTML, running it through the processor, then obtaining the raw processed HTML with render or renderBodyContent.

HtmlDocument could have a method

    public function processWith(HtmlProcessor $processor): self
    {
        $processor->processHtmlDocument($this);
        return $this;
    }

which would allow method chaining such as

    $visualHtml = HtmlDocument::fromHtml($html)->processWith(new CssInliner($css))->render();

or

    $visualHtml = HtmlDocument::fromHtml($html)
        ->processWith(new HtmlProcessorArray([
            new CssInliner($css),
            new HtmlPruner(),
            new CssToAttributeConverter(),
        ]))
        ->render();

Also remove a redundant cache for the parsed CSS - the CSS is only ever parsed once per `CssInliner` instance. Part of #734 and #544. Relates to #994.

Part of #734 and #544. Relates to #994.

oliverklee · 2021-04-16T16:20:53Z

To make the initial Emogrifier-provided class injectable as a service (and hence mockable in tests), the wrapper could provide an API like this:

// `$emogrifier` can also get injected instead of getting instantiated.
$emogrifier = new Emogrifier();

// minimal example without additional CSS
$result = $emogrifier->inlineCss($html)->render();

// only render the body; available at all places
$result = $emogrifier->inlineCss($html)->renderBody();

// minimal example with additional CSS
$result = $emogrifier->inlineCss($html, $css)->render();

// big example with multiple HTML processors
$result = $emogrifier->inlineCss($html, $css)->disabledInlineStyleAttribuesParsing()->someOtherCssInlinerMethod()
  ->removeClassAttributeFromElements()
  ->someOtherMethodFromHtmlPruner()
  ->render();

// Then do the same with some other HTML.
$result2 = $emogrifier->inlineCss($html2, $css)->render();

// Split up across multiple variables
$intermediateResult = $emogrifier->inlineCss($html, $css)->disabledInlineStyleAttribuesParsing()->someOtherCssInlinerMethod()
  ->removeClassAttributeFromElements()
$result = $intermediateResult->someOtherMethodFromHtmlPruner()
  ->render();

(I'm thinking of some clever way of passing around results and executing the HTML processors lazily.)

What do you think, @JakeQZ?

Part of #734 and #544. Relates to #994.

JakeQZ · 2021-05-01T18:28:35Z

To make the initial Emogrifier-provided class injectable as a service (and hence mockable in tests), the wrapper could provide an API like this:

I think this wouldn't be mutually exclusive with other ideas in this thread.

Presubly these methods (apart from render) would return $this;.

I imagine internally the Emogrifier class would have properties which are the current HTML document, the current CSS, and any other settings which control the behaviour of the HTML processors. These might be indirect properties which belong to the HTML processors, with Emogrifier having the containing HTML processors as properties.

// Then do the same with some other HTML.
$result2 = $emogrifier->inlineCss($html2, $css)->render();

If the CSS was the same as the first time, we wouldn't want to re-parse it (so would perhaps re-use the previous CssInliner instance). Would this 'remember' the other methods that were called in the preceding 'big' example, or would these all need to be repeated?

inlineCss($html, $css)->disabledInlineStyleAttribuesParsing()

This would need lazy processing. I guess the class could internally build up a list of processing operations to perform, and only execute them when render is invoked.

bytestream · 2021-05-24T11:53:20Z

To weigh in from the perspective of #828. I'm not sure how far you want to go with respect to decoupling DOMDocument usage... The biggest benefit of the masterminds/html5 parser is that it's pretty seemless to integrate because it still uses DOMDocument.

If you wanted to fully decouple (OO), then what Jake suggested of using an interface would be the best way to go. That interface would have to implement all the currently used methods like getElementsByTagName(), xpath->query(), saveHtml(). A benefit of going this way I guess is that one could implement the interface themselves and you, as libraries authors, wouldn't have to worry about maintaining that integration.

Happy to help if an approach is agreed upon.

oliverklee · 2021-05-25T08:15:39Z

@bytestream Then please let's start with the interface and the proxy implementing it for the DOM class. Would you be willing to tackle this task?

JakeQZ · 2021-05-25T11:33:00Z

If you wanted to fully decouple (OO), then what Jake suggested of using an interface would be the best way to go.

To clarify, I was suggesting perhaps introducing an HtmlProcessor interface for AbstractHtmlProcessor after everything DOMDocument-related has been moved to another class. What I meant by decoupling was simply the moving of everything DOMDocument-related from AbstractHtmlProcessor to a new class, instances of which would then be provided to AbstractHtmlProcessor for processing.

That interface would have to implement all the currently used methods like getElementsByTagName(), xpath->query(), saveHtml().

Or provide higher-level methods such as getBodyElement which in turn use the DOMDocument methods.

The biggest benefit of the masterminds/html5 parser is that it's pretty seemless to integrate because it still uses DOMDocument.

Actually I don't think it will be practical to conceal the DOMDocument or allow the underlying implementation not to use it. Most methods will be returning DOMElements or DOMNodes, which in turn have references to the 'owner document'.

oliverklee added the discussion label Apr 10, 2021

oliverklee added this to the Backlog milestone Apr 10, 2021

oliverklee self-assigned this Apr 10, 2021

JakeQZ added a commit that referenced this issue Apr 13, 2021

[CLEANUP] Move CSS parsing to a separate class

e1cc71d

Also remove a redundant cache for the parsed CSS - the CSS is only ever parsed once per `CssInliner` instance. Part of #734 and #544. Relates to #994.

JakeQZ mentioned this issue Apr 13, 2021

[CLEANUP] Move CSS parsing to a separate class #1014

Merged

JakeQZ added a commit that referenced this issue Apr 15, 2021

[CLEANUP] Move CSS parsing to a separate class

976e226

Part of #734 and #544. Relates to #994.

JakeQZ added a commit that referenced this issue Apr 15, 2021

[CLEANUP] Move CSS parsing to a separate class

373b6b7

Part of #734 and #544. Relates to #994.

JakeQZ added a commit that referenced this issue Apr 15, 2021

[CLEANUP] Move CSS parsing to a separate class

7ece4ff

Part of #734 and #544. Relates to #994.

JakeQZ added a commit that referenced this issue Apr 16, 2021

[CLEANUP] Move CSS parsing to a separate class

5881abc

Part of #734 and #544. Relates to #994.

JakeQZ added a commit that referenced this issue Apr 16, 2021

[CLEANUP] Move CSS parsing to a separate class

cfd10c5

Part of #734 and #544. Relates to #994.

JakeQZ added a commit that referenced this issue Apr 17, 2021

[CLEANUP] Move CSS parsing to a separate class

e3e33e6

Part of #734 and #544. Relates to #994.

oliverklee mentioned this issue Apr 17, 2021

Pass CssInliner as a dependency #825

Open

JakeQZ added a commit that referenced this issue Apr 17, 2021

[CLEANUP] Move CSS parsing to a separate class

bda24ac

Part of #734 and #544. Relates to #994.

oliverklee pushed a commit that referenced this issue Apr 18, 2021

[CLEANUP] Move CSS parsing to a separate class (#1014)

c8f8950

Part of #734 and #544. Relates to #994.

JakeQZ mentioned this issue Apr 28, 2021

HTML5 parsed as HTML4 #1051

Open

JakeQZ mentioned this issue Jun 13, 2021

[CLEANUP] Refactor into a new StyleRule class #1082

Merged

JakeQZ mentioned this issue Dec 16, 2021

Add more selector edge case tests #1135

Open

JakeQZ mentioned this issue Oct 13, 2022

Investigate the tbela99/css parser #1159

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Big architecture rework #994

Big architecture rework #994

oliverklee commented Apr 10, 2021

JakeQZ commented Apr 10, 2021 •

edited

Loading

JakeQZ commented Apr 13, 2021

JakeQZ commented Apr 13, 2021

oliverklee commented Apr 16, 2021 •

edited

Loading

JakeQZ commented May 1, 2021

bytestream commented May 24, 2021

oliverklee commented May 25, 2021

JakeQZ commented May 25, 2021 •

edited

Loading

Big architecture rework #994

Big architecture rework #994

Comments

oliverklee commented Apr 10, 2021

JakeQZ commented Apr 10, 2021 • edited Loading

JakeQZ commented Apr 13, 2021

JakeQZ commented Apr 13, 2021

oliverklee commented Apr 16, 2021 • edited Loading

JakeQZ commented May 1, 2021

bytestream commented May 24, 2021

oliverklee commented May 25, 2021

JakeQZ commented May 25, 2021 • edited Loading

JakeQZ commented Apr 10, 2021 •

edited

Loading

oliverklee commented Apr 16, 2021 •

edited

Loading

JakeQZ commented May 25, 2021 •

edited

Loading