Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Big architecture rework #994

Open
oliverklee opened this issue Apr 10, 2021 · 8 comments
Open

Big architecture rework #994

oliverklee opened this issue Apr 10, 2021 · 8 comments
Assignees
Milestone

Comments

@oliverklee
Copy link
Contributor

In addition to the small architectural problems (CssInliner is too big and does too much, arrays and stdClass instead of dedicated classes), I currently see two big problems I'd like to solve:

  1. The HtmlProcessor family of classes (including CssInliner) currently mix two responsibilities:
    1.1. changing the HTML (which they should keep doing)
    1.2. parsing and storing the HTML (which I'd like to move to a dedicated class (or family of classes)
  2. There currently is no way Emogrifier classes can be dependency-injected (e.g., by the Symfony container), which also means that there is no way to mock them in unit tests. We probably need a class that can emogrify multiple HTML documents per instance to make this usable in real-world scenarios.

The first set of problems is solvable without any (major) breaking changes, I think: We need to move stuff out to other classes. This also makes solving the second big problem easier to solve as the code then is easier understand and modify (and the risk of breaking things is reduced if we have smaller classes with clear contracts).

Maybe we can have a (new) Emogrifier (naming still TBD) class that wraps CssInliner etc. using the Facade design pattern. We still need to determine a new, clean interface on how the new class should handle multiple processors while still being able to process multiple HTML documents. Maybe we can have another intermediate class returns by the new Emogrifier methods that holds a reference to the artifacts for a single HTML document.

@JakeQZ, @jjriv, @zoliszabo What do you think?

@oliverklee oliverklee added this to the Backlog milestone Apr 10, 2021
@oliverklee oliverklee self-assigned this Apr 10, 2021
@JakeQZ
Copy link
Contributor

JakeQZ commented Apr 10, 2021

I'll look into the Facade design pattern and have a think...

On the topic of CssInliner being too big, this was partly touched upon in #912. The suggestion there is that we can move the CSS parsing to a separate class, which will ultimately become just a wrapper for the third-party library (and may then be removed or 'unwrapped'). I am thinking of having a class named Css (or ParsedCss, i.e. based on what its objects represent, rather than what it does, so not CssParser), whose constructor takes a string of CSS, and which has methods to retrieve the parsed content. Initially, the main method would be (say) getRuleBlocks returning arrays in the form used internally by CssInliner (i.e. elements with media, selector and declarationsBlock, etc.). There would probably need to be one or two other methods to retrieve things like @charset and @import which are passed on by CssInliner but have no property declarations to inline.

@JakeQZ
Copy link
Contributor

JakeQZ commented Apr 13, 2021

I think it certainly makes sense as a first step to move the DOMDocument stuff out of AbstractHtmlProcessor and into a new class (say HtmlDocument), an instance of which would be a property of AbstractHtmlProcessor (at least for the time being). The relevant methods of AbstractHtmlProcessor (including the fromXyz creation methods) would then chain on to their equivalents in HtmlDocument.

Ultimately I think AbstractHtmlProcessor shouldn't be storing an HtmlDocument (permanently) as a property, but rather have a method processHtmlDocument which takes the HtmlDocument to work on as a parameter (perhaps also defined as an interface HtmlProcessor). This could perhaps be stored temporarily (e.g. as $currentHtmlDocument) if there are several methods that need access to it, to avoid having to pass it around internally, though the existence of such a property might be specific to the implementing class.

For backwards-compatibility, there could be a 'default' HtmlDocument that could be set (i.e. the property from the initial first step above is retained for backwards-compatibility, but with the 'new' way of doing things, this would always be null).

In the case of CssInliner, for example, its properties should be the CSS, and any other parameters that control the way the CSS is inlined. Likewise, the other classes would have properties as required that control the way their processHtmlDocument behaves.

Thus, for CssInliner we might, for backwards compatibility, have inlineCss implemented as follows:

    public function inlineCss(string $css = ''): self
    {
        if (!isset($this->defaultHtmlDocument)) {
            throw new \BadMethodCallException('A default HTML document has not been set');
        }
        $this->setCss($css);
        $this->processHtmlDocument($this->defaultHtmlDocument);
        return $this;
    }

...where processHtmlDocument now does what inlineCss used to do, using the CSS that has been set as a property.

I haven't thought through how this would work for the classes that currently have multiple processing methods. An advantage of a single processing method following a defined interface is allowing HTML processors to be chained into an 'array' - which could have a class, also implementing the same interface, to manage the chaining.

A likely use case would be to set up a single HTML processor (e.g. either a CssInliner or an array of HTML processors including a CssInliner) to process multiple HTML documents. E.g. some kind mail merge sending multiple emails with different content but using the same styling. It will be more efficient for CssInliner to have processed the CSS just once, ready to apply it to each HTML document that is thrown at it.

Some kind of wrapper (or facade?) might involve the creation of the/each HtmlDocument from the raw HTML, running it through the processor, then obtaining the raw processed HTML with render or renderBodyContent.

@JakeQZ
Copy link
Contributor

JakeQZ commented Apr 13, 2021

Some kind of wrapper (or facade?) might involve the creation of the/each HtmlDocument from the raw HTML, running it through the processor, then obtaining the raw processed HTML with render or renderBodyContent.

HtmlDocument could have a method

    public function processWith(HtmlProcessor $processor): self
    {
        $processor->processHtmlDocument($this);
        return $this;
    }

which would allow method chaining such as

    $visualHtml = HtmlDocument::fromHtml($html)->processWith(new CssInliner($css))->render();

or

    $visualHtml = HtmlDocument::fromHtml($html)
        ->processWith(new HtmlProcessorArray([
            new CssInliner($css),
            new HtmlPruner(),
            new CssToAttributeConverter(),
        ]))
        ->render();

JakeQZ added a commit that referenced this issue Apr 13, 2021
Also remove a redundant cache for the parsed CSS - the CSS is only ever parsed
once per `CssInliner` instance.

Part of #734 and #544.  Relates to #994.
JakeQZ added a commit that referenced this issue Apr 15, 2021
JakeQZ added a commit that referenced this issue Apr 15, 2021
JakeQZ added a commit that referenced this issue Apr 15, 2021
@oliverklee
Copy link
Contributor Author

oliverklee commented Apr 16, 2021

To make the initial Emogrifier-provided class injectable as a service (and hence mockable in tests), the wrapper could provide an API like this:

// `$emogrifier` can also get injected instead of getting instantiated.
$emogrifier = new Emogrifier();

// minimal example without additional CSS
$result = $emogrifier->inlineCss($html)->render();

// only render the body; available at all places
$result = $emogrifier->inlineCss($html)->renderBody();

// minimal example with additional CSS
$result = $emogrifier->inlineCss($html, $css)->render();

// big example with multiple HTML processors
$result = $emogrifier->inlineCss($html, $css)->disabledInlineStyleAttribuesParsing()->someOtherCssInlinerMethod()
  ->removeClassAttributeFromElements()
  ->someOtherMethodFromHtmlPruner()
  ->render();

// Then do the same with some other HTML.
$result2 = $emogrifier->inlineCss($html2, $css)->render();

// Split up across multiple variables
$intermediateResult = $emogrifier->inlineCss($html, $css)->disabledInlineStyleAttribuesParsing()->someOtherCssInlinerMethod()
  ->removeClassAttributeFromElements()
$result = $intermediateResult->someOtherMethodFromHtmlPruner()
  ->render();

(I'm thinking of some clever way of passing around results and executing the HTML processors lazily.)

What do you think, @JakeQZ?

JakeQZ added a commit that referenced this issue Apr 16, 2021
JakeQZ added a commit that referenced this issue Apr 16, 2021
JakeQZ added a commit that referenced this issue Apr 17, 2021
JakeQZ added a commit that referenced this issue Apr 17, 2021
oliverklee pushed a commit that referenced this issue Apr 18, 2021
@JakeQZ
Copy link
Contributor

JakeQZ commented May 1, 2021

To make the initial Emogrifier-provided class injectable as a service (and hence mockable in tests), the wrapper could provide an API like this:

I think this wouldn't be mutually exclusive with other ideas in this thread.

Presubly these methods (apart from render) would return $this;.

I imagine internally the Emogrifier class would have properties which are the current HTML document, the current CSS, and any other settings which control the behaviour of the HTML processors. These might be indirect properties which belong to the HTML processors, with Emogrifier having the containing HTML processors as properties.

// Then do the same with some other HTML.
$result2 = $emogrifier->inlineCss($html2, $css)->render();

If the CSS was the same as the first time, we wouldn't want to re-parse it (so would perhaps re-use the previous CssInliner instance). Would this 'remember' the other methods that were called in the preceding 'big' example, or would these all need to be repeated?

inlineCss($html, $css)->disabledInlineStyleAttribuesParsing()

This would need lazy processing. I guess the class could internally build up a list of processing operations to perform, and only execute them when render is invoked.

@bytestream
Copy link

To weigh in from the perspective of #828. I'm not sure how far you want to go with respect to decoupling DOMDocument usage... The biggest benefit of the masterminds/html5 parser is that it's pretty seemless to integrate because it still uses DOMDocument.

If you wanted to fully decouple (OO), then what Jake suggested of using an interface would be the best way to go. That interface would have to implement all the currently used methods like getElementsByTagName(), xpath->query(), saveHtml(). A benefit of going this way I guess is that one could implement the interface themselves and you, as libraries authors, wouldn't have to worry about maintaining that integration.

Happy to help if an approach is agreed upon.

@oliverklee
Copy link
Contributor Author

@bytestream Then please let's start with the interface and the proxy implementing it for the DOM class. Would you be willing to tackle this task?

@JakeQZ
Copy link
Contributor

JakeQZ commented May 25, 2021

If you wanted to fully decouple (OO), then what Jake suggested of using an interface would be the best way to go.

To clarify, I was suggesting perhaps introducing an HtmlProcessor interface for AbstractHtmlProcessor after everything DOMDocument-related has been moved to another class. What I meant by decoupling was simply the moving of everything DOMDocument-related from AbstractHtmlProcessor to a new class, instances of which would then be provided to AbstractHtmlProcessor for processing.

That interface would have to implement all the currently used methods like getElementsByTagName(), xpath->query(), saveHtml().

Or provide higher-level methods such as getBodyElement which in turn use the DOMDocument methods.

The biggest benefit of the masterminds/html5 parser is that it's pretty seemless to integrate because it still uses DOMDocument.

Actually I don't think it will be practical to conceal the DOMDocument or allow the underlying implementation not to use it. Most methods will be returning DOMElements or DOMNodes, which in turn have references to the 'owner document'.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants