Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

reading and writing of HTML (compatible) to a simple ADT format. #1680

Merged
merged 23 commits into from
Oct 21, 2022

Conversation

jurgenvinju
Copy link
Member

  • first version, not finished
  • HTML parser produces AST nodes now instead of node
  • perfecting the HTML reader
  • fixed static error in m3 Core
  • added yielding of HTMLElement as XHTML output, both string and file
  • set copyright year
  • fixed some more documentation

@jurgenvinju
Copy link
Member Author

This is a simple design made to facilitate demonstrating the use of HTML and JS with Rascal:

  • all HTML can be parsed to the `HTMLElement' ADT; a liberal parser is used to allow for scraping "real" internet documents
  • all HTMLElements can be yielded back to HTML strings using the XML DOM of w3c (used to be xerces). This is also a liberal feature (incomplete documents are allowed), but it uses a strict XHTML notation as output (all brackets balanced and stuff).
  • the code is supposed to handle escaping and de-escaping by itself, but that is not tested yet.

@codecov
Copy link

codecov bot commented Oct 18, 2022

Codecov Report

Merging #1680 (7620804) into main (0a09f35) will increase coverage by 0%.
The diff coverage is 69%.

@@           Coverage Diff           @@
##              main   #1680   +/-   ##
=======================================
  Coverage       48%     48%           
- Complexity    5953    5967   +14     
=======================================
  Files          672     672           
  Lines        58463   58508   +45     
  Branches      8514    8522    +8     
=======================================
+ Hits         28355   28411   +56     
+ Misses       27961   27942   -19     
- Partials      2147    2155    +8     
Impacted Files Coverage Δ
src/org/rascalmpl/uri/URIResolverRegistry.java 49% <0%> (-1%) ⬇️
src/org/rascalmpl/library/lang/html/IO.java 67% <71%> (+67%) ⬆️
src/org/rascalmpl/uri/FileTree.java 59% <0%> (-6%) ⬇️
...calmpl/interpreter/result/SetOrRelationResult.java 82% <0%> (-6%) ⬇️
...g/rascalmpl/interpreter/result/DateTimeResult.java 15% <0%> (-1%) ⬇️
src/org/rascalmpl/interpreter/result/Result.java 13% <0%> (-1%) ⬇️
src/org/rascalmpl/library/Prelude.java 42% <0%> (-1%) ⬇️
src/org/rascalmpl/ast/Expression.java 25% <0%> (+<1%) ⬆️
...rg/rascalmpl/util/ExpiringFunctionResultCache.java 76% <0%> (+5%) ⬆️
... and 1 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

@jurgenvinju
Copy link
Member Author

yes, escaping works:

rascal>writeHTMLString(b([text("aap & noot")]))
str: "\<html\>\n    \<b\>aap &amp; noot\</b\>\n\</html\>\n"
rascal>readHTML
readHTMLFile     readHTMLString   
rascal>readHTMLString("\<html\>\n    \<b\>aap &amp; noot\</b\>\n\</html\>\n")
HTMLElement: html([
    head([]),
    body([b([text("aap & noot")])])
  ])
  

@jurgenvinju
Copy link
Member Author

With this in the library the tutor can be used to demonstrate visuals without an apriori dependency on salix, also basic visuals like tables for relations can be part of the standard library. This should enhance the usability of the "vanilla" Rascal repl quite a bit, and achieve the kind of user interactions we used to have in Eclipse.

For now this initial version seems finished and I'm ready to merge.

@DavyLandman
Copy link
Member

DavyLandman commented Oct 19, 2022

How does this relate to lang::html5::DOM? Can we remove that?

And maybe similarly, can you explain the relation to lang::html::IO?

Also, we need tests. (I think before adding new libraries, we need to try and add a tests).

@@ -117,7 +117,7 @@ M3 createM3FromDirectory(loc project, bool errorRecovery = false, bool includeJa
registerProject(project, result);

if (includeJarModels) {
results = composeJavaM3(project, {result, *{createM3FromJar(j, classPaths) | j <- classPaths}});
results = composeJavaM3(project, {result, *{createM3FromJar(j, classPath=classPaths) | j <- classPaths}});
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so, this is not related to this branch right? just accidental changes?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. My bad. Saw it and couldn't leave it there to rot but also didn't have the energy to checkout another branch.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay, we've all been there.

@@ -428,13 +425,17 @@ str nodeToString(str n, set[HTML5Attr] attrs, list[value] kids) {
return s;
}

public HTML5Node example
private HTML5Node example
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this some kind of inline documentation?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know yet. Have to ask Tijs. I think it's test code

@@ -87,29 +209,66 @@ private void storeAttributes(MutableAttributeSet set) {
for (Enumeration<?> names = set.getAttributeNames(); names.hasMoreElements(); ) {
Object label = names.nextElement();
Object value = set.getAttribute(label);
attributes.peek().put(label.toString(), factory.string(value.toString()));
if (!"_implied_".equals(label)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add some comments on what is happening here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The editor kit puts implied false or true on every node but that's a leaky internal implementation detail not a DOM aspect imho. We can use it to filter implied nodes but I didn't see a need for that. The implied nodes complete the dom semantically. Like a normal form.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the "editor kit"?

I meant, this magic if should have an inline comment to explain the reasoning. For future you ;)

/**
* Produces a HTML string output from an HTMLElement AST.
*
* Why go through all the trouble of building a DOM? The only reason is compliance.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does introduce scaling issues, so we shouldn't be generating big HTML trees :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why? Big html trees are perfectly fine with this DOM tree. Building them is linear time and space.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

memory consumption. This is a thing with large xml & json files, but I'm guessing nobody is gonna build such a big html.

* Escapes, encodings, etc. all are maintained by these classes from org.w3c.dom.
*/
public IString writeHTMLString(IConstructor cons) {
try {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can maybe share more code with the writeHTMLFile call?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 thanks for the suggestion. Especially since this code will become twice as long when we add the right XHTML dtd and wrap internal nodes with proper header and body.

public IValue readHTMLString(IString string) {
if (string.length() == 0) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should really think about adding isEmpty to vallang IString, especially since length is not always a o(1) operation.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well it's log(n) max with the current balanced tree implementation, so you'd have to call it many times on a deep tree before you would notice any improvement. It's doable though to add isEmpty with a default false implementation in the interface and only return true for the maximally shared empty constant.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also length is cached in internal nodes I believe. So it's probably even in O(1) already.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a default method should be: return length() == 0; and then subclasses can override it with true or false.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's Have a look at IStringImplementation first. It's a private interface used by all implementation classes of IString, all of them but one are non-empty I believe. And that one is empty.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm just saying the default interface method should work for future overloads as well.

@jurgenvinju
Copy link
Member Author

jurgenvinju commented Oct 19, 2022

How does this relate to lang::html5::DOM? Can we remove that?

It's a different way of representing HTML that has its own benefits and pitfalls. What I didn't like is that it's match patterns are different from its constructor patterns. This AST version does have that. Also this AST version imitates the way we model programming languages and DSLs so I thought that more appropriate for examples and demo's. Finally the unparser including all of the intricate escaping rules for XHTML is reused here which is probably for the best.

@jurgenvinju
Copy link
Member Author

jurgenvinju commented Oct 19, 2022

And maybe similarly, can you explain the relation to lang::html::IO?

This pure data AST module contains only the data declarations that the IO module uses when reading and writing the HTML streams.

Also, we need tests. (I think before adding new libraries, we need to try and add a tests).

Yes! Very much agreed. Working on that already, and also nice documentation that also functions as tests. I've a scraper from the CBS cite and planning a static site generator for the other direction. Tutor will be extended with browser screenshots for the testing and the visual docs.

Might have to merge this one though before I can write the tests properly with the tutor. Now that it's a separate project I'm running into cycles now and then.

@DavyLandman
Copy link
Member

It's a different way of representing HTML that has its own benefits and pitfalls. What I didn't like is that it's match patterns are different from its constructor patterns. This AST version does have that. Also this AST version imitates the way we model programming languages and DSLs so I thought that more appropriate for examples and demo's. Finally the unparser including all of the intricate escaping rules for XHTML is reused here which is probably for the best.

do we need both? can we not cleanup? since they appear very similar from a user perspective.

@jurgenvinju jurgenvinju marked this pull request as draft October 20, 2022 17:00
@jurgenvinju
Copy link
Member Author

There is still some options to make this faster and even better tested, but for now I'm done and I want to use it and document it with the Tutor.

@jurgenvinju jurgenvinju marked this pull request as ready for review October 21, 2022 10:53
pom.xml Outdated
<artifactId>jdom2</artifactId>
<version>2.0.6</version>
<artifactId>jdom</artifactId>
<version>2.0.2</version>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why go backwards?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

accident! on grand central the example was 2.0.2

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

jdom2!! not jdom. that's why the version was off too

try (OutputStream out = URIResolverRegistry.getInstance().getOutputStream(file, false)) {
Document doc = createHTMLDocument(cons);
doc = doc.outputSettings(createOutputSettings(charset.getValue(), escapeMode.getName(), outline.getValue(), prettyPrint.getValue(), indentAmount.intValue(), maxPaddingWidth.intValue(), syntax.getName()));
out.write(doc.outerHtml().getBytes("UTF-8"));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can't we ask the registry for a Writer or something? so that we don't have to do .getBytes on the string?

* Escapes, encodings, etc. all are maintained by these classes from org.jdom.
*/
public IString writeHTMLString(IConstructor cons, IString charset, IConstructor escapeMode, IBool outline, IBool prettyPrint, IInteger indentAmount, IInteger maxPaddingWidth, IConstructor syntax ) {
try (StringWriter out = new StringWriter()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like out isn't used?

@jurgenvinju jurgenvinju merged commit 061a3ef into main Oct 21, 2022
@jurgenvinju jurgenvinju deleted the html-ast branch October 21, 2022 12:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants