9 changes: 8 additions & 1 deletion CHANGES.md
@@ -18,6 +18,9 @@
* Build: added CI coverage for JDK 25 [#2403](https://github.com/jhy/jsoup/pull/2403)
* Build: added a CI fuzzer for contextual fragment parsing (in addition to existing full body HTML and XML fuzzers). [oss-fuzz #14041](https://github.com/google/oss-fuzz/pull/14041)

### Changes
* Set a removal schedule of jsoup 1.24.1 for previously deprecated APIs.

### Bug Fixes
* Previously cached child Elements of an Element were not correctly invalidated in `Node#replaceWith(Node)`, which could lead to incorrect results when subsequently calling `Element#children()`. [#2391](https://github.com/jhy/jsoup/issues/2391)
* Attribute selector values are now compared literally without trimming. Previously, jsoup trimmed whitespace from selector values and from element attribute values, which could cause mismatches with browser behavior (e.g. `[attr=" foo "]`). Now matches align with the CSS specification and browser engines. [#2380](https://github.com/jhy/jsoup/issues/2380)
@@ -28,6 +31,10 @@
* When using StructuralEvaluators (e.g., a `parent child` selector) across many retained threads, their memoized results could also be retained, increasing memory use. These results are now cleared immediately after use, reducing overall memory consumption. [#2411](https://github.com/jhy/jsoup/issues/2411)
* Cloning a `Parser` now preserves any custom `TagSet` applied to the parser. [#2422](https://github.com/jhy/jsoup/issues/2422), [#2423](https://github.com/jhy/jsoup/pull/2423)
* Custom tags marked as `Tag.Void` now parse and serialize like the built-in void elements: they no longer consume following content, and the XML serializer emits the expected self-closing form. [#2425](https://github.com/jhy/jsoup/issues/2425)
* The `<br>` element is once again classified as an inline tag (`Tag.isBlock() == false`), matching common developer expectations and its role as phrasing content in HTML, while pretty-printing and text extraction continue to treat it as a line break in the rendered output. [#2387](https://github.com/jhy/jsoup/issues/2387), [#2439](https://github.com/jhy/jsoup/issues/2439)
* Fixed an intermittent truncation when fetching and parsing remote documents via `Jsoup.connect(url).get()`. On responses without a charset header, the initial charset sniff could sometimes (depending on buffering / `available()` behavior) be mistaken for end-of-stream and a partial parse reused, dropping trailing content. [#2448](https://github.com/jhy/jsoup/issues/2448)
* TagSet copies no longer mutate their template during lazy lookups, preventing cross-thread `ConcurrentModificationException` when parsing with shared sessions. [#2453](https://github.com/jhy/jsoup/pull/2453)


### Internal Changes
* Deprecated internal helper `org.jsoup.internal.Functions` (for removal in v1.23.1). This was previously used to support older Android API levels without full `java.util.function` coverage; jsoup now requires core library desugaring so this indirection is no longer necessary. [#2412](https://github.com/jhy/jsoup/pull/2412)
@@ -369,4 +376,4 @@

---
Older changes for versions 0.1.1 (2010-Jan-31) through 1.17.1 (2023-Nov-27) may be found in
[change-archive.txt](./change-archive.txt).
[change-archive.txt](./change-archive.txt).
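
The TagSet entry under Bug Fixes describes a copy whose lazy lookups wrote back into its template. As a rough illustration of why that matters, here is a minimal plain-Java model, not jsoup code: `LazyRegistry`, its `get` method, and the string values are all hypothetical stand-ins. It shows that a lookup which memoizes into a shared `HashMap` is a structural modification, so it can trip the fail-fast iterator of anything walking that map. The sketch triggers this on a single thread to keep it deterministic; across threads the same write is an unsynchronized data race.

```java
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a lazily populated tag registry; not jsoup's TagSet. */
final class LazyRegistry {
    final Map<String, String> tags = new HashMap<>();
    private final LazyRegistry source; // fallback consulted on a local miss; may be null

    LazyRegistry(LazyRegistry source) {
        this.source = source;
    }

    /** Looks like a read, but memoizes hits from the source into this registry's own map. */
    String get(String name) {
        String tag = tags.get(name);
        if (tag == null && source != null) {
            tag = source.get(name); // the source may memoize into its own map as well
            if (tag != null) tags.put(name, tag);
        }
        return tag;
    }

    /** Returns true if a lookup through a chained copy disturbed an iteration over the template. */
    static boolean lookupBreaksIteration() {
        LazyRegistry root = new LazyRegistry(null);
        root.tags.put("div", "block");
        root.tags.put("span", "inline");
        root.tags.put("p", "block");

        LazyRegistry template = new LazyRegistry(root);
        template.get("div");
        template.get("span"); // the template now holds two memoized entries

        LazyRegistry copy = new LazyRegistry(template); // the copy is chained to the template itself

        try {
            for (String name : template.tags.keySet()) {
                // Miss in the copy, then a miss in the template, which writes "p" into template.tags.
                copy.get("p");
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("template mutated during iteration: " + lookupBreaksIteration());
    }
}
```

In this model the only way to make `get` on the copy safe for the template is to stop routing misses through the template at all.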
42 changes: 33 additions & 9 deletions src/main/java/org/jsoup/parser/TagSet.java
@@ -25,24 +25,48 @@ public class TagSet {
static final TagSet HtmlTagSet = initHtmlDefault();

private final Map<String, Map<String, Tag>> tags = new HashMap<>(); // namespace -> tag name -> Tag
private final @Nullable TagSet source; // source to pull tags from on demand
private final @Nullable TagSet source; // internal fallback for lazy tag copies
private @Nullable ArrayList<Consumer<Tag>> customizers; // optional onNewTag tag customizer

/**
Returns a mutable copy of the default HTML tag set.
*/
public static TagSet Html() {
return new TagSet(HtmlTagSet);
return new TagSet(HtmlTagSet, null);
}

private TagSet(@Nullable TagSet source, @Nullable ArrayList<Consumer<Tag>> customizers) {
this.source = source;
this.customizers = customizers;
}

public TagSet() {
source = null;
this(null, null);
}

/**
Creates a new TagSet by copying the current tags and customizers from the provided source TagSet. Changes made to
one TagSet will not affect the other.
@param template the TagSet to copy
*/
public TagSet(TagSet template) {
this(template.source, copyCustomizers(template));
// copy tags eagerly; any lazy pull-through should come only from the root source (which would be the HTML defaults), not the template itself.
// that way the template tagset is not mutated when we do read through
if (template.tags.isEmpty()) return;

for (Map.Entry<String, Map<String, Tag>> namespaceEntry : template.tags.entrySet()) {
Map<String, Tag> nsTags = new HashMap<>(namespaceEntry.getValue().size());
for (Map.Entry<String, Tag> tagEntry : namespaceEntry.getValue().entrySet()) {
nsTags.put(tagEntry.getKey(), tagEntry.getValue().clone());
}
tags.put(namespaceEntry.getKey(), nsTags);
}
}

public TagSet(TagSet original) {
this.source = original;
if (original.customizers != null)
this.customizers = new ArrayList<>(original.customizers);
private static @Nullable ArrayList<Consumer<Tag>> copyCustomizers(TagSet base) {
if (base.customizers == null) return null;
return new ArrayList<>(base.customizers);
}

/**
@@ -212,7 +236,7 @@ static TagSet initHtmlDefault() {
String[] blockTags = {
"html", "head", "body", "frameset", "script", "noscript", "style", "meta", "link", "title", "frame",
"noframes", "section", "nav", "aside", "hgroup", "header", "footer", "p", "h1", "h2", "h3", "h4", "h5",
"h6", "br", "button",
"h6", "button",
"ul", "ol", "pre", "div", "blockquote", "hr", "address", "figure", "figcaption", "form", "fieldset", "ins",
"del", "dl", "dt", "dd", "li", "table", "caption", "thead", "tfoot", "tbody", "colgroup", "col", "tr", "th",
"td", "video", "audio", "canvas", "details", "menu", "plaintext", "template", "article", "main",
@@ -279,4 +303,4 @@ private TagSet setupTags(String namespace, String[] tagNames, Consumer<Tag> tagM
}
return this;
}
}
}
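
The reworked copy constructor in this file takes the template's `source` rather than the template, and copies the template's tags eagerly. A minimal sketch of that shape, again as a plain-Java model rather than the real class: `EagerRegistry` and `copyOf` are hypothetical names, and string values stand in for `Tag` objects (the real constructor clones each `Tag`, which immutable strings do not need).

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical model of an eagerly copied registry with a root-only fallback; not jsoup's TagSet. */
final class EagerRegistry {
    final Map<String, String> tags = new HashMap<>();
    private final EagerRegistry source; // root fallback consulted on a local miss; may be null

    EagerRegistry(EagerRegistry source) {
        this.source = source;
    }

    /** Snapshots the template's entries and inherits the template's root, never the template itself. */
    static EagerRegistry copyOf(EagerRegistry template) {
        EagerRegistry copy = new EagerRegistry(template.source);
        copy.tags.putAll(template.tags); // values are immutable Strings here, so no per-entry clone
        return copy;
    }

    /** Memoizing lookup: a local miss falls through to the source and caches the hit locally. */
    String get(String name) {
        String tag = tags.get(name);
        if (tag == null && source != null) {
            tag = source.get(name);
            if (tag != null) tags.put(name, tag);
        }
        return tag;
    }

    public static void main(String[] args) {
        EagerRegistry root = new EagerRegistry(null);
        root.tags.put("div", "block");

        EagerRegistry template = new EagerRegistry(root);
        template.tags.put("custom", "inline");

        EagerRegistry copy = copyOf(template);
        copy.get("div");                    // served by root, cached in the copy only
        template.tags.put("bar", "inline"); // added after the copy was taken

        System.out.println(template.tags.containsKey("div")); // was the template written to by the lookup?
        System.out.println(copy.get("bar"));                  // does the copy see later template additions?
    }
}
```

The trade-off in this model is the one the updated `testCloneCopyTagSet` assertion records: a copy is a snapshot, so tags added to the template afterwards are no longer visible through it.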
51 changes: 50 additions & 1 deletion src/test/java/org/jsoup/integration/SessionTest.java
@@ -7,11 +7,14 @@
import org.jsoup.integration.servlets.FileServlet;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;
import org.jsoup.parser.Tag;
import org.jsoup.parser.TagSet;
import org.jsoup.select.Elements;
import org.junit.jupiter.api.BeforeAll;
import org.junit.jupiter.api.Test;

import java.io.IOException;
import java.lang.reflect.Field;
import java.util.Map;

import static org.junit.jupiter.api.Assertions.assertEquals;
@@ -134,4 +137,50 @@ public void testCanChangeParsers() throws IOException {
Document doc3 = session.newRequest().url(xmlUrl).get();
assertEquals(xmlVal, doc3.html()); // did not blow away xml default
}
}

@Test
public void sessionTagSetDoesNotMutateRoot() {
Connection session = Jsoup.newSession();
TagSet rootTags = session.request().parser().tagSet();

int rootNamespacesBefore = tagSetNamespaceCount(rootTags);

Connection request = session.newRequest();
Parser parser = request.request().parser();
parser.parseInput("<custom>One <b>Two</b></custom>", "http://example.com/");

int rootNamespacesAfter = tagSetNamespaceCount(rootTags);
assertEquals(rootNamespacesBefore, rootNamespacesAfter);
}

@Test
public void sessionTagSetCustomizerDoesNotMutateRoot() {
Connection session = Jsoup.newSession();
TagSet rootTags = session.request().parser().tagSet();
rootTags.onNewTag(tag -> {
if (!tag.isKnownTag())
tag.set(Tag.RcData);
});

int rootNamespacesBefore = tagSetNamespaceCount(rootTags);

Connection request = session.newRequest();
Parser parser = request.request().parser();
Document doc = parser.parseInput("<custom>One <b>Two</b></custom>", "https://example.com/");
assertEquals(0, doc.select("custom b").size());

int rootNamespacesAfter = tagSetNamespaceCount(rootTags);
assertEquals(rootNamespacesBefore, rootNamespacesAfter);
}

private static int tagSetNamespaceCount(TagSet tagSet) {
try {
Field tagsField = TagSet.class.getDeclaredField("tags");
tagsField.setAccessible(true);
Map<?, ?> tags = (Map<?, ?>) tagsField.get(tagSet);
return tags.size();
} catch (ReflectiveOperationException e) {
throw new RuntimeException(e);
}
}
}
6 changes: 3 additions & 3 deletions src/test/java/org/jsoup/parser/ParserTest.java
@@ -94,9 +94,9 @@ public void testCloneCopyTagSet() {
// Ensure onNewTag customizers are retained
Tag custom = clone.tagSet().valueOf("qux", Parser.NamespaceHtml);
assertTrue(custom.isSelfClosing());
// Check that cloned tagset uses the original tag as source when original is modified
// Check that cloned tagset does not observe modifications made to the original
assertNull(clone.tagSet().get("bar", Parser.NamespaceHtml));
parser.tagSet().add(new Tag("bar"));
assertNotNull(clone.tagSet().get("bar", Parser.NamespaceHtml));
assertNull(clone.tagSet().get("bar", Parser.NamespaceHtml));
}
}
}
38 changes: 37 additions & 1 deletion src/test/java/org/jsoup/parser/TagSetTest.java
@@ -4,6 +4,10 @@
import org.jsoup.nodes.Element;
import org.junit.jupiter.api.Test;

import java.lang.reflect.Field;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

import static org.jsoup.parser.Parser.NamespaceHtml;
import static org.junit.jupiter.api.Assertions.*;

@@ -182,4 +186,36 @@ public class TagSetTest {
assertTrue(copy.valueOf("custom-tag", NamespaceHtml).is(Tag.Void));
assertFalse(source.valueOf("custom-tag", NamespaceHtml).is(Tag.Void));
}
}

@Test void copyPullThroughDoesNotMutateSource() {
TagSet source = TagSet.Html();
TagSet copy = new TagSet(source);

int sourceNamespacesBefore = tagSetNamespaceCount(source);
assertNotNull(copy.get("div", NamespaceHtml));
int sourceNamespacesAfter = tagSetNamespaceCount(source);
assertEquals(sourceNamespacesBefore, sourceNamespacesAfter);
}

@Test void copyPullWithCustomizerThroughDoesNotMutateSource() {
TagSet source = TagSet.Html();
TagSet copy = new TagSet(source);

AtomicInteger sourceAdds = new AtomicInteger();
source.onNewTag(tag -> sourceAdds.incrementAndGet());

assertNotNull(copy.get("div", NamespaceHtml));
assertEquals(0, sourceAdds.get());
}

private static int tagSetNamespaceCount(TagSet tagSet) {
try {
Field tagsField = TagSet.class.getDeclaredField("tags");
tagsField.setAccessible(true);
Map<?, ?> tags = (Map<?, ?>) tagsField.get(tagSet);
return tags.size();
} catch (ReflectiveOperationException e) {
throw new RuntimeException(e);
}
}
}
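
Both `SessionTest` and `TagSetTest` carry an identical `tagSetNamespaceCount` reflection helper. One possible way to share it is sketched below under assumed names (`PrivateMaps`, `sizeOf`, and `Holder` are hypothetical and not part of this PR). `Holder` only stands in for a class with a private map so the example runs without jsoup.

```java
import java.lang.reflect.Field;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical shared test utility; the name and location are assumptions, not part of the PR. */
final class PrivateMaps {
    private PrivateMaps() {}

    /** Returns the size of the Map held in the named field declared directly on target's class. */
    static int sizeOf(Object target, String fieldName) {
        try {
            Field field = target.getClass().getDeclaredField(fieldName);
            field.setAccessible(true);
            Map<?, ?> map = (Map<?, ?>) field.get(target);
            return map.size();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Could not read field " + fieldName, e);
        }
    }

    /** Stand-in for a class with a private map, used only to exercise the helper. */
    static final class Holder {
        private final Map<String, String> tags = new HashMap<>();

        void add(String name) {
            tags.put(name, name);
        }
    }

    public static void main(String[] args) {
        Holder holder = new Holder();
        holder.add("div");
        System.out.println(sizeOf(holder, "tags"));
    }
}
```

As in the tests above, the helper measures the outer map only. Applied to `TagSet`, whose `tags` field maps namespace to tag name to `Tag`, that counts namespaces rather than individual tags.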