Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use a ThreadLocal to allow reusing parser instances #176

Merged
merged 2 commits into from
Mar 4, 2021
Merged

Use a ThreadLocal to allow reusing parser instances #176

merged 2 commits into from
Mar 4, 2021

Conversation

fgoepel
Copy link
Contributor

@fgoepel fgoepel commented Nov 16, 2017

Currently scala-xml creates a new SAXParser instance every time. This is not ideal for perfomance, but explained by the fact that SAXParser is not thread safe. We can greatly increase performance by using a ThreadLocal, giving each thread their own instance.

From a quick benchmark this is ~6x faster in both the single- and multithreaded case:

Parse XML (stock):
115 μs, 109 μs, 114 μs, 109 μs, 110 μs, 110 μs, 112 μs, 113 μs, 110 μs, 108 μs, 
median: 110 μs

Parse XML (stock async):
69 μs, 36 μs, 39 μs, 50 μs, 38 μs, 41 μs, 39 μs, 40 μs, 38 μs, 41 μs, 
median: 40 μs

Parse XML (new):
18 μs, 18 μs, 27 μs, 16 μs, 17 μs, 16 μs, 17 μs, 16 μs, 17 μs, 17 μs, 
median: 17 μs

Parse XML (new async):
5 μs, 7 μs, 10 μs, 4 μs, 7 μs, 8 μs, 10 μs, 6 μs, 7 μs, 7 μs, 
median: 7 μs

Currently scala-xml creates a new SAXParser instance every time. This
is not very good for performance, but explained by the fact that
SAXParser is not thread safe. We can greatly increase performance by
using a ThreadLocal, giving each thread their own instance.
@SethTisue
Copy link
Member

are there implications for Scala.js here...?

@ashawley
Copy link
Member

SAXParser is a Java library. Confusingly, a lot of source code is under shared, but not all of it actually works in Scala.js at runtime.

@fgoepel
Copy link
Contributor Author

fgoepel commented Nov 16, 2017

are there implications for Scala.js here...?

As far as I can tell it should be ok. I've made the ThreadLocal a lazy val so that Scala.js can eliminate it, as it otherwise warned about a reference to missing class SAXParser$.

@ashawley
Copy link
Member

Yeah, looks like the code has always been one-parser, one-file rule for a long time. Presumably, it was this way for similar reasons as the Scala.js issue -- of avoiding missing ClassNotFoundException, but back in the day it was a consequence of various make, model and year of JVM.

@ashawley
Copy link
Member

Given the performance considerations, this seems worthwhile. Unfortunately, we can't entertain breaks in binary compatibility because of a defect in sbt that exposes the scala-xml dependency that ships with the compiler.

Looking at the patch, it doesn't seem there is a way around that. Although, I didn't toy with it.

[info] scala-xml: found 1 potential binary incompatibilities while checking against org.scala-lang.modules:scala-xml_2.11:1.0.6 
[error]  * synthetic method scala$xml$factory$XMLLoader$$parserInstance()java.lang.ThreadLocal in trait scala.xml.factory.XMLLoader is present only in current version
[error]    filter with: ProblemFilters.exclude[ReversedMissingMethodProblem]("scala.xml.factory.XMLLoader.scala$xml$factory$XMLLoader$$parserInstance")

I'm keeping this open, because this and other improvements will be motivators for breaking binary compatibility in the, hopefully, very near future.

@fgoepel
Copy link
Contributor Author

fgoepel commented Feb 22, 2018

@ashawley That's a shame. I wasn't aware of that sbt binary compatibility issue. Hopefully that can be resolved somehow.

@ashawley ashawley added this to the 2.0 milestone Feb 23, 2018

parser.setNamespaceAware(false)
parser.newSAXParser()
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cool

@SethTisue
Copy link
Member

@ashawley this change would be eligible for the 2.0 series, I think...?

@fgoepel
Copy link
Contributor Author

fgoepel commented Dec 6, 2020

Just a quick note that if #177 is going to be merged as well, one or the other will need to be rebased depending on which is merged first. Let me know how you want to handle that.

@SethTisue SethTisue self-assigned this Mar 3, 2021
@SethTisue
Copy link
Member

@shado23 let's land #176 first (after one last short wait for feedback/objections, since I did a round of publicity today), then this one.

@SethTisue
Copy link
Member

oh #177 depends on this one, so the other order makes more sense, it seems

@SethTisue SethTisue merged commit 43dc3ef into scala:master Mar 4, 2021
@fgoepel
Copy link
Contributor Author

fgoepel commented Mar 6, 2021

@SethTisue I see you've already rebased the patch. Thanks.

@lrytz
Copy link
Member

lrytz commented Oct 24, 2024

Very old PR 👋 Anyone knows whether we should call reset here?

  /* Override this to use a different SAXParser. */
  def parser: SAXParser = parserInstance.get

https://docs.oracle.com/en/java/javase/21/docs/api/java.xml/javax/xml/parsers/SAXParser.html#reset()

reset() is designed to allow the reuse of existing SAXParsers thus saving resources associated with the creation of new SAXParsers.

@fgoepel
Copy link
Contributor Author

fgoepel commented Oct 24, 2024

I'm not sure. Looking at the reset method implementation it appears to mainly reset settings and content/DTD handlers, which might otherwise leak through to the next invocation, which may be expecting a blank slate. So it does seem like calling reset would be the right thing to do, to avoid this potential issue. It looks like the reset should not get rid of the security-relevant settings we apply, but that should be verified to be safe.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants