-
Notifications
You must be signed in to change notification settings - Fork 92
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use a ThreadLocal to allow reusing parser instances #176
Conversation
Currently scala-xml creates a new SAXParser instance every time. This is not very good for performance, but explained by the fact that SAXParser is not thread safe. We can greatly increase performance by using a ThreadLocal, giving each thread their own instance.
are there implications for Scala.js here...? |
SAXParser is a Java library. Confusingly, a lot of source code is under |
As far as I can tell it should be ok. I've made the ThreadLocal a lazy val so that Scala.js can eliminate it, as it otherwise warned about a reference to missing class SAXParser$. |
Yeah, looks like the code has always been one-parser, one-file rule for a long time. Presumably, it was this way for similar reasons as the Scala.js issue -- of avoiding missing ClassNotFoundException, but back in the day it was a consequence of various make, model and year of JVM. |
Given the performance considerations, this seems worthwhile. Unfortunately, we can't entertain breaks in binary compatibility because of a defect in sbt that exposes the scala-xml dependency that ships with the compiler. Looking at the patch, it doesn't seem there is a way around that. Although, I didn't toy with it.
I'm keeping this open, because this and other improvements will be motivators for breaking binary compatibility in the, hopefully, very near future. |
@ashawley That's a shame. I wasn't aware of that sbt binary compatibility issue. Hopefully that can be resolved somehow. |
|
||
parser.setNamespaceAware(false) | ||
parser.newSAXParser() | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cool
@ashawley this change would be eligible for the 2.0 series, I think...? |
Just a quick note that if #177 is going to be merged as well, one or the other will need to be rebased depending on which is merged first. Let me know how you want to handle that. |
oh #177 depends on this one, so the other order makes more sense, it seems |
@SethTisue I see you've already rebased the patch. Thanks. |
Very old PR 👋 Anyone knows whether we should call /* Override this to use a different SAXParser. */
def parser: SAXParser = parserInstance.get https://docs.oracle.com/en/java/javase/21/docs/api/java.xml/javax/xml/parsers/SAXParser.html#reset()
|
I'm not sure. Looking at the reset method implementation it appears to mainly reset settings and content/DTD handlers, which might otherwise leak through to the next invocation, which may be expecting a blank slate. So it does seem like calling reset would be the right thing to do, to avoid this potential issue. It looks like the reset should not get rid of the security-relevant settings we apply, but that should be verified to be safe. |
Currently scala-xml creates a new SAXParser instance every time. This is not ideal for perfomance, but explained by the fact that SAXParser is not thread safe. We can greatly increase performance by using a ThreadLocal, giving each thread their own instance.
From a quick benchmark this is ~6x faster in both the single- and multithreaded case: