-
Notifications
You must be signed in to change notification settings - Fork 74
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Verify that we get errors if the files contain a DOCTYPE #560
Conversation
All windows tests are failing currently. A CRLF issue - loaders are likely now standardizing CR to LF where they were perhaps not before. It's somehow related to the use of the constructing parser instead of xerces when loading. |
Should I take a look now even though the tests are still failing with one error: |
Nah. I have quite a bit to look at here.
There is a race condition due to multithreaded test execution where tests pass if run individually, but fail if run collectively.
So the loader changes I've made are somewhere sharing state.
I have to chase that down.
I am changing this back to a WIP.
…________________________________
From: John Interrante ***@***.***>
Sent: Monday, May 17, 2021 10:30 AM
To: apache/daffodil ***@***.***>
Cc: Beckerle, Mike ***@***.***>; Author ***@***.***>
Subject: Re: [apache/daffodil] Verify that we get errors if the files contain a DOCTYPE (#560)
Should I take a look now even though the tests are still failing with one error: [error] Test org.apache.daffodil.xml.TestXMLLoaderWithLocation.testFile1 failed: null, took 0.028 sec?
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub<#560 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AALUDA7H62RPY24ZWAOXUNLTOESAPANCNFSM44ZBN3MA>.
|
Changed back to a WIP - not yet ready for review. I have discovered there is state sharing causing a multi-thread error. Tests pass when run one-by one in an IDE, or individually under sbt, but when run as a batch under sbt they fail. |
If the checks all pass, then this is ready for review. |
daffodil-core/src/test/scala/org/apache/daffodil/processor/TestSAXUtils.scala
Show resolved
Hide resolved
daffodil-lib/src/main/scala/org/apache/daffodil/xml/DaffodilXMLLoader.scala
Show resolved
Hide resolved
daffodil-lib/src/main/scala/org/apache/daffodil/xml/DaffodilXMLLoader.scala
Outdated
Show resolved
Hide resolved
daffodil-lib/src/main/scala/org/apache/daffodil/xml/DaffodilXMLLoader.scala
Show resolved
Hide resolved
daffodil-lib/src/main/scala/org/apache/daffodil/xml/DaffodilXMLLoader.scala
Outdated
Show resolved
Hide resolved
daffodil-lib/src/main/resources/org/apache/daffodil/xsd/tdml.xsd
Outdated
Show resolved
Hide resolved
daffodil-core/src/main/scala/org/apache/daffodil/dsom/DFDLSchemaFile.scala
Show resolved
Hide resolved
daffodil-core/src/test/scala/org/apache/daffodil/processor/TestSAXUtils.scala
Outdated
Show resolved
Hide resolved
daffodil-core/src/test/scala/org/apache/daffodil/processor/TestSAXUtils.scala
Outdated
Show resolved
Hide resolved
daffodil-core/src/test/scala/org/apache/daffodil/util/TestUtils.scala
Outdated
Show resolved
Hide resolved
daffodil-japi/src/test/java/org/apache/daffodil/example/TestJavaAPI.java
Outdated
Show resolved
Hide resolved
...-processor/src/main/scala/org/apache/daffodil/tdml/processor/DaffodilTDMLDFDLProcessor.scala
Show resolved
Hide resolved
daffodil-tdml-processor/src/test/scala/org/apache/daffodil/tdml/TestTDMLRunnerConfig.scala
Outdated
Show resolved
Hide resolved
daffodil-tdml-processor/src/test/scala/org/apache/daffodil/tdml/TestTDMLUnparseCases.scala
Outdated
Show resolved
Hide resolved
daffodil-test/src/test/resources/org/apache/daffodil/section06/entities/charClassEntities.tdml
Outdated
Show resolved
Hide resolved
daffodil-tdml-lib/src/main/scala/org/apache/daffodil/tdml/TDMLRunner.scala
Outdated
Show resolved
Hide resolved
Should I review now or wait? I wanted to wait until all the checks pass (which they do now), but your comments about some tests needing unexplainable changes makes me wonder if we should wait for more investigation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Responded to all the review comments. Quite a bit of cleanup here.
There are a couple of CLI tests where I left in LOUD comments to review them, because I'm not sure why I had to change them at all.
daffodil-core/src/main/scala/org/apache/daffodil/dsom/DFDLSchemaFile.scala
Show resolved
Hide resolved
daffodil-core/src/test/scala/org/apache/daffodil/processor/TestSAXUtils.scala
Outdated
Show resolved
Hide resolved
daffodil-lib/src/main/scala/org/apache/daffodil/validation/XercesValidator.scala
Show resolved
Hide resolved
class DaffodilConstructingLoader(uri: URI, | ||
errorHandler: org.xml.sax.ErrorHandler, | ||
addPositionAttributes: Boolean = false, | ||
normalizeCRLFtoLF: Boolean = true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To centralize the loader in one place so that one can insure it's always set up to be secure, you then need that one loader to support different modes of operation.
We only need the line/file position stuff when we're loading DFDL schemas, and TDML files since those embed DFDL Schemas. Other XML loading (like TDML infoset files) don't need that.
The normalize CRLF to LF stuff should always be true, as that is the behavior of the regular xerces loader, but was not the case for the ConstructingParser which we need to use because we need CDATA nodes preserved. There was some code depending on this non-CRLF-normalizing behavior that had crept in. I needed this flag to figure things out.
At this point, I could remove the normalizeCRLFtoLF flag, because it is only used by tests that show it works.
Given how non-standard loading XML, while preserving CRLFs or CRs is, I should probably remove it.
I am leaving this feature in for now, because this is, in effect, an incompatible change to the TDML runner. It's possible people other than us have TDML, and that TDML may be depending on the TDML Runner loading the tdml file, or infoset XML files, with CRLFs preserved.
} | ||
finalVersion | ||
} | ||
val othersmatch = otherStrings.forall { case st: String => | ||
val stIsEmpty = st == null || res == "" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. I'll fix this and see if anything breaks.
daffodil-tdml-processor/src/main/scala/org/apache/daffodil/tdml/TDMLInfosetInputter.scala
Show resolved
Hide resolved
val dp = | ||
val dp = { | ||
// Always serialize the parser/unparser objects. | ||
// This insures nothing is left uncomputed by lazy evaluation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've removed this from here. It now only saves and reloads if useSerializedProcessor.
This is the compiler's problem to insure that everything is fully evaluated, not code out here using the data processor.
daffodil-tdml-processor/src/test/scala/org/apache/daffodil/tdml/TestTDMLUnparseCases.scala
Outdated
Show resolved
Hide resolved
I just pushed fixes to all SL's comments.
Barring major failure again on the CI tests, yes now would be good for a review.
…________________________________
From: John Interrante ***@***.***>
Sent: Friday, May 21, 2021 10:26 AM
To: apache/daffodil ***@***.***>
Cc: Beckerle, Mike ***@***.***>; Author ***@***.***>
Subject: Re: [apache/daffodil] Verify that we get errors if the files contain a DOCTYPE (#560)
Should I review now or wait? I wanted to wait until all the checks pass (which they do now), but your comments about some tests needing unexplainable changes makes me wonder if we should wait for more investigation.
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub<#560 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AALUDA7EG42JQFOODSD3X5LTOZURLANCNFSM44ZBN3MA>.
|
This change set also fixes DAFFODIL-1816 |
daffodil-core/src/main/scala/org/apache/daffodil/runtime1/SchemaSetRuntime1Mixin.scala
Show resolved
Hide resolved
override def newXMLReaderInstance: DFDL.DaffodilParseXMLReader = new DaffodilParseXMLReader(this) | ||
override def newXMLReaderInstance: DFDL.DaffodilParseXMLReader = { | ||
val xrdr = new DaffodilParseXMLReader(this) | ||
XMLUtils.setSecureDefaults(xrdr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thoughts on just removing this call? We aren't getting a random XMLReader here based on some factory that happens to be on the classpath. We know it's a Daffodil specific XML reader, and we know it's secure from these XML issues. Removing this would potentially mean we could remove the new logic added in set/getFeature to accept features that we don't really support.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another way to finesse this is to make DaffodilParseXMLReader use a factory and have a private constructor. That way how it is initialized after construction need not be the caller's responsibility. I will change this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd argue that calling setSecureDefaults is no one's responsiblity, including ours. So a factory just adds extra API complexity. The DaffodilXMLReader doesn't actually read XML (only implements the XMLReader API)--there are no concerns about reading XML securely, so no one, including us, ever needs to set it in a secure mode.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. I think we need scaladoc to point out that this is by definition secure if we remove the ability to set the secure features, otherwise someone's going to try it, and end up not liking that they can't set it secure.
daffodil-cli/src/it/scala/org/apache/daffodil/debugger/TestCLIDebugger.scala
Outdated
Show resolved
Hide resolved
daffodil-core/src/main/scala/org/apache/daffodil/runtime1/SchemaSetRuntime1Mixin.scala
Outdated
Show resolved
Hide resolved
There was a comment I found in an email but can't currently find it in github, whree @mbeckerle said:
I'm wondering if we should just be explicitly creating the XML Reader that we are expecting? As you said, we depend on very specific behavior from a specif XMLReader, so using something like Would it then even be possible to have a daffodil-lib function that gets the things XMLReader stuff we need and set the secure defaults, so there's only one place we need to call setSecureDefaults, with the assumptino that everything gets the XMLReader from our new daffodil-lib function? |
re: constructing the specific SAXParser class we need. I've introduced a new DaffodilSAXParserFactory, which insures that the SAXParser one gets is exactly the Xerces one. Also insures that its XMLReader is secure. |
daffodil-japi/src/main/java/org/apache/daffodil/japi/package-info.java
Outdated
Show resolved
Hide resolved
daffodil-lib/src/main/scala/org/apache/daffodil/xml/DaffodilSAXParserFactory.scala
Outdated
Show resolved
Hide resolved
...-processor/src/main/scala/org/apache/daffodil/tdml/processor/DaffodilTDMLDFDLProcessor.scala
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
We're getting close to the end. I think just a few minor cleanups are left.
daffodil-cli/src/it/scala/org/apache/daffodil/debugger/TestCLIDebugger.scala
Outdated
Show resolved
Hide resolved
daffodil-cli/src/it/scala/org/apache/daffodil/debugger/TestCLIDebugger.scala
Outdated
Show resolved
Hide resolved
daffodil-core/src/main/scala/org/apache/daffodil/dsom/DFDLSchemaFile.scala
Show resolved
Hide resolved
daffodil-core/src/main/scala/org/apache/daffodil/dsom/DFDLSchemaFile.scala
Show resolved
Hide resolved
daffodil-sapi/src/test/scala/org/apache/daffodil/example/TestScalaAPI.scala
Show resolved
Hide resolved
daffodil-sapi/src/test/scala/org/apache/daffodil/example/TestScalaAPI.scala
Show resolved
Hide resolved
daffodil-tdml-lib/src/main/scala/org/apache/daffodil/tdml/TDMLRunner.scala
Show resolved
Hide resolved
...-processor/src/main/scala/org/apache/daffodil/tdml/processor/DaffodilTDMLDFDLProcessor.scala
Show resolved
Hide resolved
Had one more though, this mostly changes how we parse things like TDML files, DFDL Schemas, etc. But I think it leaves out some of the InfosetInputters? I assume the Scala, JDOM, and W3CDOM don't have these security issues since the XML was parsed into these data structures and so are safe to use. But the XMLTextInfosetInputter and JSON use the woodstox and jackson libraries, respectively. Neither of these use an XMLReader of parse. The XML one is StAX which uses a different API. And the JSON one is custom, but similar to StAX. It looks like the Woodstox StAX properties are defined in three blog posts: I think these are the javadocs for what an XMLInputFactory supports: Do we need to set something similar for these? Could probably be done in a separate bug/Pr if you want. This one is getting kind of big. |
+1 with @tuxji comments addressed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe the just pushed changes (2 more commits) address all the comments.
I did add tests and changes to XMLTextInfosetInputter and JDOMInfosetInputter to reject DOCTYPE/DTDs in those also.
daffodil-core/src/main/scala/org/apache/daffodil/dsom/DFDLSchemaFile.scala
Show resolved
Hide resolved
@@ -48,8 +47,7 @@ class TestSAXParseUnparseAPI { | |||
val pr = parseXMLReader.getProperty(XMLUtils.DAFFODIL_SAX_URN_PARSERESULT).asInstanceOf[ParseResult] | |||
assertTrue(!pr.isError) | |||
assertEquals(expectedInfoset, scala.xml.XML.loadString(baosParse.toString)) | |||
|
|||
val unparseXMLReader = SAXParserFactory.newInstance().newSAXParser().getXMLReader | |||
val unparseXMLReader = DaffodilSAXParserFactory().newSAXParser().getXMLReader |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So SAPI and JAPI get to create their own SAXParserFactory instances, SAX parsers, etc. We don't provide constructors for those to construct our secure-enforcing variants, but we include code in the SAPI/JAPI tests showing how to set things up to be secure. This code is not shared with our XMLUtils, because XMLUtils is not part of our external API.
I have gone into Schematron, and updated it to use DaffodilSAXParserFactory, which removes another redundant set of "security parameterization".
@@ -40,6 +40,7 @@ | |||
import org.apache.commons.io.IOUtils; | |||
import org.apache.daffodil.japi.*; | |||
import org.apache.daffodil.japi.infoset.XMLTextInfosetOutputter; | |||
import org.apache.daffodil.xml.XMLUtils; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed
import org.xml.sax.SAXNotSupportedException; | ||
import org.xml.sax.XMLReader; | ||
|
||
import javax.xml.XMLConstants; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
XMLUtils is not for use in JAPI/SAPI, so we define a separate means to set secure defaults in these tests.
xmlReader.setFeature("http://xml.org/sax/features/external-parameter-entities", false); | ||
xmlReader.setFeature("http://xml.org/sax/features/external-general-entities", false); | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would want to decompose daffodil-lib into smaller libraries - e.g., just the xml utils library, just the schema-compiler libraries (like the OOLAG stuff), etc. before making any of its features be part of a published API.
xmlReader.setFeature("http://xml.org/sax/features/external-parameter-entities", false) | ||
xmlReader.setFeature("http://xml.org/sax/features/external-general-entities", false) | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're not factoring this out as part of our API. Hence this redundant code here, becauuse we're not providing this in an API-consumable method.
f.newSAXParser.getXMLReader | ||
val xr = f.newSAXParser.getXMLReader | ||
XMLUtils.setSecureDefaults(xr) | ||
xr |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. Change made.
...-processor/src/main/scala/org/apache/daffodil/tdml/processor/DaffodilTDMLDFDLProcessor.scala
Show resolved
Hide resolved
@@ -257,7 +257,7 @@ | |||
* WritableByteChannel wbc = java.nio.channels.Channels.newChannel(os); | |||
* DaffodilUnparseContentHandler unparseContentHandler = dp.newContentHandlerInstance(wbc); | |||
* try { | |||
* XMLReader xmlReader = SAXParserFactory.newInstance().newSAXParser().getXMLReader(); | |||
* XMLReader xmlReader = SAXParserFactory.newInstance().newSAXParser().getXMLReader; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will fix. I have never been a fan of having code snippets in javadoc/scaladoc. It is almost guaranteed to not work. I prefer to put it into examples or tests, and put a URL in the javadoc/scaladoc referring people to that code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe the just pushed changes (2 more commits) address all the comments.
I did add tests and changes to XMLTextInfosetInputter and JDOMInfosetInputter to reject DOCTYPE/DTDs in those also.
+1 I reviewed the three new commits. First and third commits indeed address all my comments and the changes look great. Second commit has a slight problem with test coverage. Your @test def testXMLTextInfosetInputterDisallowsDocType() tried several ways to load XML securely with the Woodstox XML library and found the best way was to both turn off features and add a doctype check / throw exception in XMLTextInfosetInputter.scala. I believe you then got inspired to add a doctype check / throw exception to a similar place in JDOMInfosetInputter.scala but that line doesn't get called by any test either because we either need a new test to check JDOM secure loading or JDOM doesn't let programs see the doctype at all when you enable secure XML loading. If the latter is the case, It's probably a good idea to keep the check in case of malicious input and just add a comment saying so along with comments to disable codecov. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
Tests that show this error detection for each of * DFDL schemas * included DFDL schemas * imported DFDL schemas * TDML files * XML Infoset files used in TDML tests * Config files used in TDML tests * External variable binding files * XML input to XML Text Infoset Inputters (using Woodstox library) * XML input to JDOM infoset Inputters * XML input to Scala XML Infoset Inputters Eliminating use of DOCTYPE also eliminates any possibility of XML General or Paremter Entities. Consolidated loader/validators of XML and XSD. Everything that loads XML, XSD, TDML, config, external vars, or ".dfdl.xsd", now uses DaffodilXMLLoader to do it. Specific validators that were for Config files or External Variable Binding files are now gone. DaffodilXMLLoader was simplified and made more uniform. Single purpose mixin traits and adapters were eliminated. The validation in DaffodilXMLLoader uses two methods. First it uses the XercesValidator (used by the validation feature). Second it does a validating load with Xerces. This seems to catch/report different validation problems. (Instrumentation to prove this may be worth it, so that we can get rid of redundant work if possible) DFDL schemas are validated by constructing an XML Schema object from them, as well as by loading them and validating against the XML Schema for DFDL schemas. The DaffodilXMLLoader validates (if requested) using a supplied XML Schema. But then always loads using the DaffodilConstructingParser which uses the underlying ConstructingParser - this is needed because in many cases we are dependent on properly handling CDATA regions (e.g., TDML files, test infoset XML files, etc) which Xerces doesn't do properly. I attempted to implement DAFFODIL-288 to validate the infoset XML (before unparsing) also but was unsuccessful, but a TODO DAFFODIL-288 marks the place where that fix goes. New validation is more uniform, and thorough. This caught a number of small issues like missing "tdml:" prefixes on numerous files' testSuite elements. There were various adjustments to accomodate the more strict validation. Changes to SAX due to simple types now being series of Text, Atom, and EntityRef. Various other small fixes to TDML runner to insure no diagnostic errors are being hidden. Fix MS-Windows failure due to CRLF issues. There should be less CRLF sensitivity now. The new unified DaffodilXMLLoader which we use everywhere always normalizes CRLF to LF and isolated CR to LF. This is done in Text, CDATA, and COMMENT objects. The only reason we use the constructing parser now is the behavior around CDATA/PCDATA nodes, which is broken in Xerces. There are tests to characterize this behavior in Xerces so if it does get fixed we can adapt. However, if it did get fixed it would require a mode switch to turn this different behavior on, so we probably just won't notice. Upgraded scala-xml library to version 2.0.0 TDMLRunner no longer gets NPE in one situation. Also documented why we need the 2nd xerces load beyond just the regular XercesValidator call, which is for xsi:schemaLocation. Note added about xsi:noNamespaceSchemaLocation Added comments about useDefaultNamespace in tdml.xsd. Also added a TODO in the code. We really want this to be false by default, but 81 tests in daffodil-test fail if you change that, so not doing in this change set. DAFFODIL-1422, DAFFODIL-1659, DAFFODIL-1816
452a8fe
to
a910410
Compare
If the checks all pass, then this is for review. Note that I do not plan to merge this until after the 3.1.0 release.
Tests that show this error detection for each of
Eliminating use of DOCTYPE also eliminates any possibility of
XML General or Paremter Entities.
Consolidated loader/validators of XML and XSD.
Everything that loads XML, XSD, TDML, config, external vars,
or ".dfdl.xsd", now uses DaffodilXMLLoader to do it.
Specific validators that were for Config files or
External Variable Binding files are now gone.
DaffodilXMLLoader was simplified and made more uniform.
Single purpose mixin traits and adapters were eliminated.
The validation in DaffodilXMLLoader uses two methods.
First it uses the XercesValidator (used by the validation feature).
Second it does a validating load with Xerces. This seems to
catch/report different validation problems. (Instrumentation to
prove this may be worth it, so that we can get rid of redundant work
if possible)
DFDL schemas are validated by constructing an XML Schema
object from them, as well as by loading them and validating
against the XML Schema for DFDL schemas.
The DaffodilXMLLoader validates (if requested) using a supplied XML
Schema. But then always loads using the DaffodilConstructingParser
which uses the underlying ConstructingParser - this is needed because
in many cases we are dependent on properly handling CDATA regions
(e.g., TDML files, test infoset XML files, etc) which Xerces doesn't
do properly.
I attempted to implement DAFFODIL-288 to validate the infoset XML
(before unparsing) also but was unsuccessful so far, but a TODO
DAFFODIL-288 marks the place where that fix goes.
New validation is more uniform, and thorough. This caught a number of
small issues like missing "tdml:" prefixes on numerous files'
testSuite elements.
There were various adjustments to accomodate the
more strict validation.
Fixes to SAX due to simple types now being series of Text, Atom, and
EntityRef.
Various other small fixes to TDML runner to insure no diagnostic
errors are being hidden.
DAFFODIL-1422, DAFFODIL-1659