Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Verify that we get errors if the files contain a DOCTYPE #560

Merged
merged 1 commit into from
May 26, 2021

Conversation

mbeckerle
Copy link
Contributor

@mbeckerle mbeckerle commented May 12, 2021

If the checks all pass, then this is for review. Note that I do not plan to merge this until after the 3.1.0 release.

Tests that show this error detection for each of

  • DFDL schemas
  • included DFDL schemas
  • imported DFDL schemas
  • TDML files
  • XML Infoset files used in TDML tests
  • Config files used in TDML tests
  • External variable binding files

Eliminating use of DOCTYPE also eliminates any possibility of
XML General or Paremter Entities.

Consolidated loader/validators of XML and XSD.

Everything that loads XML, XSD, TDML, config, external vars,
or ".dfdl.xsd", now uses DaffodilXMLLoader to do it.

Specific validators that were for Config files or
External Variable Binding files are now gone.

DaffodilXMLLoader was simplified and made more uniform.
Single purpose mixin traits and adapters were eliminated.

The validation in DaffodilXMLLoader uses two methods.

First it uses the XercesValidator (used by the validation feature).

Second it does a validating load with Xerces. This seems to
catch/report different validation problems. (Instrumentation to
prove this may be worth it, so that we can get rid of redundant work
if possible)

DFDL schemas are validated by constructing an XML Schema
object from them, as well as by loading them and validating
against the XML Schema for DFDL schemas.

The DaffodilXMLLoader validates (if requested) using a supplied XML
Schema. But then always loads using the DaffodilConstructingParser
which uses the underlying ConstructingParser - this is needed because
in many cases we are dependent on properly handling CDATA regions
(e.g., TDML files, test infoset XML files, etc) which Xerces doesn't
do properly.

I attempted to implement DAFFODIL-288 to validate the infoset XML
(before unparsing) also but was unsuccessful so far, but a TODO
DAFFODIL-288 marks the place where that fix goes.

New validation is more uniform, and thorough. This caught a number of
small issues like missing "tdml:" prefixes on numerous files'
testSuite elements.
There were various adjustments to accomodate the
more strict validation.

Fixes to SAX due to simple types now being series of Text, Atom, and
EntityRef.

Various other small fixes to TDML runner to insure no diagnostic
errors are being hidden.

DAFFODIL-1422, DAFFODIL-1659

@mbeckerle
Copy link
Contributor Author

All windows tests are failing currently. A CRLF issue - loaders are likely now standardizing CR to LF where they were perhaps not before. It's somehow related to the use of the constructing parser instead of xerces when loading.

@tuxji
Copy link
Contributor

tuxji commented May 17, 2021

Should I take a look now even though the tests are still failing with one error: [error] Test org.apache.daffodil.xml.TestXMLLoaderWithLocation.testFile1 failed: null, took 0.028 sec?

@mbeckerle
Copy link
Contributor Author

mbeckerle commented May 17, 2021 via email

@mbeckerle mbeckerle changed the title Verify that we get errors if the files contain a DOCTYPE WIP: Verify that we get errors if the files contain a DOCTYPE May 17, 2021
@mbeckerle
Copy link
Contributor Author

Changed back to a WIP - not yet ready for review.

I have discovered there is state sharing causing a multi-thread error. Tests pass when run one-by one in an IDE, or individually under sbt, but when run as a batch under sbt they fail.

@mbeckerle mbeckerle changed the title WIP: Verify that we get errors if the files contain a DOCTYPE Verify that we get errors if the files contain a DOCTYPE May 17, 2021
@mbeckerle
Copy link
Contributor Author

If the checks all pass, then this is ready for review.

@tuxji
Copy link
Contributor

tuxji commented May 21, 2021

Should I review now or wait? I wanted to wait until all the checks pass (which they do now), but your comments about some tests needing unexplainable changes makes me wonder if we should wait for more investigation.

Copy link
Contributor Author

@mbeckerle mbeckerle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Responded to all the review comments. Quite a bit of cleanup here.

There are a couple of CLI tests where I left in LOUD comments to review them, because I'm not sure why I had to change them at all.

class DaffodilConstructingLoader(uri: URI,
errorHandler: org.xml.sax.ErrorHandler,
addPositionAttributes: Boolean = false,
normalizeCRLFtoLF: Boolean = true)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To centralize the loader in one place so that one can insure it's always set up to be secure, you then need that one loader to support different modes of operation.

We only need the line/file position stuff when we're loading DFDL schemas, and TDML files since those embed DFDL Schemas. Other XML loading (like TDML infoset files) don't need that.

The normalize CRLF to LF stuff should always be true, as that is the behavior of the regular xerces loader, but was not the case for the ConstructingParser which we need to use because we need CDATA nodes preserved. There was some code depending on this non-CRLF-normalizing behavior that had crept in. I needed this flag to figure things out.

At this point, I could remove the normalizeCRLFtoLF flag, because it is only used by tests that show it works.
Given how non-standard loading XML, while preserving CRLFs or CRs is, I should probably remove it.

I am leaving this feature in for now, because this is, in effect, an incompatible change to the TDML runner. It's possible people other than us have TDML, and that TDML may be depending on the TDML Runner loading the tdml file, or infoset XML files, with CRLFs preserved.

}
finalVersion
}
val othersmatch = otherStrings.forall { case st: String =>
val stIsEmpty = st == null || res == ""
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. I'll fix this and see if anything breaks.

val dp =
val dp = {
// Always serialize the parser/unparser objects.
// This insures nothing is left uncomputed by lazy evaluation.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've removed this from here. It now only saves and reloads if useSerializedProcessor.
This is the compiler's problem to insure that everything is fully evaluated, not code out here using the data processor.

project/Dependencies.scala Outdated Show resolved Hide resolved
@mbeckerle
Copy link
Contributor Author

mbeckerle commented May 21, 2021 via email

@mbeckerle
Copy link
Contributor Author

This change set also fixes DAFFODIL-1816

override def newXMLReaderInstance: DFDL.DaffodilParseXMLReader = new DaffodilParseXMLReader(this)
override def newXMLReaderInstance: DFDL.DaffodilParseXMLReader = {
val xrdr = new DaffodilParseXMLReader(this)
XMLUtils.setSecureDefaults(xrdr)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thoughts on just removing this call? We aren't getting a random XMLReader here based on some factory that happens to be on the classpath. We know it's a Daffodil specific XML reader, and we know it's secure from these XML issues. Removing this would potentially mean we could remove the new logic added in set/getFeature to accept features that we don't really support.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another way to finesse this is to make DaffodilParseXMLReader use a factory and have a private constructor. That way how it is initialized after construction need not be the caller's responsibility. I will change this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd argue that calling setSecureDefaults is no one's responsiblity, including ours. So a factory just adds extra API complexity. The DaffodilXMLReader doesn't actually read XML (only implements the XMLReader API)--there are no concerns about reading XML securely, so no one, including us, ever needs to set it in a secure mode.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I think we need scaladoc to point out that this is by definition secure if we remove the ability to set the secure features, otherwise someone's going to try it, and end up not liking that they can't set it secure.

@stevedlawrence
Copy link
Member

There was a comment I found in an email but can't currently find it in github, whree @mbeckerle said:

Gone. However, I would say, that this pluggable XML readers stuff feels like stuff of a bygone era when everyone thought there would be competition around XML libraries, and pluggable XML technologies from vendors, etc. This seems silly in retrospect. But maybe it is needed for backward bug-for-bug compatibility in some cases.

For us, I don't think pluggable XMLReader makes any sense. We're far too sensitive to exactly their characteristics.

think we should explicitly check for the exact XMLReader classes we support, and error if it isn't that.

Thoughts on this check?

I'm wondering if we should just be explicitly creating the XML Reader that we are expecting? As you said, we depend on very specific behavior from a specif XMLReader, so using something like SAXParserFactory.newInstance().newSAXParser().getXMLReader to get some XMLReader seems like a mistake. I'm wondering if there's a way to directly create a Xerces XMLReader. Maybe we should use the overloaded newInstance function that takes a class name? Or maybe we can directly create the right class ourselves, and avoid this pluggable logic? Then we really can assume that we have a certain XMLReader.

Would it then even be possible to have a daffodil-lib function that gets the things XMLReader stuff we need and set the secure defaults, so there's only one place we need to call setSecureDefaults, with the assumptino that everything gets the XMLReader from our new daffodil-lib function?

@mbeckerle
Copy link
Contributor Author

re: constructing the specific SAXParser class we need. I've introduced a new DaffodilSAXParserFactory, which insures that the SAXParser one gets is exactly the Xerces one. Also insures that its XMLReader is secure.

Copy link
Contributor

@tuxji tuxji left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

We're getting close to the end. I think just a few minor cleanups are left.

@stevedlawrence
Copy link
Member

Had one more though, this mostly changes how we parse things like TDML files, DFDL Schemas, etc. But I think it leaves out some of the InfosetInputters? I assume the Scala, JDOM, and W3CDOM don't have these security issues since the XML was parsed into these data structures and so are safe to use. But the XMLTextInfosetInputter and JSON use the woodstox and jackson libraries, respectively. Neither of these use an XMLReader of parse. The XML one is StAX which uses a different API. And the JSON one is custom, but similar to StAX.

It looks like the Woodstox StAX properties are defined in three blog posts:
https://cowtowncoder.medium.com/configuring-woodstox-xml-parser-basic-stax-properties-39bdf88c18ec
https://cowtowncoder.medium.com/configuring-woodstox-xml-parser-stax2-properties-c80ef5a32ef1
https://cowtowncoder.medium.com/configuring-woodstox-xml-parser-woodstox-specific-properties-1ce5030a5173

I think these are the javadocs for what an XMLInputFactory supports:
https://docs.oracle.com/javase/6/docs/api/javax/xml/stream/XMLInputFactory.html
https://fasterxml.github.io/woodstox/javadoc/6.2/com/ctc/wstx/api/WstxInputProperties.html

Do we need to set something similar for these? Could probably be done in a separate bug/Pr if you want. This one is getting kind of big.

@stevedlawrence
Copy link
Member

+1 with @tuxji comments addressed

Copy link
Contributor Author

@mbeckerle mbeckerle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the just pushed changes (2 more commits) address all the comments.

I did add tests and changes to XMLTextInfosetInputter and JDOMInfosetInputter to reject DOCTYPE/DTDs in those also.

@@ -48,8 +47,7 @@ class TestSAXParseUnparseAPI {
val pr = parseXMLReader.getProperty(XMLUtils.DAFFODIL_SAX_URN_PARSERESULT).asInstanceOf[ParseResult]
assertTrue(!pr.isError)
assertEquals(expectedInfoset, scala.xml.XML.loadString(baosParse.toString))

val unparseXMLReader = SAXParserFactory.newInstance().newSAXParser().getXMLReader
val unparseXMLReader = DaffodilSAXParserFactory().newSAXParser().getXMLReader
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So SAPI and JAPI get to create their own SAXParserFactory instances, SAX parsers, etc. We don't provide constructors for those to construct our secure-enforcing variants, but we include code in the SAPI/JAPI tests showing how to set things up to be secure. This code is not shared with our XMLUtils, because XMLUtils is not part of our external API.

I have gone into Schematron, and updated it to use DaffodilSAXParserFactory, which removes another redundant set of "security parameterization".

@@ -40,6 +40,7 @@
import org.apache.commons.io.IOUtils;
import org.apache.daffodil.japi.*;
import org.apache.daffodil.japi.infoset.XMLTextInfosetOutputter;
import org.apache.daffodil.xml.XMLUtils;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;

import javax.xml.XMLConstants;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

XMLUtils is not for use in JAPI/SAPI, so we define a separate means to set secure defaults in these tests.

xmlReader.setFeature("http://xml.org/sax/features/external-parameter-entities", false);
xmlReader.setFeature("http://xml.org/sax/features/external-general-entities", false);
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would want to decompose daffodil-lib into smaller libraries - e.g., just the xml utils library, just the schema-compiler libraries (like the OOLAG stuff), etc. before making any of its features be part of a published API.

xmlReader.setFeature("http://xml.org/sax/features/external-parameter-entities", false)
xmlReader.setFeature("http://xml.org/sax/features/external-general-entities", false)
}
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're not factoring this out as part of our API. Hence this redundant code here, becauuse we're not providing this in an API-consumable method.

f.newSAXParser.getXMLReader
val xr = f.newSAXParser.getXMLReader
XMLUtils.setSecureDefaults(xr)
xr
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Change made.

project/Dependencies.scala Outdated Show resolved Hide resolved
@@ -257,7 +257,7 @@
* WritableByteChannel wbc = java.nio.channels.Channels.newChannel(os);
* DaffodilUnparseContentHandler unparseContentHandler = dp.newContentHandlerInstance(wbc);
* try {
* XMLReader xmlReader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
* XMLReader xmlReader = SAXParserFactory.newInstance().newSAXParser().getXMLReader;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will fix. I have never been a fan of having code snippets in javadoc/scaladoc. It is almost guaranteed to not work. I prefer to put it into examples or tests, and put a URL in the javadoc/scaladoc referring people to that code.

Copy link
Contributor Author

@mbeckerle mbeckerle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the just pushed changes (2 more commits) address all the comments.

I did add tests and changes to XMLTextInfosetInputter and JDOMInfosetInputter to reject DOCTYPE/DTDs in those also.

@tuxji
Copy link
Contributor

tuxji commented May 26, 2021

+1

I reviewed the three new commits. First and third commits indeed address all my comments and the changes look great. Second commit has a slight problem with test coverage. Your @test def testXMLTextInfosetInputterDisallowsDocType() tried several ways to load XML securely with the Woodstox XML library and found the best way was to both turn off features and add a doctype check / throw exception in XMLTextInfosetInputter.scala. I believe you then got inspired to add a doctype check / throw exception to a similar place in JDOMInfosetInputter.scala but that line doesn't get called by any test either because we either need a new test to check JDOM secure loading or JDOM doesn't let programs see the doctype at all when you enable secure XML loading. If the latter is the case, It's probably a good idea to keep the check in case of malicious input and just add a comment saying so along with comments to disable codecov.

Copy link
Member

@stevedlawrence stevedlawrence left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Tests that show this error detection for each of
* DFDL schemas
* included DFDL schemas
* imported DFDL schemas
* TDML files
* XML Infoset files used in TDML tests
* Config files used in TDML tests
* External variable binding files
* XML input to XML Text Infoset Inputters (using Woodstox library)
* XML input to JDOM infoset Inputters
* XML input to Scala XML Infoset Inputters

Eliminating use of DOCTYPE also eliminates any possibility of
XML General or Paremter Entities.

Consolidated loader/validators of XML and XSD.

Everything that loads XML, XSD, TDML, config, external vars,
or ".dfdl.xsd", now uses DaffodilXMLLoader to do it.

Specific validators that were for Config files or
External Variable Binding files are now gone.

DaffodilXMLLoader was simplified and made more uniform.
Single purpose mixin traits and adapters were eliminated.

The validation in DaffodilXMLLoader uses two methods.

First it uses the XercesValidator (used by the validation feature).

Second it does a validating load with Xerces. This seems to
catch/report different validation problems. (Instrumentation to
prove this may be worth it, so that we can get rid of redundant work
if possible)

DFDL schemas are validated by constructing an XML Schema
object from them, as well as by loading them and validating
against the XML Schema for DFDL schemas.

The DaffodilXMLLoader validates (if requested) using a supplied XML
Schema. But then always loads using the DaffodilConstructingParser
which uses the underlying ConstructingParser - this is needed because
in many cases we are dependent on properly handling CDATA regions
(e.g., TDML files, test infoset XML files, etc) which Xerces doesn't
do properly.

I attempted to implement DAFFODIL-288 to validate the infoset XML
(before unparsing) also but was unsuccessful, but a TODO
DAFFODIL-288 marks the place where that fix goes.

New validation is more uniform, and thorough. This caught a number of
small issues like missing "tdml:" prefixes on numerous files'
testSuite elements.
There were various adjustments to accomodate the
more strict validation.

Changes to SAX due to simple types now being series of Text, Atom, and
EntityRef.

Various other small fixes to TDML runner to insure no diagnostic
errors are being hidden.

Fix MS-Windows failure due to CRLF issues. There should be less CRLF
sensitivity now.  The new unified DaffodilXMLLoader which we use
everywhere always normalizes CRLF to LF and isolated CR to LF.
This is done in Text, CDATA, and COMMENT objects.

The only reason we use the constructing parser now is the behavior
around CDATA/PCDATA nodes, which is broken in Xerces. There are tests
to characterize this behavior in Xerces so if it does get fixed we can
adapt. However, if it did get fixed it would require a mode switch to
turn this different behavior on, so we probably just won't notice.

Upgraded scala-xml library to version 2.0.0

TDMLRunner no longer gets NPE in one situation.

Also documented why we need the 2nd xerces load beyond just the
regular XercesValidator call, which is for xsi:schemaLocation.

Note added about xsi:noNamespaceSchemaLocation

Added comments about useDefaultNamespace in tdml.xsd. Also
added a TODO in the code. We really want this to be false by
default, but 81 tests in daffodil-test fail if you change that, so not
doing in this change set.

DAFFODIL-1422, DAFFODIL-1659, DAFFODIL-1816
@mbeckerle mbeckerle merged commit 62ac104 into apache:master May 26, 2021
@mbeckerle mbeckerle deleted the daf-1422-doctype branch May 26, 2021 18:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants