Name predicates: which XML version? #607

Open
rossabaker opened this issue Jun 18, 2022 · 2 comments

@rossabaker

I am trying to implement a Scalacheck XML generator that round trips through writing and parsing. I've run into a discrepancy between the character sets in scala-xml and the JVM internals. Is it expected that scala-xml's alphabet targets a specific version of the XML spec? I'm finding that the scala-xml alphabet does not match the JVM's idea of XML 1.0 nor XML 1.1.

I tried to make this a scala-cli script, but I can't get it to accept the com.sun.org imports. I have to run this on Java 8 (specifically, I used 1.8.0_292) to avoid trouble with the module system.

import com.sun.org.apache.xml.internal.utils.XMLChar
import com.sun.org.apache.xml.internal.utils.XML11Char
import scala.xml.Utility

object Chars extends App {
  val allChars = (Char.MinValue to Char.MaxValue)

  val charSets = Map(
    "scala-xml-start"  -> ((c: Char) => Utility.isNameStart(c)),
    "xml-1.0-start"    -> ((c: Char) => XMLChar.isNameStart(c)),
    "xml-1.1-start"    -> ((c: Char) => XML11Char.isXML11NameStart(c)),

    "scala-xml"  -> ((c: Char) => Utility.isNameChar(c)),
    "xml-1.0"    -> ((c: Char) => XMLChar.isName(c)),
    "xml-1.1"    -> ((c: Char) => XML11Char.isXML11Name(c)),
  )

  def compare(a: String, b: String) = {
    val diff = allChars.filter(charSets(a)).filterNot(charSets(b))
    println(s"In ${a}, not ${b}: ${diff.size}")
    println(diff.take(10))
    println()
  }

  compare("scala-xml-start", "xml-1.0-start")
  compare("xml-1.0-start", "scala-xml-start")

  compare("scala-xml-start", "xml-1.1-start")
  compare("xml-1.1-start", "scala-xml-start")

  compare("scala-xml", "xml-1.0")
  compare("xml-1.0", "scala-xml")

  compare("scala-xml", "xml-1.1")
  compare("xml-1.1", "scala-xml")
}

scala-xml

In scala-xml-start, not xml-1.0-start: 13800
Vector(ª, µ, º, Ĳ, ĳ, Ŀ, ŀ, ŉ, ſ, Ǆ)

In xml-1.0-start, not scala-xml-start: 11
Vector(ʻ, ʼ, ʽ, ʾ, ʿ, ˀ, ˁ, ՙ, ۥ, ۦ)

In scala-xml-start, not xml-1.1-start: 3
Vector(ª, µ, º)

In xml-1.1-start, not scala-xml-start: 5700
Vector(ʰ, ʱ, ʲ, ʳ, ʴ, ʵ, ʶ, ʷ, ʸ, ʹ)

In scala-xml, not xml-1.0: 14993
Vector(ª, µ, º, Ĳ, ĳ, Ŀ, ŀ, ŉ, ſ, Ǆ)

In xml-1.0, not scala-xml: 4
Vector(·, ۝, ۞, ℮)

In scala-xml, not xml-1.1: 3
Vector(ª, µ, º)

In xml-1.1, not scala-xml: 4021
Vector(˂, ˃, ˄, ˅, ˒, ˓, ˔, ˕, ˖, ˗)

I think I can limit my generators to characters that pass both the JVM's and scala-xml's predicates, but I'm curious whether this difference is known and intentional. Thanks!
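
For illustration, the "pass both predicates" idea could be sketched roughly as below. This is only a sketch: `CommonAlphabet`, `commonChars`, and the two stand-in predicates are hypothetical names, chosen so the snippet runs without scala-xml or the `com.sun.org` internals. In the setting of this issue the predicates would be `Utility.isNameStart` and `XMLChar.isNameStart` from the script above.

```scala
object CommonAlphabet {
  // Keep only the characters that every predicate in `preds` accepts.
  def commonChars(preds: Seq[Char => Boolean]): IndexedSeq[Char] =
    (Char.MinValue to Char.MaxValue).filter(c => preds.forall(p => p(c)))

  // Stand-in predicates (assumptions for this sketch, not the real XML ones).
  val asciiLetter: Char => Boolean =
    c => (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
  val jvmLetter: Char => Boolean =
    c => Character.isLetter(c)

  // Characters that both stand-in predicates agree on.
  val safe: IndexedSeq[Char] = commonChars(Seq(asciiLetter, jvmLetter))
}
```

A ScalaCheck generator could then draw from `CommonAlphabet.safe`, for example with `Gen.oneOf`, so that every generated name character is valid under each implementation being compared.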

@ashawley
Member

I'm sure it's just XML 1.0 and not 1.1. I'm also not surprised it's inconsistent with the spec. The implementation for isNameChar in scala-xml hasn't fundamentally changed in 20 years. There's probably not someone around to explain the rationale for the differences.

@rossabaker
Author

I did some homework. tl;dr:

  • scala-xml doesn't match the spec it references in the scaladoc
  • there are Type 1 and Type 2 errors vs. all recent specs
  • the definition is dynamic with the JVM version

I am willing to update docs or synchronize the predicates with a particular XML standard.

--

What's defined in scalacheck-xml is fully consistent with the JDK's XMLChar. This is XML 1.0, Fourth Edition. I found an ancient rant about Fifth Edition, which is the status quo in Xerces.

The scaladoc on TokenTests (which Utility extends) refers to XML 1.0's Appendix B, which holds the Fourth Edition rules, now "orphaned" in Fifth Edition. That spec is based on Unicode 2.0 (JDK 1.1 era!), with some complicated exceptions.

  • ª (0xaa) is excluded by the spec because it has "a font or compatibility decomposition".
  • ʻ (0x2bb) is included by the spec "because the property file classifies them as Alphabetic".
  • I could go on, but it's incredibly arcane, and the deviations I found probably all look similar to this.

Furthermore, since scala-xml just delegates to Unicode character types, its predicates are a function of the JVM version. XML 1.0 Fifth Edition's "intention is to be inclusive rather than exclusive, so that writing systems not yet encoded in Unicode can be used in XML names," but it's still a fixed set.
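
To make the contrast concrete, here is a rough sketch of both styles of predicate. The ranges are the NameStartChar and NameChar productions from XML 1.0 Fifth Edition. `NamePredicates` and `categoryNameStart` are hypothetical names: the latter only illustrates a predicate built on `Character.getType` and is not scala-xml's actual source.

```scala
object NamePredicates {
  // XML 1.0 Fifth Edition NameStartChar, as inclusive code point ranges.
  private val nameStartRanges: Seq[(Int, Int)] = Seq(
    (0x3A, 0x3A), (0x41, 0x5A), (0x5F, 0x5F), (0x61, 0x7A), // ':', A-Z, '_', a-z
    (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0x2FF), (0x370, 0x37D),
    (0x37F, 0x1FFF), (0x200C, 0x200D), (0x2070, 0x218F),
    (0x2C00, 0x2FEF), (0x3001, 0xD7FF), (0xF900, 0xFDCF),
    (0xFDF0, 0xFFFD), (0x10000, 0xEFFFF)
  )

  // Additional ranges allowed by NameChar but not by NameStartChar.
  private val nameOnlyRanges: Seq[(Int, Int)] = Seq(
    (0x2D, 0x2D), (0x2E, 0x2E), (0x30, 0x39),               // '-', '.', 0-9
    (0xB7, 0xB7), (0x300, 0x36F), (0x203F, 0x2040)
  )

  private def inRanges(cp: Int, ranges: Seq[(Int, Int)]): Boolean =
    ranges.exists { case (lo, hi) => lo <= cp && cp <= hi }

  // Fixed set: the answer is the same on every JVM.
  def isNameStartChar(cp: Int): Boolean = inRanges(cp, nameStartRanges)
  def isNameChar(cp: Int): Boolean =
    isNameStartChar(cp) || inRanges(cp, nameOnlyRanges)

  // Hypothetical category-based predicate: the answer follows whatever
  // Unicode tables ship with the running JVM.
  private val startCategories: Set[Int] = Set(
    Character.UPPERCASE_LETTER.toInt, Character.LOWERCASE_LETTER.toInt,
    Character.TITLECASE_LETTER.toInt, Character.OTHER_LETTER.toInt,
    Character.LETTER_NUMBER.toInt
  )
  def categoryNameStart(c: Char): Boolean =
    c == '_' || startCategories(Character.getType(c))
}
```

With these definitions, ª (0xaa) is rejected by the fixed ranges but accepted by the category check, while ʻ (0x2bb) is accepted by the fixed ranges but rejected by the category check because it is a modifier letter.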
