Name predicates: which XML version? #607

Open
@rossabaker

I am trying to implement a ScalaCheck XML generator that round trips through writing and parsing. I've run into a discrepancy between the character sets in scala-xml and the JVM internals. Is scala-xml's alphabet expected to target a specific version of the XML spec? I'm finding that the scala-xml alphabet matches neither the JVM's idea of XML 1.0 nor its idea of XML 1.1.

I tried to make this a scala-cli script, but I can't get it to accept the com.sun.org imports. I have to run this on Java 8 (specifically, I used 1.8.0_292) to avoid trouble with the module system.

import com.sun.org.apache.xml.internal.utils.XMLChar
import com.sun.org.apache.xml.internal.utils.XML11Char
import scala.xml.Utility

object Chars extends App {
  val allChars = (Char.MinValue to Char.MaxValue)

  val charSets = Map(
    "scala-xml-start"  -> ((c: Char) => Utility.isNameStart(c)),
    "xml-1.0-start"    -> ((c: Char) => XMLChar.isNameStart(c)),
    "xml-1.1-start"    -> ((c: Char) => XML11Char.isXML11NameStart(c)),

    "scala-xml"  -> ((c: Char) => Utility.isNameChar(c)),
    "xml-1.0"    -> ((c: Char) => XMLChar.isName(c)),
    "xml-1.1"    -> ((c: Char) => XML11Char.isXML11Name(c)),
  )

  def compare(a: String, b: String) = {
    val diff = allChars.filter(charSets(a)).filterNot(charSets(b))
    println(s"In ${a}, not ${b}: ${diff.size}")
    println(diff.take(10))
    println()
  }

  compare("scala-xml-start", "xml-1.0-start")
  compare("xml-1.0-start", "scala-xml-start")

  compare("scala-xml-start", "xml-1.1-start")
  compare("xml-1.1-start", "scala-xml-start")

  compare("scala-xml", "xml-1.0")
  compare("xml-1.0", "scala-xml")

  compare("scala-xml", "xml-1.1")
  compare("xml-1.1", "scala-xml")
}

scala-xml

In scala-xml-start, not xml-1.0-start: 13800
Vector(ª, µ, º, IJ, ij, Ŀ, ŀ, ʼn, ſ, DŽ)

In xml-1.0-start, not scala-xml-start: 11
Vector(ʻ, ʼ, ʽ, ʾ, ʿ, ˀ, ˁ, ՙ, ۥ, ۦ)

In scala-xml-start, not xml-1.1-start: 3
Vector(ª, µ, º)

In xml-1.1-start, not scala-xml-start: 5700
Vector(ʰ, ʱ, ʲ, ʳ, ʴ, ʵ, ʶ, ʷ, ʸ, ʹ)

In scala-xml, not xml-1.0: 14993
Vector(ª, µ, º, IJ, ij, Ŀ, ŀ, ʼn, ſ, DŽ)

In xml-1.0, not scala-xml: 4
Vector(·, ۝, ۞, ℮)

In scala-xml, not xml-1.1: 3
Vector(ª, µ, º)

In xml-1.1, not scala-xml: 4021
Vector(˂, ˃, ˄, ˅, ˒, ˓, ˔, ˕, ˖, ˗)

I think I can limit my generators to characters that pass both the JVM's and scala-xml's predicates, but I'm curious whether this difference is known and intentional. Thanks!
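
For reference, the workaround of keeping only characters that every predicate accepts could be sketched roughly like this. The names `CommonAlphabet`, `common`, and `CommonAlphabetDemo` are hypothetical, and the two predicates in the demo are plain stand-ins so the snippet runs without scala-xml or the `com.sun.org` classes on the classpath. They are not the real name predicates.

```scala
object CommonAlphabet {
  // Every UTF-16 code unit, the same range the script above iterates over.
  private val allChars: Seq[Char] = Char.MinValue to Char.MaxValue

  /** Characters accepted by every predicate in `preds` (hypothetical helper). */
  def common(preds: (Char => Boolean)*): Vector[Char] =
    allChars.filter(c => preds.forall(p => p(c))).toVector
}

object CommonAlphabetDemo {
  // Stand-in predicates (assumptions for illustration only).
  val asciiLetter: Char => Boolean =
    c => (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
  val javaLetter: Char => Boolean =
    c => Character.isLetter(c)

  // Only characters that pass both stand-ins survive.
  val shared: Vector[Char] = CommonAlphabet.common(asciiLetter, javaLetter)

  def main(args: Array[String]): Unit =
    println(s"shared alphabet size: ${shared.size}")
}
```

With scala-xml and the JDK internals available, the stand-ins would presumably be swapped for `Utility.isNameStart` and `XMLChar.isNameStart` (and likewise `Utility.isNameChar` and `XMLChar.isName`), wrapped as `Char => Boolean` functions the way the `charSets` map does. The resulting vectors could then be handed to ScalaCheck's `Gen.oneOf`.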
