Skip to content

neomantic-zz/gedcomxml

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

43 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

RELAX NG/Schematron Schema for GEDCOM 5.5 XML (version 0.1)

Chad Albers
     __________________________________________________________________

Background

   The Church of the Latter-day Saints' [1]GEDCOM 5.5 Standard provides a
   way to format, digitally store, and transfer genealogical data in a
   standardized, human readable text file. Over the years, new standards,
   using the [2]Extensible Markup Language (XML), have been proposed. The
   web page [3]here provides a list of these proposals.

   None of these proposals have gained widespread acceptance, and GEDCOM
   5.5 remains the format that nearly all genealogical programs export to
   and that users still exchange their genealogical data using.

GEDCOM 5.5 XML

   XML proponents can console themselves with the fact that the GEDCOM 5.5
   standard bares some resemblance to a XML markup language. Despite its
   non-standard character set (Ansel), it is human readable text. Instead
   of XML elements, it delimits genealogical data between a text tag and
   character return and line feed or both; the tag precedes the data.
   Finally, GEDCOM 5.5 tags can be nested under other tags, resembling the
   parent and child node trees of XML elements.

   Given these similarities, it is easy to envision an one-to-one
   translation of GEDCOM 5.5 into a XML language. GEDCOM 5.5 tags could be
   XML elements; XML's greater than '>' and less than '<' characters could
   surround the tags, converting them into elements. Open and closed
   elements could delimit genealogical data. To differentiate it from
   other proposals, such a XML markup language could be called "GEDCOM 5.5
   XML."

   Two XML markup languages have come close to describing GEDCOM 5.5 XML:
   GedML and GeniML.

GedML

   In [4]1998, Michael H. Kay proposed [5]GedML. A XML Schema, a DTD, for
   this proposal was [6]released. In GedML's DTD, most of the GEDCOM 5.5
   tags have been translated into XML elements with the same name; e.g.,
   INDI tags are simply <INDI> elements. Other tags, though, have been
   translated; e.g., SOUR tags are <Source> elements. GedML is also
   missing the <TRLR> element which replicates the GEDCOM 5.5 TRLR tag.
   This tag indicates the end of a GEDCOM 5.5 file. It is absent from
   GedML because the closed document root element, </GED>, performs the
   same function, rendering a <TRLR/> element superfluous. Considering
   these differences, as far as the DTD is concerned, GedML is not an
   one-to-one translation of GEDCOM 5.5 into XML.

   Kay, however, also released a Java language program that translates
   GEDCOM 5.5 files into XML documents. The markup language of these
   documents does not have a XML schema (DTD, W3C XML Schema, RELAX NG).
   However, a review of the output of his program reveals that all GEDCOM
   5.5 tags have been translated to XML elements with the same name, just
   like GEDCOM 5.5 XML. There is, though, one missing element: the one
   corresponding to the TRLR tag. Given this flaw, this version of GedML
   also fails to be an one-to-one translation of GEDCOM 5.5 into XML.

GeniML

   Jerry Fitzpatrick of Software Renovation Corporation developed
   [7]GeniML. Like Kay, he released a Windows [8]program to convert GEDCOM
   5.5 to XML. No schema for this markup language has been released.
   However, sample output included with the converter shows that GeniML
   appears to be an one-to-one translation of GEDCOM 5.5 into XML. GeniML
   even includes the <TRLR> element. There are, however, some differences;
   for example, GeniML delimits surname data using a <SURNAME> element,
   when GEDCOM 5.5 uses a SURN tag for the same purpose. Considering this
   difference and many others, GeniML, like GedML, is not an one-to-one
   translation of GEDCOM 5.5 into XML markup.

RELAX NG/Schematron Schema for GEDCOM 5.5 XML

   GEDCOM 5.5 XML differentiates itself from GedML and GeniML because it
   attempts to replicate the LDS's GEDCOM 5.5 standard using XML markup.
   Without exception, all GEDCOM 5.5 tags should correspond to XML
   elements with the same name; all tags should be preserved; the
   parent-child relationships between the tags and elements should
   parallel one another; and all data delimited by the elements should
   fall within the strict guidelines of the standard.

   To define GEDCOM 5.5 XML, a schema is needed. I am releasing one
   written in [9]RELAX NG/[10]Schematron. The RELAX NG XML version of the
   schema is contained in the file called "gedcom55XML.rng". The RELAX NG
   [11]compact syntax version is in the file called "gedcom55XML.rnc".

   Like any XML schema, it can be used to validate, in the XML sense, a
   GEDCOM 5.5 file that has been converted to a GEDCOM 5.5 XML document;
   i.e., it can be used to determine if the document is both "well-formed"
   and "valid." This means that it can be used to test if the GEDCOM 5.5
   tags (when converted to XML elements) are correctly nested within each
   other as prescribed by the LDS's GEDCOM 5.5 standard. It can also be
   used to test if the data, delimited by the elements, follows the strict
   guidelines prescribed by the standard.

GEDCOM 5.5 XML URI

   http://www.neomantic.com/gedcom55XML is the namespace for the root
   element of a GEDCOM 5.5 XML document.

Schema Versions and Download Locations

   The schema described in this document is version 0.1. It was originally
   written in RELAX NG's XML markup.
     * The XML version is located at
       [12]http://www.neomantic.com/gedcom55XML/0.1/gedcom55XML.rng.
     * A RELAX NG compact syntax version, produced using the [13]trang
       translator, is also being released. It is located at
       [14]http://www.neomantic.com/gedcom55XML/0.1/gedcom55XML.rnc. (The
       compact syntax version has not been tested.)
     * Both the XML and compact syntax 0.1 schema files can also be found
       bundled with this documentation at
       [15]http://www.neomantic.com/gedcom5.5XML/0.1/gedcom55XML-0.1.tar.g
       z.
     * For verification purposes, I have signed this tar, gzipped archive
       with my [16]gnupg public key located [17]here. The signature of
       gedcom55XML-0.1.tar.gz is located at
       [18]http://www.neomantic.com/gedcom55XML/0.1/gedcom55XML-0.1.tar.gz
       .sign.

Schema License

   The source code for both gedcom55XML.rng and gedcom55XML.rnc is
   released under the [19]GNU General Public License Version 2 (GPL). The
   full text of this license can be found in a file called "gpl-2.0" in
   gedcom55XML-0.1.tar.gz.

Schema Updates

   Hyperlinks to the most up-to-date version of the schema are located at
   [20]http://www.neomantic.com/gedcom55XML.

Schema Limitations

   Ideally, the GEDCOM 5.5 XML RELAX NG/Schematron schema in
   gedcom55XML.rng would completely mirror the GEDCOM 5.5 standard. There
   are, however, nine places where it does not fully replicate the
   standard. This is why the version number of gedcom55XML.rng/c is "0.1".
   When the schema captures 100 percent of the standard, the version
   number will be incremented to "1.0".
    1. The schema does not describe valid content of a GEDCOM 5.5
       DATE_VALUE. The possible combinations of DATE_VALUEs are simply too
       numerous to handle using Schematron alone. The schema only
       prescribes that the length of character data of a DATE_VALUE be
       between 1 and 35 characters.
    2. The schema does not describe all instances of EXACT_DATE data. It
       fails to describe the EXACT_DATE values of TRANSMISSION_DATE and
       CHANGE_DATE. When the Schematron rule is able to describe the
       EXACT_DATE value, it only uses the English MONTH abbreviations;
       i.e., JAN, FEB, MAR, etc..
    3. Normally, the regions in the PLAC tag are separated by commas
       (e.g., Kansas City, Jackson County, Missouri, USA) or,
       alternatively, the delimiter specified in the HEAD.PLAC.FORM tag.
       gedcom55XML.rng ignores the delimiter of PLAC data.
    4. The given name and surname of an individual are usually delimited
       in the NAME tag by forward slashes (e.g., Joseph/Smith/). The
       Schematron rules do not prescribe this format.
    5. GEDCOM 5.5 allows for user-defined tags, even though it discourages
       their usage. The schema does not allow user-defined tags.
    6. The schema does not describe the content of SOURCE_CITATION's EVEN
       tag's EVENT_TYPE_CITED_FROM, even though the GEDCOM 5.5 standard
       specifies a restricted set of values.
    7. It also does not describe the content of SOURCE_RECORD's EVEN tag's
       EVENTS_RECORDED, even though again the GEDCOM 5.5 standard
       specifies a restricted set of values.
    8. The AGE tag permits values such as CHILD, INFANT, STILLBORN, along
       with 'y','m','d' which respectively signify year, month, and day.
       Following the specification strictly, these values should be case
       insensitive. In gedcom55XML.rng, they are case sensitive,
       defaulting to the lowercase values.
    9. The MONTH values of DATE_EXACT should also be case insensitive, but
       in the current schema they are not. The defaults are the upper-case
       values.

   Most of these shortcoming are due to difficulties in describing what
   would be considered, in XML terms, mixed-content elements.

Using gedcom55XML.rng to Validate a GEDCOM 5.5 XML Document

   If a GEDCOM 5.5 file is converted into a GEDCOM 5.5 XML document, the
   RELAX NG/Schematron schema contained in gedcom55XML.rng provides a way
   to test the validity, as describe above, of the XML document.

   The instructions below describe how to do so using several external
   programs available on the internet. The instructions follow several
   conventions:
     * family.ged represents a GEDCOM 5.5 file.
     * family.xml represents the family.ged file translated into a GEDCOM
       5.5 XML document.
     * Text sandwiched between brackets [ ] indicates variables that
       depend upon your computer's environment.

   In rough outline, the process has two steps:
    1. Convert the GEDCOM 5.5 file to a GEDCOM 5.5 XML document.
    2. Validate the GEDCOM 5.5 XML document using gedcom55XML.rng.

Convert the GEDCOM 5.5 File to a GEDCOM 5.5 XML Document

   Currently there is no quick and easy way to convert a GEDCOM 5.5 (or a
   .ged) file into a GEDCOM 5.5 XML (or .xml) file. I know of only one way
   to accomplish it now, and it requires some programming skills. To
   perform the conversion, you will need to use a Java program released by
   Michael Kay which, used with his famous saxon XML parser, will perform
   the conversion.
    1. Download Kay's [21]source code and unzip it in a location of your
       choosing. Remember the path to this location. You will need to know
       this path in a later step, and it will be referred to using the
       variable [path-to-gedml-classes].
    2. Find the files called "GedcomParser.java" and "GedcomToXml.xsl" in
       Kay's source code.
    3. Compile the file GedcomParser.java using your favorite Java
       distribution's compiler - javac. This will produce a class file
       called "GedcomParser.class". The command is as follows:
javac GedcomParser.java

    4. Download Kay's [22]saxon parser, install it, and remember its
       location, which will be referred to below using the [path-to-saxon]
       variable. (It may already be installed on your system; on my Debian
       GNU/Linux system it was located at /usr/share/java/saxon.jar.)
    5. Convert family.ged to family.xml by issuing the following command
       in your terminal:
java -cp [path-to-saxon]/saxon.jar:[path-to-gedml-classes] com.icl.saxon.StyleSh
eet -x GedcomParser -o family.xml family.ged [path-to-gedml-classes]/GedcomToXml
.xsl

    6. The output of this command, family.xml, will be a XML version of
       family.ged. family.xml will be a near perfect reproduction of a
       GEDCOM 5.5 into GEDCOM 5.5 XML. The only problem with the output is
       that it fails to add the <TRLR/> element before the </GED> element
       at the end of the document. Kay's GedcomParser removes the <TRLR/>
       element, because the parser was intended for a version of GedML.

Validate the GEDCOM 5.5 XML Document

   Validating a XML file with a RELAX NG/Schematron schema is a
   complicated process. The complication arises because, to fully validate
   your GEDCOM 5.5 XML file using gedcom55XML.rng, a RELAX NG validator
   and a Schematron validator must be used. (The entire process is
   described [23]here). Two methods for validation are described below.

Method 1 - Simultaneously Validation

   One way of validating a GEDCOM 5.5 XML file with gedcom55XML.rng is by
   using [24]Topologi's [25]Open Source Schematron Java classes. Their zip
   file [26]here contains all the binaries and scripts needed to validate
   a XML file with Schematron rules embedded in RELAX NG patterns. To do
   so, type the following command in the directory called "Schematron"
   that is produced after unzipping the downloaded file above:
java -cp ./Saxon/saxon.jar:./Java/Schematron.jar:./Jing/jing.jar com.topologi.sc
hematron.EmbRNGValidator family.xml gedcom55XML.rng

   When this command detects errors in family.xml using the RELAX NG
   schema, it stops at the first error and indicates the line it is on.
   You will need to correct the XML markup of family.xml (and the
   corresponding tags in family.ged), and then run the command above again
   to find the remaining errors using the RELAX NG schema.

   After the command above reveals the error checked against the RELAX NG
   schema, it then proceeds to check the Schematron rules. When it finds
   an error, it will not tell you which line has a problem, but only the
   text of the problematic line.

   In either the case of RELAX NG errors or Schematron errors, the command
   does not tell you what exactly is wrong with the line. It is up to you
   to figure out why the line is incorrect by referring to the LDS's
   GEDCOM 5.5 standard.

Method 2 - Sequentially Validation

    1. Validate the GEDCOM 5.5 XML file using the RELAX NG portion of the
       schema.
       RELAX NG validation can be performed by a Open Source Java program
       called "jing". This command line application can be downloaded
       [27]here. jing [28]supposedly can validate a XML file against the
       Schematron rules embedded in the RELAX NG schema. However, I have
       been unable to get jing to recognized the Schematron patterns in
       gedcom55XML.rng.
       To perform the RELAX NG validation, type the following command
       line:
java -jar [path-to-jing]/jing.jar gedcom55XML.rng family.xml

       The results will pipe to the terminal's standard output. jing will
       report all the errors that it encounters. It will not report
       exactly what is wrong, but it will report what lines have errors in
       them. If jing reports no errors, your GEDCOM 5.5 XML document is
       valid, both in terms of its structure and most of its content.
    2. Perform the Schematron validation.
       Step one does not test the mixed-content inherent in the GEDCOM 5.5
       (and GEDCOM 5.5 XML) object model. To test the mixed-content
       elements, the XML document must be tested against the Schematron
       rules in gedcom55XML.rng. It takes two steps to do so.
         a. The Schematron rules must first be extracted from the RELAX NG
            schema using the style sheet, RNG2Schtrn.xsl, available
            [29]here. For the sake of convenience, I have extracted the
            Schematron rules from gedcom55XML.rng. They are in the file
            called "gedcom55XML.sch". I am distributing this source code
            under the GNU General Public License Version 2.
         b. The family.xml file must be passed through the Schematron
            rules. To do so, we will use Topologoi's Schematron Java
            classes again, the same ones used in Method 1 above. Go to the
            "Schematron" directory again and type the following command:
java -cp ./Saxon/saxon.jar:./Java/Schematron.jar:./Jing/jing.jar com.topologi.sc
hematron.SchtrnValidator family.xml gedcom55XML.sch

       If this command reveals any errors in family.xml, it will not tell
       you the line it occurs on, but it will indicate the data that fails
       to pass the Schematron rules.

Documentation License

   This document is released under the [30]GNU Free Documentation License
   Version 1.2. The full text of this license is found in the file called
   "fdl.txt" released with gedcom55XML-0.1.tar.gz. It can also be located
   at [31]http://www.neomantic.com/gedcom55XML/0.1/README.html.

Contact

   Please direct questions or requests for more information to
   <[32]chad@neomantic.com>. Corrections, suggestions, bug reports, and
   patches are welcome as well.
   5/1/2008

References

   1. http://homepages.rootsweb.com/~pmcbride/gedcom/55gctoc.htm
   2. http://www.w3.org/XML/
   3. http://xml.coverpages.org/genealogy.html
   4. http://lists.xml.org/archives/xml-dev/199804/msg00282.html
   5. http://homepage.ntlworld.com/michael.h.kay/gedml/
   6. http://users.breathe.com/mhkay/gedml/dtd.html
   7. http://www.softwarerenovation.com/igenie/GeniML.aspx
   8. http://www.softwarerenovation.com/igenie/GeniMLConverter.zip
   9. http://www.oasis-open.org/committees/relax-ng/spec-20011203.html
  10. http://www.schematron.com/
  11. http://relaxng.org/compact-20021121.html
  12. http://www.neomantic.com/gedcom55XML/0.1/gedcom55XML.rng
  13. http://www.thaiopensource.com/relaxng/trang.html
  14. http://www.neomantic.com/gedcom55XML/0.1/gedcom55XML.rnc
  15. http://www.neomantic.com/gedcom5.5XML/0.1/gedcom55XML-0.1.tar.gz
  16. file://localhost/home/calbers/works/src/git-managed/gedcom55xml/gnupg
  17. http://www.neomantic.com/gnupg/pubkey.asc
  18. http://www.neomantic.com/gedcom55XML/0.1/gedcom55XML-0.1.tar.gz.sign
  19. http://www.gnu.org/licenses/old-licenses/gpl-2.0.html
  20. http://www.neomantic.com/gedcom55XML
  21. http://homepage.ntlworld.com/michael.h.kay/gedml/
  22. http://saxon.sourceforge.net/
  23. http://www.xml.com/pub/a/2004/02/11/relaxtron.html?page=5
  24. http://www.topologi.com/
  25. http://www.topologi.com/public/
  26. http://www.topologi.com/public/Schematron.zip
  27. http://www.thaiopensource.com/relaxng/jing.html
  28. http://www.thaiopensource.com/relaxng/jing-other.html
  29. http://www.topologi.com/public/Schtrn_XSD/RNG2Schtrn.zip
  30. http://www.gnu.org/licenses/fdl.html
  31. http://www.neomantic.com/gedcom55XML/0.1/README.html
  32. mailto:chad@neomantic.com

About

RELAX NG/Schematron Schema for GEDCOM 5.5 XML

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published