Nitin Motgi edited this page Nov 7, 2017 · 2 revisions

XML Directive

This page describes the XML directives that Dataprep provides for working with complex XML documents.

Parse XML into Document.

The PARSE-XML-TO-DOCUMENT directive converts an XML string or XML byte array into an XML Document.

Syntax

parse-xml-to-document :<column>

Usage Notes

The PARSE-XML-TO-DOCUMENT directive transforms an XML string into a Document object. This is equivalent to the following Java code.

  DocumentBuilderFactory builderFactory = DocumentBuilderFactory.newInstance();
  builderFactory.setNamespaceAware(true);
  DocumentBuilder builder = builderFactory.newDocumentBuilder();
  ...
  Document xmlDocument = builder.parse(<column value>);
  ...

This directive replaces the value of :<column>, converting it from a string to an XML Document. If there are any issues parsing the XML, an error is thrown and processing terminates.
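
The behaviour described above can be sketched in plain Java as follows. This is a minimal illustration, not the directive's actual implementation: the class name ParseXmlSketch and the helper parse are hypothetical, and wrapping the parser exceptions in an IllegalArgumentException is an assumption made to keep the example short.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class ParseXmlSketch {
  // Hypothetical helper: parses an XML string into a namespace-aware DOM Document.
  static Document parse(String xml) {
    try {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      factory.setNamespaceAware(true);
      DocumentBuilder builder = factory.newDocumentBuilder();
      // A String must be wrapped in an InputSource; parse(String) expects a URI.
      return builder.parse(new InputSource(new StringReader(xml)));
    } catch (ParserConfigurationException | SAXException | IOException e) {
      // Assumption: surface any parsing problem as a single unchecked error.
      throw new IllegalArgumentException("Unable to parse XML: " + e.getMessage(), e);
    }
  }

  public static void main(String[] args) {
    Document doc = parse("<Employees><Employee emplid=\"1111\"/></Employees>");
    System.out.println(doc.getDocumentElement().getNodeName());
    try {
      parse("<Employees><Employee></Employees>"); // mismatched tags
    } catch (IllegalArgumentException e) {
      System.out.println("Parsing failed, processing would stop here.");
    }
  }
}
```

Malformed input makes the parser throw, which mirrors the directive terminating processing when the XML cannot be parsed.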

Extract XPath from the Document.

The EXTRACT-XPATH directive extracts an XML node using an XPath expression.

Syntax

extract-xpath '<xpath>' :<source column> :<destination column>

Usage Notes

This directive extracts the node value using XPath. The XPath is applied to the :<source column> and the result is stored in :<destination column>. The directive can be applied either directly to XML held as a string or to an XML Document generated by the PARSE-XML-TO-DOCUMENT directive.
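
As a rough Java analogy for the two accepted input types, the standard javax.xml.xpath API can evaluate the same expression against raw XML text or against an already parsed Document. The class XPathSourceSketch and its two helpers are hypothetical names; how the directive handles each input type internally is not stated here, so treat this purely as a sketch.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class XPathSourceSketch {
  // Hypothetical helper: evaluates the XPath directly against XML text.
  static String fromString(String expression, String xml) {
    try {
      return XPathFactory.newInstance().newXPath()
          .evaluate(expression, new InputSource(new StringReader(xml)));
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  // Hypothetical helper: parses the XML first, then evaluates the XPath on the Document.
  static String fromDocument(String expression, String xml) {
    try {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      factory.setNamespaceAware(true);
      Document doc = factory.newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));
      return XPathFactory.newInstance().newXPath().evaluate(expression, doc);
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    String xml = "<Employees><Employee><lastname>Watson</lastname></Employee></Employees>";
    String path = "/Employees/Employee/lastname";
    // Both calls return the text content of the selected node.
    System.out.println(fromString(path, xml));
    System.out.println(fromDocument(path, xml));
  }
}
```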

The following example shows an XML document, followed by the XPath expressions that are valid for it and the different ways its nodes can be extracted.

<?xml version="1.0"?>
<Employees>
	<Employee emplid="1111" type="admin">
		<firstname>John</firstname>
		<lastname>Watson</lastname>
		<age>30</age>
		<email>johnwatson@sh.com</email>
	</Employee>
	<Employee emplid="2222" type="admin">
		<firstname>Sherlock</firstname>
		<lastname>Holmes</lastname>
		<age>32</age>
		<email>sherlock@sh.com</email>
	</Employee>
	<Employee emplid="3333" type="user">
		<firstname>Jim</firstname>
		<lastname>Moriarty</lastname>
		<age>52</age>
		<email>jim@sh.com</email>
	</Employee>
	<Employee emplid="4444" type="user">
		<firstname>Mycroft</firstname>
		<lastname>Holmes</lastname>
		<age>41</age>
		<email>mycroft@sh.com</email>
	</Employee>
</Employees>

Expression            Description
nodename              Selects all nodes with the name “nodename”
/                     Selects from the root node
//                    Selects nodes anywhere in the document, starting from the current node, that match the selection
.                     Selects the current node
..                    Selects the parent of the current node
@                     Selects attributes
Employee              Selects all nodes with the name “Employee”
Employees/Employee    Selects all Employee elements that are children of Employees
//Employee            Selects all Employee elements, no matter where they are in the document

XPath names are case-sensitive, so the expressions use the same capitalization as the sample XML.
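
The path expressions in the table can be tried out with the standard javax.xml.xpath API. The sketch below is illustrative only: XPathPathsSketch and count are hypothetical names, and the XML is an abridged copy of the sample document that keeps only firstname for each employee.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathPathsSketch {
  // Abridged copy of the sample document (only firstname is kept per employee).
  static final String XML =
      "<Employees>"
      + "<Employee emplid=\"1111\" type=\"admin\"><firstname>John</firstname></Employee>"
      + "<Employee emplid=\"2222\" type=\"admin\"><firstname>Sherlock</firstname></Employee>"
      + "<Employee emplid=\"3333\" type=\"user\"><firstname>Jim</firstname></Employee>"
      + "<Employee emplid=\"4444\" type=\"user\"><firstname>Mycroft</firstname></Employee>"
      + "</Employees>";

  // Hypothetical helper: returns how many nodes the expression selects.
  static int count(String expression) {
    try {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      factory.setNamespaceAware(true);
      Document doc = factory.newDocumentBuilder()
          .parse(new InputSource(new StringReader(XML)));
      NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
          .evaluate(expression, doc, XPathConstants.NODESET);
      return nodes.getLength();
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(count("//Employee"));          // 4: every Employee element
    System.out.println(count("Employees/Employee"));  // 4: children of Employees
    System.out.println(count("//Employee/@type"));    // 4: one type attribute each
    System.out.println(count("//firstname/.."));      // 4: parents of firstname
    System.out.println(count("//employee"));          // 0: names are case-sensitive
  }
}
```

The last call returns no nodes because XPath matches element names case-sensitively, so expressions must follow the capitalization used in the XML.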

The expressions listed below use predicates. Predicates are written in square brackets [ … ] and are used to find a specific node or a node that contains a specific value.

Path Expression Result

  • /Employees/Employee[1] Selects the first Employee element that is a child of the Employees element.
  • /Employees/Employee[last()] Selects the last Employee element that is a child of the Employees element.
  • /Employees/Employee[last()-1] Selects the second-to-last Employee element that is a child of the Employees element.
  • //Employee[@type='admin'] Selects all Employee elements that have an attribute named type with a value of ‘admin’.