| 1 | +# Querying HTML and XML Documents |
| 2 | + |
| 3 | +`Zend\Dom\Query` provides mechanisms for querying XML and HTML documents |
| 4 | +utilizing either XPath or CSS selectors. It was developed to aid with functional |
| 5 | +testing of MVC applications, but could also be used for development of screen |
| 6 | +scrapers. |
| 7 | + |
| 8 | +CSS selector notation is provided as a simpler and more familiar notation for |
| 9 | +web developers to utilize when querying documents with XML structures. The |
| 10 | +notation should be familiar to anybody who has developed Cascading Style Sheets |
| 11 | +or who utilizes JavaScript toolkits that provide functionality for selecting |
| 12 | +nodes utilizing CSS selectors. [Prototype's $$()](http://prototypejs.org/api/utility/dollar-dollar), |
| 13 | +[Dojo's dojo.query](http://api.dojotoolkit.org/jsdoc/dojo/HEAD/dojo.query), and |
| 14 | +[jQuery](https://jquery.com) were all inspirations for the component. |
| 15 | + |
| 16 | +## Theory of Operation |
| 17 | + |
| 18 | +To use `Zend\Dom\Query`, instantiate a `Zend\Dom\Query` object, optionally |
| 19 | +passing it a document to query (as a string). Once you have a document, you can |
| 20 | +call either `execute()` or `queryXpath()`; each method returns a |
| 21 | +`Zend\Dom\NodeList` object containing any matching nodes. |
| 22 | + |
| 23 | +The primary difference between `Zend\Dom\Query` and using |
| 24 | +[DOMDocument](http://php.net/domdocument) + [DOMXPath](http://php.net/domxpath) |
| 25 | +is the ability to select against CSS selectors. You can utilize any of the |
| 26 | +following, in any combination: |
| 27 | + |
| 28 | +- **element types**: provide an element type to match: `div`, `a`, `span`, `h2`, etc. |
| 29 | +- **style attributes**: CSS class names to match: `.error`, `div.error`, |
| 30 | + `label.required`, etc. If an element declares more than one class, this will |
| 31 | + match as long as the named class is present anywhere in the `class` attribute. |
| 32 | +- **id attributes**: element ID attributes to match: `#content`, `div#nav`, etc. |
| 33 | +- **arbitrary attributes**: arbitrary element attributes to match. Three |
| 34 | + different types of matching are provided: |
| 35 | + - **exact match**: the attribute *exactly* matches the specified string. |
| 36 | + `div[bar="baz"]` would match a `div` element with a `bar` attribute that |
| 37 | + exactly matches the value `baz`. |
| 38 | + - **word match**: the attribute contains a *word* matching the string. |
| 39 | + `div[bar~="baz"]` would match a `div` element with a `bar` attribute that |
| 40 | + contains the word `baz`. `<div bar="foo baz">` would match, but |
| 41 | + `<div bar="foo bazbat">` would not. |
| 42 | + - **substring match**: the attribute contains the string specified, whether or |
| 43 | + not it is a complete word. `div[bar*="baz"]` would match a `div` element |
| 44 | + with a `bar` attribute that contains the string `baz` anywhere within it. |
| 45 | +- **direct descendants**: utilize `>` between selectors to denote direct |
| 46 | + descendants. `div > span` would select only `span` elements that are direct |
| 47 | + descendants of a `div`. Can also be used with any of the selectors above. |
| 48 | +- **descendants**: string together multiple selectors to indicate a hierarchy along which to search. |
| 49 | + `div .foo span #one` would select an element with the id `one` that is a descendant |
| 50 | + of arbitrary depth beneath a `span` element, which is in turn a descendant of |
| 51 | + arbitrary depth beneath an element with a class of `foo`, which is itself a |
| 52 | + descendant of arbitrary depth beneath a `div` element. For example, it would |
| 53 | + match the link with the text 'One' in the listing below: |
| 54 | + |
| 55 | + ```html |
| 56 | + <div> |
| 57 | + <table> |
| 58 | + <tr> |
| 59 | + <td class="foo"> |
| 60 | + <div> |
| 61 | + Lorem ipsum <span class="bar"> |
| 62 | + <a href="/foo/bar" id="one">One</a> |
| 63 | + <a href="/foo/baz" id="two">Two</a> |
| 64 | + <a href="/foo/bat" id="three">Three</a> |
| 65 | + <a href="/foo/bla" id="four">Four</a> |
| 66 | + </span> |
| 67 | + </div> |
| 68 | + </td> |
| 69 | + </tr> |
| 70 | + </table> |
| 71 | + </div> |
| 72 | + ``` |
| 73 | + |
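| | +To make the class and descendant matching above concrete, the sketch below runs |
| | +a hand-written XPath expression through PHP's bundled `DOMDocument` and |
| | +`DOMXPath` classes. The expression is an illustrative translation of |
| | +`div .foo span #one` written for this example; it is an assumption, and not |
| | +necessarily the XPath that `Zend\Dom\Query` generates internally. |
| | + |
| | +```php |
| | +<?php |
| | +// Trimmed copy of the listing above. |
| | +$html = '<div><table><tr><td class="foo"><div>Lorem ipsum <span class="bar">' |
| | +    . '<a href="/foo/bar" id="one">One</a>' |
| | +    . '<a href="/foo/baz" id="two">Two</a>' |
| | +    . '</span></div></td></tr></table></div>'; |
| | + |
| | +$doc = new DOMDocument(); |
| | +$doc->loadHTML($html); |
| | +$xpath = new DOMXPath($doc); |
| | + |
| | +// Hand-written approximation of the CSS selector `div .foo span #one`: |
| | +// `//` walks descendants of arbitrary depth, and the concat()/contains() |
| | +// idiom matches `foo` as a whole word inside the class attribute. |
| | +$expression = "//div//*[contains(concat(' ', normalize-space(@class), ' '), ' foo ')]" |
| | +    . "//span//*[@id='one']"; |
| | + |
| | +$matches = $xpath->query($expression); // DOMNodeList holding the 'One' link |
| | +``` |
| | + |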
| 74 | +Once you've performed your query, you can then work with the result object to |
| 75 | +determine information about the nodes, as well as to pull them and/or their |
| 76 | +content directly for examination and manipulation. `Zend\Dom\NodeList` |
| 77 | +implements `Countable` and `Iterator`, and stores the results internally as a |
| 78 | +[DOMDocument](http://php.net/domdocument) and [DOMNodeList](http://php.net/domnodelist). |
| 79 | + |
| 80 | +As an example, consider the following call, which selects against the HTML above: |
| 81 | + |
| 82 | +```php |
| 83 | +use Zend\Dom\Query; |
| 84 | + |
| 85 | +$dom = new Query($html); |
| 86 | +$results = $dom->execute('.foo .bar a'); |
| 87 | + |
| 88 | +$count = count($results); // get number of matches: 4 |
| 89 | +foreach ($results as $result) { |
| 90 | +    // $result is a DOMElement |
| 91 | +} |
| 92 | +``` |
| 93 | + |
| 94 | +`Zend\Dom\Query` also allows straight XPath queries utilizing the `queryXpath()` |
| 95 | +method; you can pass any valid XPath query to this method, and it will return a |
| 96 | +`Zend\Dom\NodeList` object. |
| 97 | + |
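| | +As a minimal sketch of the XPath route, the example below collects the `href` |
| | +of every link inside a `span` with the class `bar`. The sample markup, the XPath |
| | +expression, and the Composer autoloader path are assumptions made for this |
| | +example. |
| | + |
| | +```php |
| | +<?php |
| | +// Assumption: zend-dom was installed with Composer. |
| | +require 'vendor/autoload.php'; |
| | + |
| | +use Zend\Dom\Query; |
| | + |
| | +$html = '<span class="bar"><a href="/foo/bar" id="one">One</a>' |
| | +    . '<a href="/foo/baz" id="two">Two</a></span>'; |
| | + |
| | +$dom = new Query($html); |
| | + |
| | +// Every <a> nested anywhere inside a <span class="bar">. |
| | +$results = $dom->queryXpath('//span[@class="bar"]//a'); |
| | + |
| | +$hrefs = []; |
| | +foreach ($results as $result) { |
| | +    // $result is a DOMElement, so the standard DOM API applies. |
| | +    $hrefs[] = $result->getAttribute('href'); |
| | +} |
| | +``` |
| | + |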
| 98 | +## Methods Available |
| 99 | + |
| 100 | +Below is a listing of methods available in the various classes exposed by |
| 101 | +zend-dom. |
| 102 | + |
| 103 | +### Zend\\Dom\\Query |
| 104 | + |
| 105 | +The following methods are available to `Zend\Dom\Query`: |
| 106 | + |
| 107 | +- `setDocumentXml($document, $encoding = null)`: specify an XML string to query against. |
| 108 | +- `setDocumentXhtml($document, $encoding = null)`: specify an XHTML string to query against. |
| 109 | +- `setDocumentHtml($document, $encoding = null)`: specify an HTML string to query against. |
| 110 | +- `setDocument($document, $encoding = null)`: specify a string to query against; |
| 111 | + `Zend\Dom\Query` will then attempt to autodetect the document type. |
| 112 | +- `setEncoding($encoding)`: specify an encoding string to use. This encoding |
| 113 | + will be passed to [DOMDocument's constructor](http://php.net/domdocument.construct) |
| 114 | + if specified. |
| 115 | +- `getDocument()`: retrieve the original document string provided to the object. |
| 116 | +- `getDocumentType()`: retrieve the document type of the document provided to |
| 117 | + the object; will be one of the `DOC_XML`, `DOC_XHTML`, or `DOC_HTML` class |
| 118 | + constants. |
| 119 | +- `getEncoding()`: retrieves the specified encoding. |
| 120 | +- `execute($query)`: query the document using CSS selector notation. |
| 121 | +- `queryXpath($xPathQuery)`: query the document using XPath notation. |
| 122 | + |
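| | +The setters and getters above can be combined as in the following sketch, which |
| | +declares the markup as HTML explicitly rather than relying on autodetection. |
| | +The sample markup, the `UTF-8` encoding, and the Composer autoloader path are |
| | +assumptions made for this example. |
| | + |
| | +```php |
| | +<?php |
| | +// Assumption: zend-dom was installed with Composer. |
| | +require 'vendor/autoload.php'; |
| | + |
| | +use Zend\Dom\Query; |
| | + |
| | +$dom = new Query(); |
| | +$dom->setDocumentHtml('<p class="error">Oops</p>', 'UTF-8'); |
| | + |
| | +$type     = $dom->getDocumentType(); // one of the DOC_* class constants |
| | +$encoding = $dom->getEncoding();     // the encoding passed above |
| | +$results  = $dom->execute('p.error'); // CSS selector query |
| | +``` |
| | + |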
| 123 | +### Zend\\Dom\\NodeList |
| 124 | + |
| 125 | +As mentioned previously, `Zend\Dom\NodeList` implements both `Iterator` and |
| 126 | +`Countable`, and as such can be used in a `foreach()` loop as well as with the |
| 127 | +`count()` function. Additionally, it exposes the following methods: |
| 128 | + |
| 129 | +- `getCssQuery()`: return the CSS selector query used to produce the result (if |
| 130 | + any). |
| 131 | +- `getXpathQuery()`: return the XPath query used to produce the result. |
| 132 | + Internally, `Zend\Dom\Query` converts CSS selector queries to XPath, so this |
| 133 | + value will always be populated. |
| 134 | +- `getDocument()`: retrieve the DOMDocument the selection was made against. |
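| | + |
| | +A short sketch of these accessors follows. The sample markup and the Composer |
| | +autoloader path are assumptions made for this example, and the exact XPath |
| | +string returned is left unspecified because it depends on the library's |
| | +internal CSS-to-XPath conversion. |
| | + |
| | +```php |
| | +<?php |
| | +// Assumption: zend-dom was installed with Composer. |
| | +require 'vendor/autoload.php'; |
| | + |
| | +use Zend\Dom\Query; |
| | + |
| | +$dom = new Query('<div id="nav"><a href="/">Home</a></div>'); |
| | +$results = $dom->execute('div#nav a'); |
| | + |
| | +$css   = $results->getCssQuery();   // the selector passed to execute() |
| | +$xpath = $results->getXpathQuery(); // XPath produced from that selector |
| | +$doc   = $results->getDocument();   // DOMDocument the query ran against |
| | +``` |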