Skip to content

getElementsByTagName() is O(N^2) #11308

Closed
@tstarling

Description

@tstarling

Description

DOMDocument::getElementsByTagName() and DOMElement::getElementsByTagName() return an iterator which does a linear time search for the current position, each time the position is moved to the next node. So completing the iteration of the node list takes O(N^2) time where N is the number of nodes in the list.

For example:

for ( $n = 1000; $n <= 16000; $n *= 2 ) {
	$doc = new DOMDocument;
	$root = $doc->createElement( 'root' );
	$doc->appendChild( $root );
	for ( $i = 0; $i < $n; $i++ ) {
		$root->appendChild( $doc->createElement( 'e' ) );
	}
	$t = -microtime( true );
	foreach ( $doc->getElementsByTagName( 'e' ) as $node );
	$t += microtime( true );
	print "$n\t$t\n";
}

resulted in this output:

1000    0.0031359195709229
2000    0.012067079544067
4000    0.050027132034302
8000    0.20300078392029
16000   0.76775693893433

It was likely faster prior to 3084e72. Since that commit in 2003, getElementsByTagName() is similar to the following pseudocode:

function getElementsByTagName( $root, $name ) {
	$i = 0;
	do {
		$node = findNextNode( $root, $name );
		for ( $j = 0; $node && $j < $i; $j++ ) {
			$node = findNextNode( $node, $name );
		}
		
		if ( $node ) {
			yield $node;
		}
		$i++;
	} while ( $node );
}

That commit was motivated by standards compliance. The standard says that the node list returned by getElementsByTagName() is live, that is, it must immediately reflect changes to the document. The standard has no concept of iteration, it only has an integer offset which can be passed to item(). The pseudocode above does correctly reproduce the quirks implied by the standard. For example:

$doc = new DOMDocument;
$doc->loadXML( '<root><e i="1"/><e i="2"/><e i="3"/><e i="4"/><e i="5"/><e i="6"/><e i="7"/><e i="8"/><e i="9"/><e i="10"/></root>' );
$root = $doc->documentElement;
foreach ( $doc->getElementsByTagName( 'e' ) as $node ) {
	print $node->getAttribute( 'i' ) . ' ';
	$root->removeChild( $node );
}
print "\n";

produces

1 3 5 7 9 

WebKit implements the standard efficiently by having an item cache and a length cache. The item cache allows efficient retrieval of an item which has an index close to the previously requested item. These caches are invalidated on tree mutation.

PHP has no concept of tree mutation event callbacks, so implementing this in PHP may be somewhat tedious.

PHP Version

PHP 8.2.6

Operating System

No response

Metadata

Metadata

Assignees

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions