Page MenuHomePhabricator

PHP's getElementsByTagName is slow prior to PHP 8.3
Open, Needs TriagePublic

Description

T317070 and T215000 both dealt with non-linear performance of getElementsByTagName in PHP DOM's implementation.

The issue was reported upstream on May 24, 2023 and fixed in the master branch by Niels Dossche nine days later.

As a temporary workaround while we wait for PHP 8.3 to be released and deployed, Parsoid's DOMCompat::getElementsByTagName may be used instead. For projects that don't have Parsoid as a dependency, an equivalent XPath query may be used as a workaround.

Code search reveals Flow, DonationsInterface, Translate, GWToolset, TextExtracts all use this method (https://codesearch.wmcloud.org/search/?q=getElementsByTagName&i=nope&files=php&excludeFiles=)

Event Timeline

Restricted Application added a subscriber: Aklapper. · View Herald Transcript

The uses in Translate are a maintenance script and Xliff file format handler which is currently unused: https://codesearch.wmcloud.org/search/?q=getElementsByTagName&i=nope&files=php&excludeFiles=&repos=Extension:Translate

What replacements are present in MediaWiki core version 1.37 and above?

Jdforrester-WMF subscribed.

The uses in Translate are a maintenance script and Xliff file format handler which is currently unused: https://codesearch.wmcloud.org/search/?q=getElementsByTagName&i=nope&files=php&excludeFiles=&repos=Extension:Translate

What replacements are present in MediaWiki core version 1.37 and above?

The mentioned Zest::find is present in 1.37 (indeed, since 1.35.0-wmf.20):

https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/vendor/+/refs/heads/REL1_37/wikimedia/zest-css/src/Zest.php#41

The easy fix is to replace $node->getElementsByTagName( $tag ) with Zest::getElementsByTagName( $node, $tag ) (which unlike find has identical signature).

Change 831201 had a related patch set uploaded (by Catrope; author: Catrope):

[mediawiki/core@master] VueComponentParser: Use Zest's getElementsByTagName() rather than PHP's

https://gerrit.wikimedia.org/r/831201

Change 831201 merged by jenkins-bot:

[mediawiki/core@master] VueComponentParser: Use Zest's getElementsByTagName() rather than PHP's

https://gerrit.wikimedia.org/r/831201

It looks like it may be a regression from 3084e72ef1053b8ead468e66baf55d5256e7e9af. getElementsByTagName() works by returning a special DOMNodeList which finds the next matching tag at each loop iteration. Before that commit, the search began at the previously found node. After that commit, it uses integer offsets. The search begins at the root node every time, iterating through the whole document until the stored number of elements has been passed.

Niels Dossche fixed it in #11330 and so the issue will be fully rectified once we migrate to PHP 8.3. I will update the task description accordingly.

tstarling renamed this task from Migrate code away from use of PHPDOM's getElementsByTagName which has non-linear perf characteristics to PHP's getElementsByTagName is slow prior to PHP 8.3.Jun 3 2023, 5:19 AM
tstarling updated the task description. (Show Details)

Note that the fix only works when iterating through the results without modifying the DOM. If any DOM modifications are made during iteration, it reverts back to the old n² algorithm.

The fix is to use iterator_to_array() first, or otherwise convert the live collection to a normal array, and iterate over the result of that. (This is the same as how things work in JavaScript in web browsers, where you'll see Array.from() used in the same way.)

The fix is to use iterator_to_array() first. […] This is the same as how things work in JavaScript in web browsers, where you'll see Array.from() used in the same way.

Good catch. I believe these tend to be used via an abstraction layer in JavaScript, though. For us, that's generally jQuery or document.querySelectorAll. Both of these return non-live collections that are thus already "fixed". Use of live collections as returned by getElementsByClassName or getElementsByTagName is fairly rare in "normal" JavaScript code, but may be used by libraries (such as jQuery), which indeed then iterate it before returning to avoid triggering expensive property accessors.

For PHP, I guess Zest counts as our abstraction layer. Although it's a bit more common for php-dom to be used directly, which, given that PHP lacks a native querySelector, means use of getElementsByTagName is very likely in practice.

Change 937477 had a related patch set uploaded (by Reedy; author: Catrope):

[mediawiki/core@REL1_39] VueComponentParser: Use Zest's getElementsByTagName() rather than PHP's

https://gerrit.wikimedia.org/r/937477

Change 937477 merged by jenkins-bot:

[mediawiki/core@REL1_39] VueComponentParser: Use Zest's getElementsByTagName() rather than PHP's

https://gerrit.wikimedia.org/r/937477