
Commit e5813df

[Data liberation] wp_rewrite_urls() (#1893)
Prototypes a `wp_rewrite_urls()` URL rewriter for block markup that migrates content from, say, `<a href="https://adamadam.blog">` to `<a href="https://adamziel.com/blog">`.

* URL rewriting now works more thoroughly than it perhaps ever has in WordPress migrations.
* The URL parser requires PHP 8.1. This is fine for some Playground applications, but we'll need PHP 7.2+ compatibility to get it into WordPress core.
* This PR bundles copies of `WP_HTML_Tag_Processor` and `WP_HTML_Processor` to enable usage outside of WordPress core.

### Details

This PR consists of code ported from https://github.com/adamziel/site-transfer-protocol. It uses a cascade of parsers to pierce through the structured data in a WordPress post and replace the URLs matching the requested domain.

The data flow is as follows:

Parse HTML -> Parse block comments -> Parse attributes JSON -> Parse URLs

On a high level, this parsing cascade is handled by the `WP_Block_Markup_Url_Processor` class:

```php
$p = new WP_Block_Markup_Url_Processor( $block_markup, $base_url );
while ( $p->next_url() ) {
	$parsed_matched_url = $p->get_parsed_url();
	// .. do processing
	$p->set_raw_url($new_raw_url);
}
```

In more detail, `WP_Block_Markup_Url_Processor` extends the `WP_HTML_Tag_Processor` class and walks the block markup token by token. It then drills down into:

* Text nodes, where it matches URLs using regexps. This part can be improved to avoid regular expressions.
* Block comments, where it parses the block attributes and iterates through them, looking for ones that contain valid URLs.
* HTML tag attributes, where it inspects the ones reserved for URLs (such as `<a href="">`), looking for ones that contain valid URLs.

The `next_url()` method moves through the stream of tokens, looking for the next match in one of the above contexts, and `set_raw_url()` knows how to update each node type, e.g. updated block attributes are `json_encode()`-d.

### Processing tricky inputs

When this code is fed into the migrator:

```html
<!-- wp:paragraph -->
<p>
	<!-- Inline URLs are migrated -->
	🚀-science.com/science has the best scientific articles on the internet! We're also available via the punycode URL:
	<!-- No problem handling HTML-encoded punycode URLs with urlencoded characters in the path -->
	&#104;ttps://xn---&#115;&#99;ience-7f85g.com/%73%63ience/.
	<!-- Correctly ignores similar–but–different URLs -->
	This isn't migrated: https://🚀-science.comcast/science <br>
	Or this: super-🚀-science.com/science
</p>
<!-- /wp:paragraph -->
<!-- Block attributes are migrated without any issue -->
<!-- wp:image {"src": "https:\/\/\ud83d\ude80-\u0073\u0063ience.com/%73%63ience/wp-content/image.png"} -->
<!-- As are URI HTML attributes -->
<img src="&#104;ttps://xn---&#115;&#99;ience-7f85g.com/science/wp-content/image.png">
<!-- /wp:image -->
<!-- Classes are not migrated. -->
<span class="https://🚀-science.com/science"></span>
```

This is the actual output:

```html
<!-- wp:paragraph -->
<p>
	<!-- Inline URLs are migrated -->
	science.wordpress.com has the best scientific articles on the internet! We're also available via the punycode URL:
	<!-- No problem handling HTML-encoded punycode URLs with urlencoded characters in the path -->
	https://science.wordpress.com/.
	<!-- Correctly ignores similar–but–different URLs -->
	This isn't migrated: https://🚀-science.comcast/science <br>
	Or this: super-🚀-science.com/science
</p>
<!-- /wp:paragraph -->
<!-- Block attributes are migrated without any issue -->
<!-- wp:image {"src":"https:\/\/science.wordpress.com\/wp-content\/image.png"} -->
<!-- As are URI HTML attributes -->
<img src="https://science.wordpress.com/wp-content/image.png">
<!-- /wp:image -->
<!-- Classes are not migrated. -->
<span class="https://🚀-science.com/science"></span>
```

## Remaining work

- [x] Add PHPCBF
- [x] Get to zero CBF errors
- [x] Get the unit tests to run in CI (e.g. run `composer install`)
- [x] Add relevant unit tests coverage

## Follow-up work

- [x] Patch `WP_HTML_Tag_Processor` in WordPress core, see WordPress/wordpress-develop#7007 (comment)
- [ ] Package our copy of `WP_HTML_Tag_Processor` as a "WordPress polyfill" for standalone usage.
- [ ] Make it compatible with PHP 7.2+

## Testing Instructions (or ideally a Blueprint)

CI runs the PHP unit tests. To run them on your local machine:

```sh
cd packages/playground/data-liberation
composer install
cd ../../../
nx test:watch playground-data-liberation
```
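The block-comment step of the cascade described under Details (parse the comment, decode the attributes JSON, rewrite matching URLs, then `json_encode()` the result) can be illustrated with a minimal, self-contained sketch. The function name `sketch_rewrite_block_comment()` is hypothetical and not part of this PR, and the plain prefix comparison is a deliberate simplification: the PR itself parses each URL, which is what lets it reject look-alike domains.

```php
<?php
/**
 * Hypothetical helper (not part of the PR): rewrites URLs found in the JSON
 * attributes of a single block comment such as `<!-- wp:image {...} -->`.
 */
function sketch_rewrite_block_comment( string $comment, string $from, string $to ): string {
	// Capture the block name and the JSON attributes object.
	if ( ! preg_match( '/^<!--\s+wp:(\S+)\s+(\{.*\})\s+-->$/s', $comment, $m ) ) {
		return $comment; // Not a block comment with attributes; leave as-is.
	}

	$attributes = json_decode( $m[2], true );
	if ( ! is_array( $attributes ) ) {
		return $comment; // Malformed JSON; leave as-is.
	}

	// Visit every string leaf, however deeply nested the attributes are.
	array_walk_recursive(
		$attributes,
		function ( &$value ) use ( $from, $to ) {
			// Simplification: naive prefix match instead of real URL parsing.
			if ( is_string( $value ) && 0 === strpos( $value, $from ) ) {
				$value = $to . substr( $value, strlen( $from ) );
			}
		}
	);

	// Re-encode the attributes; json_encode() escapes "/" as "\/" by default.
	return '<!-- wp:' . $m[1] . ' ' . json_encode( $attributes ) . ' -->';
}

$in  = '<!-- wp:image {"src":"https:\/\/adamadam.blog\/wp-content\/image.png"} -->';
$out = sketch_rewrite_block_comment( $in, 'https://adamadam.blog', 'https://adamziel.com/blog' );
echo $out, "\n";
// prints: <!-- wp:image {"src":"https:\/\/adamziel.com\/blog\/wp-content\/image.png"} -->
```

Decoding and re-encoding the JSON, rather than running a string replace over the raw comment, is what lets escaped forms such as `https:\/\/` be matched at all.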
1 parent 9b28cb4 commit e5813df

41 files changed: +32925 -2 lines changed

.github/workflows/ci.yml

Lines changed: 25 additions & 0 deletions
@@ -23,6 +23,31 @@ jobs:
             - uses: ./.github/actions/prepare-playground
             - run: npx nx affected --target=lint
             - run: npx nx affected --target=typecheck
+    lint-and-test-php:
+        name: 'Lint and test PHP'
+        runs-on: ubuntu-latest
+        steps:
+            - uses: actions/checkout@v4
+              with:
+                  submodules: true
+            - uses: actions/setup-node@v4
+              with:
+                  node-version: 20
+            - run: npm install
+            - name: Set up PHP
+              uses: shivammathur/setup-php@v2
+              with:
+                  # @TODO: Running the tests on PHP 7.2
+                  php-version: '8.1'
+                  tools: phpunit-polyfills
+            - name: Install Composer dependencies
+              uses: ramsey/composer-install@v3
+              with:
+                  ignore-cache: 'yes'
+                  composer-options: '--optimize-autoloader'
+                  working-directory: 'packages/playground/data-liberation'
+            - run: npx nx run playground-data-liberation:lint:php
+            - run: npx nx run playground-data-liberation:test:phpunit
     test-unit-asyncify:
         runs-on: ubuntu-latest
         needs: [lint-and-typecheck]

.gitignore

Lines changed: 2 additions & 0 deletions
@@ -13,6 +13,8 @@ packages/docs/site/src/model.json
 .docusaurus
 dist.zip
 rollup.d.ts
+.phpunit.cache
+packages/playground/data-liberation/vendor
 
 # dependencies
 node_modules

.gitmodules

Lines changed: 4 additions & 0 deletions
@@ -1,3 +1,7 @@
 [submodule "isomorphic-git"]
 	path="isomorphic-git"
 	url=https://github.com/adamziel/isomorphic-git.git
+[submodule "wp-html-api"]
+	path="wp-html-api"
+	url=https://github.com/WordPress/wordpress-develop
+

.vscode/settings.json

Lines changed: 2 additions & 1 deletion
@@ -33,5 +33,6 @@
 	"C_Cpp.errorSquiggles": "disabled",
 	"git.branchProtection": [
 		"trunk"
-	]
+	],
+	"php.version": "7.2"
 }
Lines changed: 41 additions & 0 deletions
@@ -0,0 +1,41 @@
+<?php
+/**
+ * This script regenerates the public suffix list from the publicsuffix.org website.
+ */
+
+$suffixes = file_get_contents('https://publicsuffix.org/list/public_suffix_list.dat');
+$lines = explode("\n", $suffixes);
+$tlds = array();
+foreach ($lines as $line) {
+	if ( empty( $line ) || $line[0] === '/' ) {
+		continue;
+	}
+	if ( strpos( $line, '.' ) !== false ) {
+		continue;
+	}
+	$tlds[] = $line;
+}
+
+
+$php_file_path = __DIR__ . '/../src/public_suffix_list.php';
+
+$new_php_file_path = $php_file_path.'.swp';
+$fp = fopen($new_php_file_path, 'w');
+fwrite($fp, "<?php\n\n");
+fwrite($fp, "/**");
+fwrite($fp, "\n * Public suffix list for detecting URLs with known domains within text.");
+fwrite($fp, "\n * This file is automatically generated by regenerate_public_suffix_list.php.");
+fwrite($fp, "\n * Do not edit it directly.");
+fwrite($fp, "\n * @TODO: Process wildcards and exceptions, not just raw TLDs.");
+fwrite($fp, "\n */\n\n");
+fwrite($fp, "return array(\n");
+foreach($tlds as $tld) {
+	fwrite($fp, "\t'".$tld."' => 1,\n");
+}
+
+fwrite($fp, ");\n");
+
+if(file_exists($php_file_path)) {
+	unlink($php_file_path);
+}
+rename($new_php_file_path, $php_file_path);
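The script above writes the list as an array keyed by TLD (`'com' => 1`), which suggests it is meant for constant-time `isset()` lookups. A minimal sketch of such a lookup follows; the helper name `sketch_has_known_tld()` and the three-entry stand-in list are assumptions made for illustration, not code from this commit.

```php
<?php
// Stand-in for the generated file, which returns `array( 'tld' => 1, ... )`.
// Only three illustrative entries are listed here.
$public_suffixes = array(
	'com'  => 1,
	'org'  => 1,
	'blog' => 1,
);

/**
 * Hypothetical helper: true when the last label of $host is a listed TLD.
 */
function sketch_has_known_tld( string $host, array $suffixes ): bool {
	$last_dot = strrpos( $host, '.' );
	if ( false === $last_dot ) {
		return false; // A bare label such as "localhost" has no TLD part.
	}
	$tld = strtolower( substr( $host, $last_dot + 1 ) );
	// The list is keyed by TLD, so membership is a single hash lookup.
	return isset( $suffixes[ $tld ] );
}

var_dump( sketch_has_known_tld( 'adamadam.blog', $public_suffixes ) );   // bool(true)
var_dump( sketch_has_known_tld( 'example.notatld', $public_suffixes ) ); // bool(false)
```

As the generated file's own `@TODO` notes, wildcard and exception rules from the public suffix list are not handled, so a lookup like this only covers single-label suffixes.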
Lines changed: 98 additions & 0 deletions
@@ -0,0 +1,98 @@
+<?php
+
+require_once __DIR__ . "/../bootstrap.php";
+
+if ( $argc < 2 ) {
+	echo "Usage: php script.php <command> --file <input-file> --current-site-url <current site url> --new-site-url <target url>\n";
+	echo "Commands:\n";
+	echo " list_urls: List all the URLs found in the input file.\n";
+	echo " migrate_urls: Migrate all the URLs found in the input file from the current site to the target site.\n";
+	exit( 1 );
+}
+
+$command = $argv[1];
+$options = [];
+
+for ( $i = 2; $i < $argc; $i ++ ) {
+	if ( str_starts_with( $argv[ $i ], '--' ) && isset( $argv[ $i + 1 ] ) ) {
+		$options[ substr( $argv[ $i ], 2 ) ] = $argv[ $i + 1 ];
+		$i ++;
+	}
+}
+
+if ( ! isset( $options['file'] ) ) {
+	echo "The file option is required.\n";
+	exit( 1 );
+}
+
+$inputFile = $options['file'];
+if ( ! file_exists( $inputFile ) ) {
+	echo "The file $inputFile does not exist.\n";
+	exit( 1 );
+}
+$block_markup = file_get_contents( $inputFile );
+
+// @TODO: Decide – should the current site URL be always required to
+// populate $base_url?
+$base_url = $options['current-site-url'] ?? 'https://playground.internal';
+$p = new WP_Block_Markup_Url_Processor( $block_markup, $base_url );
+
+switch ( $command ) {
+	case 'list_urls':
+		echo "URLs found in the markup:\n\n";
+		wp_list_urls_in_block_markup( [ 'block_markup' => $block_markup, 'base_url' => $base_url ]);
+		echo "\n";
+		break;
+	case 'migrate_urls':
+		if ( ! isset( $options['current-site-url'] ) ) {
+			echo "The --current-site-url option is required for the migrate_urls command.\n";
+			exit( 1 );
+		}
+		if ( ! isset( $options['new-site-url'] ) ) {
+			echo "The --new-site-url option is required for the migrate_urls command.\n";
+			exit( 1 );
+		}
+
+		echo "Replacing $base_url with " . $options['new-site-url'] . " in the input.\n\n";
+		if (!is_dir('./assets')) {
+			mkdir('./assets/', 0777, true);
+		}
+		$result = wp_rewrite_urls( array(
+			'block_markup' => $block_markup,
+			'base_url' => $base_url,
+			'current-site-url' => $options['current-site-url'],
+			'new-site-url' => $options['new-site-url'],
+		) );
+		if(!is_string($result)) {
+			echo "Error! \n";
+			print_r($result);
+			exit( 1 );
+		}
+		echo $result;
+		break;
+}
+
+function wp_list_urls_in_block_markup( $options ) {
+	$block_markup = $options['block_markup'];
+	$base_url = $options['base_url'] ?? 'https://playground.internal';
+	$p = new WP_Block_Markup_Url_Processor( $block_markup, $base_url );
+	while ( $p->next_url() ) {
+		// Skip empty relative URLs.
+		if ( ! trim( $p->get_raw_url() ) ) {
+			continue;
+		}
+		echo '* ';
+		switch ( $p->get_token_type() ) {
+			case '#tag':
+				echo 'In <' . $p->get_tag() . '> tag attribute "' . $p->get_inspected_attribute_name() . '": ';
+				break;
+			case '#block-comment':
+				echo 'In a ' . $p->get_block_name() . ' block attribute "' . $p->get_block_attribute_key() . '": ';
+				break;
+			case '#text':
+				echo 'In #text: ';
+				break;
+		}
+		echo $p->get_raw_url() . "\n";
+	}
+}
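The option loop in the script above calls `str_starts_with()`, which only exists in PHP 8.0 and later, while the commit message lists PHP 7.2+ compatibility as follow-up work. As a sketch of one possible way to bridge that gap, the same `--name value` parsing can be written with `strpos()`. The function name `sketch_parse_options()` is hypothetical and not part of this commit.

```php
<?php
/**
 * Hypothetical helper: collects `--name value` pairs from an argv-style
 * array, skipping the script name and the command at indexes 0 and 1.
 */
function sketch_parse_options( array $argv ): array {
	$options = array();
	$argc    = count( $argv );
	for ( $i = 2; $i < $argc; $i++ ) {
		// `0 === strpos( ... )` is the PHP 7.2-compatible spelling of
		// `str_starts_with( $argv[ $i ], '--' )`.
		if ( 0 === strpos( $argv[ $i ], '--' ) && isset( $argv[ $i + 1 ] ) ) {
			$options[ substr( $argv[ $i ], 2 ) ] = $argv[ $i + 1 ];
			$i++; // Skip the value that was just consumed.
		}
	}
	return $options;
}

$options = sketch_parse_options(
	array( 'script.php', 'migrate_urls', '--file', 'post.html', '--new-site-url', 'https://adamziel.com/blog' )
);
print_r( $options );
```

Like the original loop, this treats whatever follows a `--name` token as its value, so a flag without a value would swallow the next argument.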
Lines changed: 67 additions & 0 deletions
@@ -0,0 +1,67 @@
+<?php
+
+require_once __DIR__ . "/src/wordpress-core-html-api/class-wp-html-token.php";
+require_once __DIR__ . "/src/wordpress-core-html-api/class-wp-html-span.php";
+require_once __DIR__ . "/src/wordpress-core-html-api/class-wp-html-text-replacement.php";
+require_once __DIR__ . "/src/wordpress-core-html-api/class-wp-html-decoder.php";
+require_once __DIR__ . "/src/wordpress-core-html-api/class-wp-html-attribute-token.php";
+
+require_once __DIR__ . "/src/wordpress-core-html-api/class-wp-html-decoder.php";
+require_once __DIR__ . "/src/wordpress-core-html-api/class-wp-html-tag-processor.php";
+require_once __DIR__ . "/src/wordpress-core-html-api/class-wp-html-open-elements.php";
+require_once __DIR__ . "/src/wordpress-core-html-api/class-wp-token-map.php";
+require_once __DIR__ . "/src/wordpress-core-html-api/html5-named-character-references.php";
+require_once __DIR__ . "/src/wordpress-core-html-api/class-wp-html-active-formatting-elements.php";
+require_once __DIR__ . "/src/wordpress-core-html-api/class-wp-html-processor-state.php";
+require_once __DIR__ . "/src/wordpress-core-html-api/class-wp-html-unsupported-exception.php";
+require_once __DIR__ . "/src/wordpress-core-html-api/class-wp-html-processor.php";
+
+require_once __DIR__ . '/src/WP_Block_Markup_Processor.php';
+require_once __DIR__ . '/src/WP_Block_Markup_Url_Processor.php';
+require_once __DIR__ . '/src/WP_URL_In_Text_Processor.php';
+require_once __DIR__ . '/src/WP_URL.php';
+require_once __DIR__ . '/vendor/autoload.php';
+
+
+// Polyfill WordPress core functions
+function _doing_it_wrong() {
+
+}
+
+function __($input) {
+	return $input;
+}
+
+function esc_attr($input) {
+	return htmlspecialchars($input);
+}
+
+function esc_html($input) {
+	return htmlspecialchars($input);
+}
+
+function esc_url($url) {
+	return htmlspecialchars($url);
+}
+
+function wp_kses_uri_attributes() {
+	return array(
+		'action',
+		'archive',
+		'background',
+		'cite',
+		'classid',
+		'codebase',
+		'data',
+		'formaction',
+		'href',
+		'icon',
+		'longdesc',
+		'manifest',
+		'poster',
+		'profile',
+		'src',
+		'usemap',
+		'xmlns',
+	);
+}
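The polyfills above are declared unconditionally, so loading this file inside a running WordPress, where `__()` and `esc_attr()` already exist, would trigger a "cannot redeclare" fatal error. One common way to make such polyfills safe in both environments is a `function_exists()` guard. The sketch below applies it to two of the functions and is a suggestion only, not something this commit does.

```php
<?php
// Assumption: guarding each polyfill lets one bootstrap file load both
// standalone and inside WordPress, where these functions already exist.
if ( ! function_exists( '__' ) ) {
	function __( $input ) {
		// No translation outside WordPress; return the text unchanged.
		return $input;
	}
}

if ( ! function_exists( 'esc_attr' ) ) {
	function esc_attr( $input ) {
		// Same simplified escaping as the unguarded polyfill.
		return htmlspecialchars( $input );
	}
}

echo esc_attr( __( '<a href="x">' ) ), "\n";
// prints: &lt;a href=&quot;x&quot;&gt;
```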
Lines changed: 44 additions & 0 deletions
@@ -0,0 +1,44 @@
+{
+    "name": "wordpress/data-liberation",
+    "prefer-stable": true,
+    "require": {
+        "ext-json": "*",
+        "php": ">=7.2",
+        "rowbot/url": "^4.0"
+    },
+    "require-dev": {
+        "yoast/phpunit-polyfills": "2.0.0",
+        "squizlabs/php_codesniffer": "3.*",
+        "wp-coding-standards/wpcs": "3.1.0",
+        "phpcompatibility/php-compatibility": "*"
+    },
+    "config": {
+        "optimize-autoloader": true,
+        "preferred-install": "dist",
+        "allow-plugins": {
+            "dealerdirect/phpcodesniffer-composer-installer": true
+        }
+    },
+    "autoload": {
+        "classmap": [
+            "src/"
+        ],
+        "psr-4": {
+            "WordPress\\DataLiberation\\": "src/WordPress"
+        },
+        "files": [
+            "src/functions.php"
+        ]
+    },
+    "autoload-dev": {
+        "classmap": [
+            "tests/"
+        ]
+    },
+    "authors": [
+        {
+            "name": "WordPress Contributors",
+            "email": "contributors@wordpress.org"
+        }
+    ]
+}
