Checking and converting loaded content in HTML Reader based on charset #3995
Description
This is:
- [x] a bug report
- [x] a feature request
- [ ] **not** a usage question (ask them on https://stackoverflow.com/questions/tagged/phpspreadsheet or https://gitter.im/PHPOffice/PhpSpreadsheet)
What is the expected behavior?
That an HTML file with a non-UTF-8 charset is converted to UTF-8 before being loaded.
What is the current behavior?
Inside PhpOffice\PhpSpreadsheet\Reader\Html::loadIntoExisting it fails to load, as the preg_replace_callback step errors due to the non-UTF-8 encoding of $convert.
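To illustrate the failure outside the library, here is a minimal sketch that uses the same regular expression as loadIntoExisting on an ISO-8859-1 string. The callback is a hypothetical stand-in for replaceNonAscii (I have not copied the real one), and the sample markup is made up for the example. With the /u modifier, PCRE rejects a subject that is not valid UTF-8, so the call returns null.

```php
<?php
// "café" in ISO-8859-1: the single byte 0xE9 is not a valid UTF-8 sequence.
$latin1 = "<html><head><meta charset=\"iso-8859-1\"></head>"
    . "<body><table><tr><td>caf\xE9</td></tr></table></body></html>";

// Same range expression as in Html::loadIntoExisting.
$lowend = "\u{80}";
$highend = "\u{10ffff}";
$regexp = "/[$lowend-$highend]/u";

// Hypothetical stand-in for replaceNonAscii; it is never reached here
// because the subject fails UTF-8 validation first.
$callback = static fn (array $m): string => '&#' . mb_ord($m[0], 'UTF-8') . ';';

$result = preg_replace_callback($regexp, $callback, $latin1);

var_dump($result);                                  // NULL
var_dump(preg_last_error() === PREG_BAD_UTF8_ERROR); // bool(true)
```

Because $convert ends up as null, $loaded is set to false and the "Failed to load ... as a DOM Document" exception is what the caller sees.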
What are the steps to reproduce?
Create an HTML file with iso-8859-1 encoding and try to read it using the HTML Reader.
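A rough reproduction script, assuming PhpSpreadsheet is installed through Composer and autoloaded (the reader part is skipped if the class is not available). The file name and table content are made up for the example.

```php
<?php
// Build a small HTML document whose bytes are ISO-8859-1, not UTF-8.
$html = "<html><head><meta charset=\"iso-8859-1\"></head>"
    . "<body><table><tr><td>caf\xE9</td></tr></table></body></html>";

$path = sys_get_temp_dir() . '/latin1_' . uniqid() . '.html';
file_put_contents($path, $html);

// Sanity check: the content is not valid UTF-8.
var_dump(mb_check_encoding($html, 'UTF-8')); // bool(false)

// Assumption: the library is available via the Composer autoloader.
if (class_exists(\PhpOffice\PhpSpreadsheet\Reader\Html::class)) {
    $reader = new \PhpOffice\PhpSpreadsheet\Reader\Html();
    try {
        $reader->load($path);
        echo "Loaded without error\n";
    } catch (Throwable $e) {
        echo 'Load failed: ' . $e->getMessage() . "\n";
    }
}

unlink($path);
```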
What features do you think are causing the issue
Reader.
Does an issue affect all spreadsheet file formats? If not, which formats are affected?
This only affects the HTML Reader.
Which versions of PhpSpreadsheet and PHP are affected?
PHP 8.2 and 8.3, and PhpSpreadsheet 1.29 and 2.0.
Proposed change to code to resolve this issue
I have knocked up a fix for this, but I'm not sure if it should be more complex or in its own function. I based this on the code in the PhpOffice\PhpSpreadsheet\Reader\Security\XmlScanner::toUtf8 function. I don't think it belongs in there, as it deals with the HTML methods of declaring encodings, but I'm happy to be corrected:
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
<meta charset="iso-8859-1">
/**
 * Loads PhpSpreadsheet from file into PhpSpreadsheet instance.
 */
public function loadIntoExisting(string $filename, Spreadsheet $spreadsheet): Spreadsheet
{
    // Validate
    if (!$this->canRead($filename)) {
        throw new Exception($filename . ' is an Invalid HTML file.');
    }
    // Create a new DOM object
    $dom = new DOMDocument();
    // Reload the HTML file into the DOM object
    try {
        $convert = $this->getSecurityScannerOrThrow()->scanFile($filename);
        // START OF NEW CODE
        // Check for non-"UTF-8" charset
        $pattern = '/charset="?(.*?)("|;)/';
        $result = preg_match($pattern, $convert, $matches);
        if ($result) {
            $charset = strtoupper($matches[1]);
            $convert = mb_convert_encoding($convert, 'UTF-8', $charset);
            $convert = is_string($convert) ? $convert : '';
        }
        // END OF NEW CODE
        $lowend = "\u{80}";
        $highend = "\u{10ffff}";
        $regexp = "/[$lowend-$highend]/u";
        /** @var callable $callback */
        $callback = [self::class, 'replaceNonAscii'];
        $convert = preg_replace_callback($regexp, $callback, $convert);
        $loaded = ($convert === null) ? false : $dom->loadHTML($convert);
    } catch (Throwable $e) {
        $loaded = false;
    }
    if ($loaded === false) {
        throw new Exception('Failed to load ' . $filename . ' as a DOM Document', 0, $e ?? null);
    }
    self::loadProperties($dom, $spreadsheet);
    return $this->loadDocument($dom, $spreadsheet);
}
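If this were pulled out into its own function, it might look something like the sketch below. convertHtmlToUtf8 is a hypothetical name, not an existing PhpSpreadsheet method. Compared with the inline version, it only looks for charset inside a meta tag, also accepts single-quoted and unquoted values (the inline pattern would run on to the next " or ; for an unquoted value), and falls back to the original content when the declared charset is unknown to mbstring (which raises a ValueError on PHP 8).

```php
<?php
/**
 * Hypothetical helper (not part of PhpSpreadsheet): convert an HTML string
 * to UTF-8 based on the charset declared in a <meta> tag.
 */
function convertHtmlToUtf8(string $html): string
{
    // Covers <meta charset="..."> and the http-equiv Content-Type form,
    // with double quotes, single quotes, or no quotes around the value.
    $pattern = '/<meta[^>]*?charset\s*=\s*["\']?\s*([a-z0-9._:-]+)/i';
    if (preg_match($pattern, $html, $matches) !== 1) {
        return $html; // No declaration found: leave the content untouched.
    }

    $charset = strtoupper($matches[1]);
    if ($charset === 'UTF-8') {
        return $html; // Already declared as UTF-8.
    }

    try {
        $converted = mb_convert_encoding($html, 'UTF-8', $charset);
    } catch (ValueError $e) {
        return $html; // Unknown charset name: keep the original bytes.
    }

    return is_string($converted) ? $converted : $html;
}

// Usage with both declaration styles from above.
$a = "<html><head><meta charset='iso-8859-1'></head><body>caf\xE9</body></html>";
$b = "<html><head><meta http-equiv=\"Content-Type\" content=\"text/html; charset=iso-8859-1\" />"
    . "</head><body>caf\xE9</body></html>";

var_dump(mb_check_encoding(convertHtmlToUtf8($a), 'UTF-8')); // bool(true)
var_dump(mb_check_encoding(convertHtmlToUtf8($b), 'UTF-8')); // bool(true)
```

The meta tag still declares the old charset after conversion. I'm assuming that is harmless here because the following preg_replace_callback step rewrites the non-ASCII characters before the string reaches DOMDocument, but that would need checking.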