Checking and converting loaded content in HTML Reader based on charset  #3995

Closed
@pdscopes

Description

This is:

- [x] a bug report
- [x] a feature request
- [ ] **not** a usage question (ask them on https://stackoverflow.com/questions/tagged/phpspreadsheet or https://gitter.im/PHPOffice/PhpSpreadsheet)

What is the expected behavior?

An HTML file with a non-UTF-8 charset should be converted to UTF-8 before being loaded.

What is the current behavior?

Inside PhpOffice\PhpSpreadsheet\Reader\Html::loadIntoExisting, the load fails because the preg_replace_callback step fails when $convert is not valid UTF-8.
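A minimal standalone sketch of that failure, outside PhpSpreadsheet. It reuses the character class that loadIntoExisting builds from $lowend and $highend; the sample string and the callback body are assumptions made for illustration (the real replaceNonAscii is not reproduced here). On ISO-8859-1 input the /u pattern should make preg_replace_callback return null, which loadIntoExisting then turns into its "Failed to load" exception.

```php
<?php

// "café" encoded as ISO-8859-1: the é is the single byte 0xE9,
// which is not a valid UTF-8 sequence on its own.
$latin1 = "caf\xE9";

// Same character class the reader builds from $lowend and $highend.
$lowend = "\u{80}";
$highend = "\u{10ffff}";
$regexp = "/[$lowend-$highend]/u";

// Stand-in for replaceNonAscii (hypothetical body; only the call shape matters).
$callback = static fn (array $m): string => '?';

// With invalid UTF-8 input and the /u modifier, PCRE refuses the subject.
$failed = preg_replace_callback($regexp, $callback, $latin1);
$failedError = preg_last_error();

// Converting to UTF-8 first lets the same call succeed.
$utf8 = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
$ok = preg_replace_callback($regexp, $callback, $utf8);
```

Here $failed is null and $failedError is PREG_BAD_UTF8_ERROR, while $ok is a normal string with the non-ASCII character replaced.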

What are the steps to reproduce?

Create an HTML file with ISO-8859-1 encoding, then try to read it using the HTML Reader.

What features do you think are causing the issue

Reader.

Does an issue affect all spreadsheet file formats? If not, which formats are affected?

This only affects the HTML Reader.

Which versions of PhpSpreadsheet and PHP are affected?

PHP 8.2 and 8.3, and PhpSpreadsheet 1.29 and 2.0.

Proposed change to code to resolve this issue

I have knocked up a fix for this, but I'm not sure whether it should be more complex or live in its own function. I based it on the code in the PhpOffice\PhpSpreadsheet\Reader\Security\XmlScanner::toUtf8 function. I don't think it belongs in there, as it deals with the HTML-specific ways of declaring an encoding, but I'm happy to be corrected:

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />

<meta charset="iso-8859-1">

    /**
     * Loads PhpSpreadsheet from file into PhpSpreadsheet instance.
     */
    public function loadIntoExisting(string $filename, Spreadsheet $spreadsheet): Spreadsheet
    {
        // Validate
        if (!$this->canRead($filename)) {
            throw new Exception($filename . ' is an Invalid HTML file.');
        }

        // Create a new DOM object
        $dom = new DOMDocument();

        // Reload the HTML file into the DOM object
        try {
            $convert = $this->getSecurityScannerOrThrow()->scanFile($filename);

            // START OF NEW CODE

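            // Check for a declared charset other than UTF-8, in either the
            // <meta charset="..."> or the http-equiv Content-Type form.
            // Case-insensitive; accepts double, single, or no quotes.
            $pattern = '/charset\s*=\s*["\']?([\w.:-]+)/i';
            $result = preg_match($pattern, $convert, $matches);
            if ($result) {
                $charset = strtoupper($matches[1]);
                if ($charset !== 'UTF-8') {
                    $convert = mb_convert_encoding($convert, 'UTF-8', $charset);
                    $convert = is_string($convert) ? $convert : '';
                }
            }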

            // END OF NEW CODE

            $lowend = "\u{80}";
            $highend = "\u{10ffff}";
            $regexp = "/[$lowend-$highend]/u";
            /** @var callable $callback */
            $callback = [self::class, 'replaceNonAscii'];
            $convert = preg_replace_callback($regexp, $callback, $convert);
            $loaded = ($convert === null) ? false : $dom->loadHTML($convert);
        } catch (Throwable $e) {
            $loaded = false;
        }
        if ($loaded === false) {
            throw new Exception('Failed to load ' . $filename . ' as a DOM Document', 0, $e ?? null);
        }
        self::loadProperties($dom, $spreadsheet);

        return $this->loadDocument($dom, $spreadsheet);
    }
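If the conversion were pulled out into its own function, it might look something like the sketch below. The name htmlToUtf8 is hypothetical and not part of PhpSpreadsheet, and the fallbacks are assumptions: the content is returned unchanged when there is no charset declaration, when the declaration is already UTF-8, or when mbstring does not recognise the declared name (PHP 8 raises a ValueError in that case).

```php
<?php

/**
 * Hypothetical helper (not part of PhpSpreadsheet): convert HTML to UTF-8
 * based on the first charset declaration found in the markup.
 */
function htmlToUtf8(string $html): string
{
    // Matches both <meta charset="x"> and content="text/html; charset=x",
    // with double quotes, single quotes, or no quotes around the name.
    if (preg_match('/charset\s*=\s*["\']?([\w.:-]+)/i', $html, $matches) !== 1) {
        return $html; // no declaration: leave the content untouched
    }

    $charset = strtoupper($matches[1]);
    if ($charset === 'UTF-8' || $charset === 'UTF8') {
        return $html; // already declared as UTF-8
    }

    try {
        $converted = mb_convert_encoding($html, 'UTF-8', $charset);
    } catch (ValueError $e) {
        // PHP 8 throws ValueError for a charset name mbstring does not know.
        return $html;
    }

    return is_string($converted) ? $converted : $html;
}
```

Inside loadIntoExisting the call would then be a single line, $convert = htmlToUtf8($convert);, placed right after scanFile and before the preg_replace_callback step.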
