Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Php/iconv Should Not Treat FFFE/FFFF as Valid #2910

Merged
merged 2 commits into from
Jul 2, 2022

Conversation

oleibman
Copy link
Collaborator

Fix #2897. We have been relying on iconv/mb_convert_encoding to detect invalid UTF-8, but all techniques designed to validate UTF-8 seem to accept FFFE and FFFF. This PR explicitly converts those characters to FFFD (Unicode substitution character) before validating the rest of the string. It also substitutes one or more FFFD when it detects invalid UTF-8 character sequences.

A comment in the code being change stated that it doesn't handle surrogates. It is right not to do so. The only case where we should see surrogates is reading UTF-16. Additional tests are added to an existing test reading a UTF-16 Csv to demonstrate that surrogates are handled correctly, and that FFFE/FFFF are handled reasonably.

This is:

- [x] a bugfix
- [ ] a new feature
- [ ] refactoring
- [ ] additional unit tests

Checklist:

  • Changes are covered by unit tests
    • Changes are covered by existing unit tests
    • New unit tests have been added
  • Code style is respected
  • Commit message explains why the change is made (see https://github.com/erlang/otp/wiki/Writing-good-commit-messages)
  • CHANGELOG.md contains a short summary of the change
  • Documentation is updated as necessary

Why this change is needed?

Provide an explanation of why this change is needed, with links to any Issues (if appropriate).
If this is a bugfix or a new feature, and there are no existing Issues, then please also create an issue that will make it easier to track progress with this PR.

oleibman added 2 commits June 25, 2022 23:19
Fix PHPOffice#2897. We have been relying on iconv/mb_convert_encoding to detect invalid UTF-8, but all techniques designed to validate UTF-8 seem to accept FFFE and FFFF. This PR explicitly converts those characters to FFFD (Unicode substitution character) before validating the rest of the string. It also substitutes one or more FFFD when it detects invalid UTF-8 character sequences.

A comment in the code being change stated that it doesn't handle surrogates. It is right not to do so. The only case where we should see surrogates is reading UTF-16. Additional tests are added to an existing test reading a UTF-16 Csv to demonstrate that surrogates are handled correctly, and that FFFE/FFFF are handled reasonably.
@oleibman oleibman merged commit c3f5385 into PHPOffice:master Jul 2, 2022
MarkBaker added a commit that referenced this pull request Jul 9, 2022
Note that this will be the last 1.x branch release before the 2.x release. We will maintain both branches in parallel for a time; but users are requested to update to version 2.0 once that is fully available.

### Added

- Added `removeComment()` method for Worksheet [PR #2875](https://github.com/PHPOffice/PhpSpreadsheet/pull/2875/files)
- Add point size option for scatter charts [Issue #2298](#2298) [PR #2801](#2801)
- Basic support for Xlsx reading/writing Chart Sheets [PR #2830](#2830)

  Note that a ChartSheet is still only written as a normal Worksheet containing a single chart, not as an actual ChartSheet.

- Added Worksheet visibility in Ods Reader [PR #2851](#2851) and Gnumeric Reader [PR #2853](#2853)
- Added Worksheet visibility in Ods Writer [PR #2850](#2850)
- Allow Csv Reader to treat string as contents of file [Issue #1285](#1285) [PR #2792](#2792)
- Allow Csv Reader to store null string rather than leave cell empty [Issue #2840](#2840) [PR #2842](#2842)
- Provide new Worksheet methods to identify if a row or column is "empty", making allowance for different definitions of "empty":
  - Treat rows/columns containing no cell records as empty (default)
  - Treat cells containing a null value as empty
  - Treat cells containing an empty string as empty

### Changed

- Modify `rangeBoundaries()`, `rangeDimension()` and `getRangeBoundaries()` Coordinate methods to work with row/column ranges as well as with cell ranges and cells [PR #2926](#2926)
- Better enforcement of value modification to match specified datatype when using `setValueExplicit()`
- Relax validation of merge cells to allow merge for a single cell reference [Issue #2776](#2776)
- Memory and speed improvements, particularly for the Cell Collection, and the Writers.

  See [the Discussion section on github](#2821) for details of performance across versions
- Improved performance for removing rows/columns from a worksheet

### Deprecated

- Nothing

### Removed

- Nothing

### Fixed

- Xls Reader resolving absolute named ranges to relative ranges [Issue #2826](#2826) [PR #2827](#2827)
- Null value handling in the Excel Math/Trig PRODUCT() function [Issue #2833](#2833) [PR #2834](#2834)
- Invalid Print Area defined in Xlsx corrupts internal storage of print area [Issue #2848](#2848) [PR #2849](#2849)
- Time interval formatting [Issue #2768](#2768) [PR #2772](#2772)
- Copy from Xls(x) to Html/Pdf loses drawings [PR #2788](#2788)
- Html Reader converting cell containing 0 to null string [Issue #2810](#2810) [PR #2813](#2813)
- Many fixes for Charts, especially, but not limited to, Scatter, Bubble, and Surface charts. [Issue #2762](#2762) [Issue #2299](#2299) [Issue #2700](#2700) [Issue #2817](#2817) [Issue #2763](#2763) [Issue #2219](#2219) [Issue #2863](#2863) [PR #2828](#2828) [PR #2841](#2841) [PR #2846](#2846) [PR #2852](#2852) [PR #2856](#2856) [PR #2865](#2865) [PR #2872](#2872) [PR #2879](#2879) [PR #2898](#2898) [PR #2906](#2906) [PR #2922](#2922) [PR #2923](#2923)
- Adjust both coordinates for two-cell anchors when rows/columns are added/deleted. [Issue #2908](#2908) [PR #2909](#2909)
- Keep calculated string results below 32K. [PR #2921](#2921)
- Filter out illegal Unicode char values FFFE/FFFF. [Issue #2897](#2897) [PR #2910](#2910)
- Better handling of REF errors and propagation of all errors in Calculation engine. [PR #2902](#2902)
- Calculating Engine regexp for Column/Row references when there are multiple quoted worksheet references in the formula [Issue #2874](#2874) [PR #2899](#2899)
@oleibman oleibman deleted the issue2897 branch July 23, 2022 23:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

fromArray method write cell failure because of xEFxBFxBE character(s) existence in string
1 participant