-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Excel Dynamic Arrays (Avoid Adding At-Signs to Formulas) #3962
Conversation
This has come up a number of times, most recently with issue PHPOffice#3901, and also issue PHPOffice#3659. It will certainly come up more often in days to come. Excel is changing formulas which PhpSpreadsheet has output as `=UNIQUE(A1:A19)`; Excel is processing the formula as it were `=@unique(A1:A19)`. This behavior is explained, in part, by PHPOffice#3659 (comment). It is doing so in order to ensure that the function returns only a single value rather than an array of values, in case the spreadsheet is being processed (or possibly was created) by a less current version of Excel which cannot handle the array result. PhpSpreadsheet follows Excel to a certain extent; it defaults to returning a single calculated value when an array would be returned. Further, its support for outputting an array even when that default is overridden is incomplete. I am not prepared to do everything that Excel does for the array functions (details below), but this PR is a start in that direction. If the default is changed via: ```php use PhpOffice\PhpSpreadsheet\Calculation\Calculation; Calculation::setArrayReturnType(Calculation::RETURN_ARRAY_AS_ARRAY); ``` When that is done, `getCalculatedValue` will return an array (no code change necessary). However, Writer/Xlsx will now be updated to look at that value, and if an array is returned in that circumstance, will indicate in the Xml that the result is an array *and* will include a reference to the bounds of the array. This gets us close, although not completely there, to what Excel does, and may be good enough for now. Excel will still mess with the formula, but now it will treat it as `{=UNIQUE(A1:A19)}`. This means that the spreadsheet will now look correct; there will be superficial differences, but all cells will have the expected value. Technically, the major difference between what PhpSpreadsheet will output now, and what Excel does on its own, is that Excel supplies values in the xml for all the cells in the range. That would be difficult for PhpSpreadsheet to do; that could be a project for another day. Excel will treat the output from PhpSpreadsheet as "Array Formulas" (a.k.a. CSE (control shift enter) formulas because you need to use that combination of keys to manually enter them in older versions of Excel). Current versions of Excel will instead use "Dynamic Array Formulas". Dynamic Array Formulas can be changed by the user; Array Formulas need to be deleted and re-entered if you want to change them. I don't know what else might have to change to get Excel to use the latter for PhpSpreadsheet formulas, and I will probably not even try to look now, saving it for a future date. Unit testing of this change uncovered a bug in Calculation::calculateCellValue. That routine saves off ArrayReturnType, and may change it, and is supposed to restore it. But it does not do the restore if the calculation throws an exception. It is changed to do so.
There was a lot of code that I'd implemented in #2787 that was designed to provide this full support for array formulae, including the handling of ranged returns and support for dynamic ranging with all the relevant tests and some of the new dynamic array functions. |
Thinking about PHPOffice#3958 - user wondered if unsupported formulas with array results could be handled better. I said that the answer was "no", but I think Xlsx Reader can make use of the dimensions of the result after all, so the answer is actually "sometimes". This is an initial attempt to do that. Implementing it revealed a bug in how Xlsx Reader handles array formula attributes, and that is now corrected. Likewise, Xlsx Writer did not indicate a value for the first cell in the array, and does now.
@MarkBaker Thanks, I will look over your code. We took very different approaches. I'll see if I can can reconcile some of the differences. Did you ever figure what was needed to make Excel open the PhpSpreadsheet-generated file so that the array functions are treated as dynamic rather than CSE? |
The solution above doesn't work for me. |
Address sample code submitted by @jr212 which was not working correctly.
@jr212 Thank you for the sample code. Please try again against this PR, which I have changed based on your example. That demonstrated at least 2 problems with the existing code. One, relatively easily handled, was that, since your code used auto-sized columns, the PhpSpreadsheet code wound up formatting the array results, and the formatting code was not expecting array inputs. I will note the array calculation supplies a value only to the first cell of the array, so the auto-sizing is ineffective anyhow. I will think about this some more; perhaps it is something we can live with, or perhaps the code can do better. The second problem is a little more troublesome. A formula like Finally, even though the Excel spreadsheet generated by your code is now correct (I think - please verify), I don't think that getCalculatedValue for cell A2 on sheet 'Gesorteerd verlorenpunten' agrees with what Excel shows. All in all, your example is very useful. But it may cause me to leave this change in draft status for quite a bit longer than I had planned. |
I did the test but nothing changed ☹ I composed twice with
* composer require --prefer-source phpoffice/phpspreadsheet
* composer require phpoffice/phpspreadsheet:dev-master
If that’s wrong remember that I know about nothing of composer.
I did redownloaded the PhpSpreadsheet-2.0-Development.zip code
Jan
|
@jr212 It is lucky then that I didn't realize your initial problem was with Composer, otherwise I might not have noticed the code problems. I am also not expert in Composer, but I think the following should work. Assume that you have a main directory called "git" (change to whatever is appropriate). Go to that directory and issue the following command:
This will create a new directory "git/atsign3". Change to that directory and run:
It would not surprise me in the least if there is a simpler way to do this, but this should work. |
Merged master to eliminate a no-longer-valid dependency in composer.lock, not as prep work before installing. |
@jr212 I believe you can get a much smaller directory by using the following install command rather than the one I initially suggested:
Please bear in mind that the code you will be using is some way from being "production ready". I will be glad to continue to work with you, but this code may never make it into production, and, even if it does, I don't know how soon that might happen. On Windows systems, when a spreadsheet is downloaded from a browser, the behavior you described is normal. I don't need you to maintain the zip file you uploaded any longer. However, I will eventually need to add a test case based on what it showed me. Do you have a problem with my basing that test case on the data and code you supplied? |
No problem its anonymised.
|
There are 2 new problems. Still related I think therefore I use the same
ticket.
Code and example to find https://janr.be/excel/excel.zip. free to use.
1. When running the code the file is created but I got an error from
Excel.(sorry can't translate) but the file seems fine.
2. In the function doTS there are 2 functions. the one with CHOOSECOLS
doesn't work. Not implemented yet?
Op ma 22 apr 2024 om 09:05 schreef ***@***.***>:
… No problem its anonymised.
--
Met vriendelijke groeten
Jan
***@***.***
|
@jr212 Your spreadsheet is no longer available at the address you supplied, probably because I took too long to download it. If you have something you want me to look at, please make it available again. |
Vind it on https://janr.be/excel.zip
|
See issue PHPOffice#4062. When calculating an array formula, populate all the cells associated with the result. This is almost the same as Excel's behavior. As yet, there is no attempt to create a #SPILL error, so cells may be inappropriately overwritten. Also, if the array size shrinks (e.g. there are fewer unique values than before), no attempt is made to unpopulate the cells which were in range but are now outside the new dimensions. Spill and unpopulation are somewhat related, and will probably be handled at the same time, but their time has not yet come. UNIQUE, at least for rows, was treating all cell (calculated) values as strings. This is not the same behavior as Excel, which will preserve datatypes, and treat int 3 and string 3 as unique values. Excel will, however, treat int 3 and float 3.0 as non-unique. Within UNIQUE, private function uniqueByRow is changed to try to preserve the the datatype when executing (it will probably treat 3.0 as int - I don't know how I can, or even if I should attempt to, do better - but no int nor float should be treated as a string).
Frustrating morning.
I think I should go back to bed.
ArrayFunctions2Test - the calculations seem too complicated for PhpSpreadsheet. The debug log is 21,300 lines, so I don't know how far I will get with it.
I think I have finally cracked how to prevent Excel from changing the formula to CSE format. It's hideous.
I think it's probably doable, but it might be fragile, especially if other features need to be enabled in this way. |
With a number of changes, PhpSpreadsheet can finally generate a spreadsheet which Excel will recognize as a Dynamic Array function rather than CSE. In particular, changes are needed to ContentTypes, workbook.xml.rels, cell definitions in the worksheet, and a new metadata.xml is added.
I think that if I can figure out SPILL, it will be time to move this out of draft status. |
Some examples submitted by @infojunkie have demonstrated that an anchor cell without a spill operator was not being handled consistently with Excel. This means that the calculated value of such a cell needs to be a scalar when using it as part of a formula without the spill operator, but as an array when using it with the spill operator or when just getting the cell's value. This is tricky. My solution seems awfully kludgey to me, but it does seem to work. I have another change that I will want to make in a day or so. When that change is pushed, I will take this back out of draft status, and will now aim for an install date of about August 6. This particular commit removes the changing of array return type during calculations. It's been this way for a very long time, but I don't understand why it should have been needed in the first place. It causes problems (you need to be sure to restore the original value even when, for example, you throw an exception during calculation). I am relieved that it caused a problem only for one test member. The TEXTSPLIT function really makes sense only when you are returning arrays as arrays. Its test now needs to set that value explicitly since the Calculation engine is no longer changing it under the covers. No other tests broke as a result of this change. One other test "broke" as a result of this commit. One of the tests for INDIRECT had been expecting a null value, and the test was commented with "Excel result is 0". The PhpSpreadsheet result is now 0 as well, so this would seem to be a bugfix rather than a breaking change.
@infojunkie Please try again with the latest commit. |
Thanks @oleibman running my same test above on the latest commit yields the expected result:
This latest fix of yours also fixed a follow-up crash I was encountering in my test script when I attempted to reference cells from the B column, which itself is a Great job!! |
Till now we have used a static variable/getter/setter to decide what type of result should be returned when a formula is evaluated and an array is the result. This is messy; it would be much better to use an instance variable instead. We cannot eliminate `setArrayReturnType` and `getArrayReturnType` because that would be a BC break. I am considering whether they should be deprecated. In the meantime, I have added a new instance property `instanceArrayReturnType` with getter and setter methods. The property is initially null, and, if it remains so when needed, the static property will be used instead. However, if it is set, its value will be used.
Ready again. New tentative install date August 7. |
@PowerKiKi @MarkBaker I would like to create (= "tag"?) release 2.2 next week to clear the deck for this change a few weeks later as the start of release 3.0. I haven't created a release before, so, if you want to do it, that would be fine; but, if you're okay with me trying it, I would like to make sure it happens at a time which is convenient for you to salvage things if I mess up. Can you suggest a time (with timezone) which would work for you next Wednesday or Thursday? |
I am not following the project as closely as I used to, but I also love coming back to cut releases 😍 But it's totally fine for you to do it. The process is meant to be simple. AFAIK there are only two things you can screw up:
I enabled GitHub notifications for new releases. So I suppose you can cut a release whenever you want (unless if Mark is working on something he'd like to include ?), and I'll be double-checking... PS: don't use prefix like |
@PowerKiKi I released 2.2.0. It's in packagist. However, CI tells me that the "release" step failed - see https://github.com/PHPOffice/PhpSpreadsheet/actions/runs/10077609508/job/27860519974. I don't know if that means something that needs to be done hasn't happened yet. Please take a look. |
I'm not seeing this PR in the release notes of 2.2.0, nor in the code of master. So if your intention was to merge and release this PR, it hasn't happened. |
@infojunkie 2.2.0 was "clearing the decks" for this change. My plan is for it to be part of a new major release (3.0.0). I still plan to merge this to master around August 7 (leaves time to make sure there are no major problems with 2.2.0), although I do not yet have a date in mind for 3.0.0. |
I am also not sure what happened. But it seems it didn't matter all that much anyway. I noticed this did not happen again with my just released 2.2.1: https://github.com/PHPOffice/PhpSpreadsheet/actions/runs/10140340976/job/28035354511 🤷 |
This will, I hope, be my last change prior to merge on August 7. PR is fully synced with master (except for this change), and, except for an emergency, I do not intend to merge anything else before this.
I had planned to merge this today, but first I need to evaluate issue #4128 (possible break in 2.2.0/2.2.1). New tentative date August 14. |
2.2.2 is confirmed to fix problem with Excel 2016, and no new 2.2.* problems have surfaced. Advancing merge date to Monday Aug. 12. |
Fix PHPOffice#4324. Serialization was explicity forbidden by PR PHPOffice#3199. This was in response to several issues, and concern that the Spreadsheet object contained non-serializable properties. This PR restores the ability to serialize a spreadsheet. Json serialization remains unsupported. Fix PHPOffice#1757, closed in Nov. 2023 but just reopened. At the time, Cell property `formulaAttributes` was stored as a SimpleXmlElement. Dynamic arrays PR PHPOffice#3962 defined that property as `null|array<string, string>` in the doc block. However, it left the formal Php type for the property as `mixed`. This PR changes the formal type to `?array`. Fix PHPOffice#1741, closed in Dec. 2020 but just reopened. Calculation property `referenceHelper` was defined as static, and, since static properties don't take part in serialization, this caused a problem after unserialization. There are at least 3 trivial ways to deal with this - make it an instance property, reinitialize it when unserialized using a wakeup method, or remove the property altogether. This PR uses the last of those 3. Calculation does have other static properties. Almost all of these deal with locale. So serialize/unserialize might wind up using a default locale when non-default is desired (but not necessarily required). If that is a problem for end-users, it will be a new one, and I will work on a solution if and when the time comes. Static property `returnArrayAsType` is potentially problematic. However, instance property `instanceArrayReturnType` is the preferred method of handling this, and using that will avoid any problems. Issue PHPOffice#932 also dealt with serialization. I do not have the wherewithal to investigate that issue. If it is not solved by this and the earlier PR's, I will have to leave it to others to re-raise it. Spreadsheet `copy` is now simplified to use serialize followed by unserialize. Formal tests are added. In addition, I have made a number of informal tests on very complicated spreadsheets, and it has performed correctly for all of them.
When the Dynamic Array PR PHPOffice#3962 was introduced, it left the default as Return Array as Value. At some point, the default should be changed to Return Array as Array. This would, of course, be a breaking change, one which will not be part of Release 4. However, it will possibly be part of Release 5. Rather than relying on the default setting, this PR explicitly sets Return Array as Value when tests require that setting. This will make it easier to identify potential breaks when the default is changed. The entire test suite will now succeed with either setting as default. In making these changes, a few minor problems were discovered with how Array as Array is handled. These are fixed with this PR.
This has come up a number of times, most recently with issue #3901, and also issue #3659, and issue #1834. It will certainly come up more often in days to come. Excel is changing formulas which PhpSpreadsheet has output as
=UNIQUE(A1:A19)
; Excel is processing the formula as it were=@UNIQUE(A1:A19)
. This behavior is explained, in part, by #3659 (comment). It is doing so in order to ensure that the function returns only a single value rather than an array of values, in case the spreadsheet is being processed (or possibly was created) by a less current version of Excel which cannot handle the array result.PhpSpreadsheet follows Excel to a certain extent; it defaults to returning a single calculated value when an array would be returned. Further, its support for outputting an array even when that default is overridden is incomplete. I am not prepared to do everything that Excel does for the array functions (details below), but this PR is a start in that direction. If the default is changed via:
When that is done,
getCalculatedValue
will return an array (no code change necessary). However, Writer/Xlsx will now be updated to look at that value, and if an array is returned in that circumstance, will indicate in the Xml that the result is an array and will include a reference to the bounds of the array. This gets us close, although not completely there, to what Excel does, and may be good enough for now. Excel will still mess with the formula, but now it will treat it as{=UNIQUE(A1:A19)}
. This means that the spreadsheet will now look correct; there will be superficial differences, but all cells will have the expected value.Technically, the major difference between what PhpSpreadsheet will output now, and what Excel does on its own, is that Excel supplies values in the xml for all the cells in the range. That would be difficult for PhpSpreadsheet to do; that could be a project for another day. Excel will treat the output from PhpSpreadsheet as "Array Formulas" (a.k.a. CSE (control shift enter) formulas because you need to use that combination of keys to manually enter them in older versions of Excel). Current versions of Excel will instead use "Dynamic Array Formulas". Dynamic Array Formulas can be changed by the user; Array Formulas need to be deleted and re-entered if you want to change them. I don't know what else might have to change to get Excel to use the latter for PhpSpreadsheet formulas, and I will probably not even try to look now, saving it for a future date.
Unit testing of this change uncovered a bug in Calculation::calculateCellValue. That routine saves off ArrayReturnType, and may change it, and is supposed to restore it. But it does not do the restore if the calculation throws an exception. It is changed to do so.
This is:
Checklist:
Why this change is needed?
Provide an explanation of why this change is needed, with links to any Issues (if appropriate).
If this is a bugfix or a new feature, and there are no existing Issues, then please also create an issue that will make it easier to track progress with this PR.