-
Notifications
You must be signed in to change notification settings - Fork 3.6k
Closed
Description
While trying to work with data from a ≈8.5 MiB xlsx file containing ≈96500 rows for 19 columns on one sheet (≈1.8 million cells, most of which contain data), we noticed significant slowdowns while reading sequentially, row by row using the rangeToArray method on the Worksheet class.
Through some tracing and after following call chains, we found the following issues:
- The
rangeToArrayYieldRowsmethod on theWorksheetclass uses a linear search (with an increment of 1) to find the right cell index, this is quick enough when reading at the start of the file but it gets progressively slower as it has to read further along the file. - The
Collection\Cellsclass always sorts its$indexarray when callinggetSortedCoordinatesandgetSortedCoordinatesInteven when no change was done to the array in between calls. - The
Collection\Cellsclass usesarray_keysandarray_valuesto get the keys and values of its$indexarray whengetCoordinates,getSortedCoordinates, andgetSortedCoordinatesIntare called, even when the array was not modified between calls to these methods.
We fixed these issues on our end temporarily by implementing the following changes, respectively:
- Using a binary search algorithm to approach the correct index value quickly and finding the proper value afterwards. The searched index value is not always present, which is why, we assume, a linear search was used in the first place, so we also need to fix the value after the binary search, which we did with a linear backwards search followed by a linear forward search, but it limits the iterations to just a row's worth of indices rather than the whole file's worth.
- Using a flag to mark when the
$indexarray is sorted, and clearing that flag whenever the array is changed. - Using arrays to cache the keys and values of the
$indexarray and clearing these whenever it is changed.
For reference, we changed the following methods of the Collections\Cells class:
deleteclears the keys and values caches (if the array is sorted, removing a value should not change that, if the array is not sorted, the flag should not be set, so this method probably shouldn't need to mark the array as needing to be sorted)getCoordinatesbuilds the keys cache if it does not exist and returns the keys cache.getSortedCoordinatessorts the array if it needs sorting, marks it as sorted, builds the keys cache if it does not exist, and returns the keys cache.getSortedCoordinatesIntsorts the array if it needs sorting, marks it as sorted, builds the values cache if it does not exist, and returns the values cache.cloneCellCollectionprobably needs to clear the sorted flag and both the keys and values cache arrays from the newly created collection (we did it anyway to be sure)addclears the sorted flag and both cache arraysunsetWorksheetCellsclears the sorted flag and both cache arrays
Through the modifications listed above, we improved our processing time from a few hours to a few minutes.
williamyiu and AIlkiv
Metadata
Metadata
Assignees
Labels
No labels