faster ALP encode #924
nice!
one small nit
    }

    // if there are no patches, we are done
    if chunk_patch_count == 0 {
Need to handle the edge case of 2 chunks where chunk 0 is all patches and chunk 1 has 0 patches: in that case the patched slots in chunk 0 won't get filled.
Realized that there's an unhandled edge case in #924 ([commented here](https://github.com/spiraldb/vortex/pull/924/files#r1776099681)). Essentially, on develop, if we have two chunks where the first chunk is all patches and the second chunk has 0 patches, then the patched values won't get filled in the encoded array. Not the end of the world (they're presumably full of integer approximations that don't round-trip), but if the values being patched are large outliers, then the encoded values will end up bitpacking poorly. This PR fixes that.
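The edge case above can be illustrated with a minimal sketch. This is not the Vortex implementation: `fill_patched_slots`, its parameters, and the `pending` bookkeeping are hypothetical names chosen for illustration, and using the first non-patched encoded value as the fill is an assumption. The point is only that slots left unfilled by an all-patch chunk get backfilled as soon as a later chunk supplies a fill value, even when that later chunk has no patches of its own.

```rust
/// Hypothetical helper (not the Vortex API): overwrite every patched slot in
/// `encoded` with a fill value taken from the first non-patched slot, while
/// walking the array chunk by chunk.
fn fill_patched_slots(encoded: &mut [i64], is_patch: &[bool], chunk_size: usize) {
    assert_eq!(encoded.len(), is_patch.len());
    assert!(chunk_size > 0);

    // Fill value, unknown until some chunk contains a non-patched slot.
    let mut fill: Option<i64> = None;
    // Patched positions seen before any fill value was known.
    let mut pending: Vec<usize> = Vec::new();

    for start in (0..encoded.len()).step_by(chunk_size) {
        let end = (start + chunk_size).min(encoded.len());

        if fill.is_none() {
            fill = (start..end).find(|&i| !is_patch[i]).map(|i| encoded[i]);
            // Backfill earlier chunks *before* any early exit for
            // "this chunk has no patches"; skipping this step is the bug.
            if let Some(f) = fill {
                for i in pending.drain(..) {
                    encoded[i] = f;
                }
            }
        }

        for i in start..end {
            if is_patch[i] {
                match fill {
                    Some(f) => encoded[i] = f,
                    None => pending.push(i),
                }
            }
        }
    }
}
```

With two chunks of size 2, patches at positions 0 and 1, and none in the second chunk, the large placeholder values in chunk 0 are replaced by the first value of chunk 1, so the encoded array stays cheap to bitpack.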
This PR trims invalid values from the patches and makes the patches' validity either AllValid (for nullable arrays) or NonNullable.

This microbenchmark doesn't reveal any clear improvements or degradations. It seems to me mostly noise. In theory, this change should make decompression a bit faster because validity is in one place, but my primary goal here is to make the ALP array simpler: validity is in one place, the encoded array.

### Benchmarks on latest commit:

- PR: 7fb595b
- develop: 0a18498

The parameter is: (number of elements, fraction patched, fraction valid). Any ratio greater than 1.1 or less than 0.9 has a ` ***`

```
alp_compress                    │ PR median │ develop median │ ratio
├─ compress_alp                 │           │                │
│  ├─ f32                       │           │                │
│  │  ├─ (100000, 0.0, 0.25)    │ 160.4 µs  │ 159.6 µs       │ 1.0050
│  │  ├─ (100000, 0.0, 0.95)    │ 145.9 µs  │ 143.8 µs       │ 1.0146
│  │  ├─ (100000, 0.0, 1.0)     │ 137.0 µs  │ 135.5 µs       │ 1.0110
│  │  ├─ (100000, 0.01, 0.25)   │ 227.7 µs  │ 230.7 µs       │ 0.9869
│  │  ├─ (100000, 0.01, 0.95)   │ 227.9 µs  │ 227.2 µs       │ 1.0030
│  │  ├─ (100000, 0.01, 1.0)    │ 226.6 µs  │ 227.5 µs       │ 0.9960
│  │  ├─ (100000, 0.1, 0.25)    │ 238.3 µs  │ 248.9 µs       │ 0.9574
│  │  ├─ (100000, 0.1, 0.95)    │ 238.2 µs  │ 269.8 µs       │ 0.8828 ***
│  │  ├─ (100000, 0.1, 1.0)     │ 230.6 µs  │ 231.9 µs       │ 0.9943
│  │  ├─ (10000000, 0.0, 0.25)  │ 14.17 ms  │ 13.77 ms       │ 1.0290
│  │  ├─ (10000000, 0.0, 0.95)  │ 14.16 ms  │ 13.8 ms        │ 1.0260
│  │  ├─ (10000000, 0.0, 1.0)   │ 14.0 ms   │ 12.47 ms       │ 1.1226 ***
│  │  ├─ (10000000, 0.01, 0.25) │ 22.29 ms  │ 23.13 ms       │ 0.9636
│  │  ├─ (10000000, 0.01, 0.95) │ 22.26 ms  │ 23.78 ms       │ 0.9360
│  │  ├─ (10000000, 0.01, 1.0)  │ 22.19 ms  │ 21.79 ms       │ 1.0183
│  │  ├─ (10000000, 0.1, 0.25)  │ 23.31 ms  │ 27.72 ms       │ 0.8409 ***
│  │  ├─ (10000000, 0.1, 0.95)  │ 23.4 ms   │ 27.47 ms       │ 0.8518 ***
│  │  ╰─ (10000000, 0.1, 1.0)   │ 22.99 ms  │ 22.31 ms       │ 1.0304
│  ╰─ f64                       │           │                │
│     ├─ (100000, 0.0, 0.25)    │ 165.2 µs  │ 165.4 µs       │ 0.9987
│     ├─ (100000, 0.0, 0.95)    │ 166.1 µs  │ 163.4 µs       │ 1.0165
│     ├─ (100000, 0.0, 1.0)     │ 164.7 µs  │ 179.9 µs       │ 0.9155
│     ├─ (100000, 0.01, 0.25)   │ 269.7 µs  │ 259.1 µs       │ 1.0409
│     ├─ (100000, 0.01, 0.95)   │ 270.5 µs  │ 259.6 µs       │ 1.0419
│     ├─ (100000, 0.01, 1.0)    │ 268.9 µs  │ 270.6 µs       │ 0.9937
│     ├─ (100000, 0.1, 0.25)    │ 281.7 µs  │ 281.3 µs       │ 1.0014
│     ├─ (100000, 0.1, 0.95)    │ 279.1 µs  │ 315.3 µs       │ 0.8851 ***
│     ├─ (100000, 0.1, 1.0)     │ 273.0 µs  │ 275.7 µs       │ 0.9902
│     ├─ (10000000, 0.0, 0.25)  │ 16.16 ms  │ 15.86 ms       │ 1.0189
│     ├─ (10000000, 0.0, 0.95)  │ 16.19 ms  │ 15.75 ms       │ 1.0279
│     ├─ (10000000, 0.0, 1.0)   │ 16.2 ms   │ 15.83 ms       │ 1.0233
│     ├─ (10000000, 0.01, 0.25) │ 25.29 ms  │ 25.77 ms       │ 0.9813
│     ├─ (10000000, 0.01, 0.95) │ 25.74 ms  │ 25.94 ms       │ 0.9922
│     ├─ (10000000, 0.01, 1.0)  │ 25.54 ms  │ 25.32 ms       │ 1.0086
│     ├─ (10000000, 0.1, 0.25)  │ 26.89 ms  │ 30.73 ms       │ 0.8750 ***
│     ├─ (10000000, 0.1, 0.95)  │ 27.05 ms  │ 30.53 ms       │ 0.8860 ***
│     ╰─ (10000000, 0.1, 1.0)   │ 26.22 ms  │ 25.98 ms       │ 1.0092
├─ decompress_alp               │           │                │
│  ├─ f32                       │           │                │
│  │  ├─ (100000, 0.0, 0.25)    │ 12.24 µs  │ 12.33 µs       │ 0.9927
│  │  ├─ (100000, 0.0, 0.95)    │ 12.24 µs  │ 12.16 µs       │ 1.0065
│  │  ├─ (100000, 0.0, 1.0)     │ 12.2 µs   │ 12.16 µs       │ 1.0032
│  │  ├─ (100000, 0.01, 0.25)   │ 15.12 µs  │ 14.04 µs       │ 1.0769
│  │  ├─ (100000, 0.01, 0.95)   │ 14.95 µs  │ 14.81 µs       │ 1.0094
│  │  ├─ (100000, 0.01, 1.0)    │ 13.43 µs  │ 13.24 µs       │ 1.0143
│  │  ├─ (100000, 0.1, 0.25)    │ 26.08 µs  │ 17.41 µs       │ 1.4979 ***
│  │  ├─ (100000, 0.1, 0.95)    │ 25.87 µs  │ 25.04 µs       │ 1.0331
│  │  ├─ (100000, 0.1, 1.0)     │ 19.33 µs  │ 21.08 µs       │ 0.9169
│  │  ├─ (10000000, 0.0, 0.25)  │ 2.067 ms  │ 2.057 ms       │ 1.0048
│  │  ├─ (10000000, 0.0, 0.95)  │ 2.068 ms  │ 2.055 ms       │ 1.0063
│  │  ├─ (10000000, 0.0, 1.0)   │ 2.07 ms   │ 1.261 ms       │ 1.6415 ***
│  │  ├─ (10000000, 0.01, 0.25) │ 1.51 ms   │ 2.113 ms       │ 0.7146 ***
│  │  ├─ (10000000, 0.01, 0.95) │ 1.477 ms  │ 2.621 ms       │ 0.5635 ***
│  │  ├─ (10000000, 0.01, 1.0)  │ 1.35 ms   │ 1.346 ms       │ 1.0029
│  │  ├─ (10000000, 0.1, 0.25)  │ 3.765 ms  │ 2.58 ms        │ 1.4593 ***
│  │  ├─ (10000000, 0.1, 0.95)  │ 2.784 ms  │ 3.28 ms        │ 0.8487 ***
│  │  ╰─ (10000000, 0.1, 1.0)   │ 1.764 ms  │ 1.754 ms       │ 1.0057
│  ╰─ f64                       │           │                │
│     ├─ (100000, 0.0, 0.25)    │ 23.33 µs  │ 23.45 µs       │ 0.9948
│     ├─ (100000, 0.0, 0.95)    │ 23.41 µs  │ 23.33 µs       │ 1.0034
│     ├─ (100000, 0.0, 1.0)     │ 23.33 µs  │ 23.49 µs       │ 0.9931
│     ├─ (100000, 0.01, 0.25)   │ 25.58 µs  │ 24.66 µs       │ 1.0373
│     ├─ (100000, 0.01, 0.95)   │ 25.58 µs  │ 25.79 µs       │ 0.9918
│     ├─ (100000, 0.01, 1.0)    │ 24.2 µs   │ 24.62 µs       │ 0.9829
│     ├─ (100000, 0.1, 0.25)    │ 39.83 µs  │ 27.87 µs       │ 1.4291 ***
│     ├─ (100000, 0.1, 0.95)    │ 39.7 µs   │ 39.56 µs       │ 1.0035
│     ├─ (100000, 0.1, 1.0)     │ 34.43 µs  │ 31.66 µs       │ 1.0874
│     ├─ (10000000, 0.0, 0.25)  │ 4.246 ms  │ 4.239 ms       │ 1.0016
│     ├─ (10000000, 0.0, 0.95)  │ 4.227 ms  │ 4.292 ms       │ 0.9848
│     ├─ (10000000, 0.0, 1.0)   │ 4.227 ms  │ 4.246 ms       │ 0.9955
│     ├─ (10000000, 0.01, 0.25) │ 4.696 ms  │ 4.356 ms       │ 1.0780
│     ├─ (10000000, 0.01, 0.95) │ 4.933 ms  │ 4.637 ms       │ 1.0638
│     ├─ (10000000, 0.01, 1.0)  │ 4.538 ms  │ 4.545 ms       │ 0.9984
│     ├─ (10000000, 0.1, 0.25)  │ 7.23 ms   │ 5.304 ms       │ 1.3631 ***
│     ├─ (10000000, 0.1, 0.95)  │ 6.227 ms  │ 5.913 ms       │ 1.0531
│     ╰─ (10000000, 0.1, 1.0)   │ 5.207 ms  │ 5.29 ms        │ 0.9843
```

### Benchmarks before reverting to develop's chunking code

<details>

[1] Seems like this PR is about the same except for compressing really large f64 arrays. The PR that introduced chunking, #924, reported substantially larger reductions in time (~5ms of 29ms) than this increase of ~1ms (of 17ms).

```
alp_compress              │ PR median │ PR mean  │ develop median │ develop mean │
├─ compress_alp           │           │          │                │              │
│  ├─ f32                 │           │          │                │              │
│  │  ├─ (100000, 0.25)   │ 136.4 µs  │ 137.9 µs │ 143 µs         │ 145.9 µs     │
│  │  ├─ (100000, 0.95)   │ 136.3 µs  │ 137.1 µs │ 133.1 µs       │ 134.3 µs     │
│  │  ├─ (100000, 1.0)    │ 136 µs    │ 137.3 µs │ 133.6 µs       │ 134.6 µs     │
│  │  ├─ (10000000, 0.25) │ 13.54 ms  │ 13.67 ms │ 13.74 ms       │ 13.84 ms     │
│  │  ├─ (10000000, 0.95) │ 13.54 ms  │ 13.64 ms │ 13.49 ms       │ 13.59 ms     │
│  │  ╰─ (10000000, 1.0)  │ 13.47 ms  │ 13.57 ms │ 13.58 ms       │ 13.73 ms     │
│  ╰─ f64                 │           │          │                │              │
│     ├─ (100000, 0.25)   │ 152.5 µs  │ 153.9 µs │ 166.1 µs       │ 167.2 µs     │
│     ├─ (100000, 0.95)   │ 152.5 µs  │ 154.3 µs │ 166.4 µs       │ 167 µs       │
│     ├─ (100000, 1.0)    │ 151.5 µs  │ 153 µs   │ 166.2 µs       │ 166.9 µs     │
│     ├─ (10000000, 0.25) │ 16.89 ms  │ 17 ms    │ 15.87 ms       │ 15.91 ms     │
│     ├─ (10000000, 0.95) │ 16.96 ms  │ 17.19 ms │ 16.14 ms       │ 16.12 ms     │
│     ╰─ (10000000, 1.0)  │ 16.93 ms  │ 16.99 ms │ 16.15 ms       │ 16.18 ms     │
╰─ decompress_alp         │           │          │                │              │
   ├─ f32                 │           │          │                │              │
   │  ├─ (100000, 0.25)   │ 12.33 µs  │ 12.4 µs  │ 12.37 µs       │ 12.55 µs     │
   │  ├─ (100000, 0.95)   │ 11.99 µs  │ 12.01 µs │ 12.45 µs       │ 12.58 µs     │
   │  ├─ (100000, 1.0)    │ 11.95 µs  │ 11.98 µs │ 11.91 µs       │ 11.96 µs     │
   │  ├─ (10000000, 0.25) │ 1.233 ms  │ 1.24 ms  │ 2.064 ms       │ 2.088 ms     │
   │  ├─ (10000000, 0.95) │ 1.232 ms  │ 1.235 ms │ 2.063 ms       │ 2.094 ms     │
   │  ╰─ (10000000, 1.0)  │ 1.233 ms  │ 1.236 ms │ 2.061 ms       │ 2.088 ms     │
   ╰─ f64                 │           │          │                │              │
      ├─ (100000, 0.25)   │ 23.29 µs  │ 23.46 µs │ 23.33 µs       │ 23.4 µs      │
      ├─ (100000, 0.95)   │ 22.87 µs  │ 22.92 µs │ 22.99 µs       │ 23.06 µs     │
      ├─ (100000, 1.0)    │ 22.87 µs  │ 23 µs    │ 22.95 µs       │ 23 µs        │
      ├─ (10000000, 0.25) │ 4.254 ms  │ 4.393 ms │ 4.239 ms       │ 4.28 ms      │
      ├─ (10000000, 0.95) │ 4.703 ms  │ 4.639 ms │ 4.27 ms        │ 4.437 ms     │
      ╰─ (10000000, 1.0)  │ 4.479 ms  │ 4.58 ms  │ 4.684 ms       │ 4.618 ms     │
```

</details>
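The patch trimming described in the PR summary (dropping patches that sit on invalid positions so that validity lives only on the encoded array) could look roughly like the sketch below. Everything here is an assumption made for illustration: `trim_invalid_patches`, the plain `&[bool]` validity mask, and the index/value slices are hypothetical stand-ins, not the Vortex array types.

```rust
/// Hypothetical helper (not the Vortex API): keep only the patches whose
/// position is marked valid. `indices[k]` is the position of patch `k` in the
/// original array and `values[k]` is its exact floating-point value.
fn trim_invalid_patches(
    indices: &[usize],
    values: &[f64],
    validity: &[bool],
) -> (Vec<usize>, Vec<f64>) {
    assert_eq!(indices.len(), values.len());

    indices
        .iter()
        .zip(values.iter())
        // A patch on a null position carries no information, so drop it.
        .filter(|&(&i, _)| validity[i])
        .map(|(&i, &v)| (i, v))
        .unzip()
}
```

After trimming, every remaining patch is valid by construction, so under this sketch the patches need no validity mask of their own.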
fixes #920
Consistently cuts encoding time by 10-50%.
Before the change:
After: