faster ALP encode #924

Merged
merged 17 commits into develop from wm/branchless-alp
Sep 25, 2024

@lwwmanning lwwmanning commented Sep 25, 2024

fixes #920

Consistently cuts encoding time by 10-50%.

Before the change:

Running benches/alp_compress.rs (target/release/deps/alp_compress-abbdaefc5eabf343)
Timer precision: 41 ns
alp_compress          fastest       │ slowest       │ median        │ mean          │ samples │ iters
├─ alp_compress                     │               │               │               │         │
│  ├─ f32                           │               │               │               │         │
│  │  ├─ 100000       191.9 µs      │ 824.9 µs      │ 314.7 µs      │ 354 µs        │ 100     │ 100
│  │  ╰─ 10000000     21.39 ms      │ 28.95 ms      │ 21.71 ms      │ 21.89 ms      │ 100     │ 100
│  ╰─ f64                           │               │               │               │         │
│     ├─ 100000       236 µs        │ 353.7 µs      │ 238.4 µs      │ 246.4 µs      │ 100     │ 100
│     ╰─ 10000000     28.78 ms      │ 68.68 ms      │ 29.49 ms      │ 29.93 ms      │ 100     │ 100

After:

Running benches/alp_compress.rs (target/release/deps/alp_compress-abbdaefc5eabf343)
Timer precision: 41 ns
alp_compress          fastest       │ slowest       │ median        │ mean          │ samples │ iters
├─ alp_compress                     │               │               │               │         │
│  ├─ f32                           │               │               │               │         │
│  │  ├─ 100000       161 µs        │ 234.6 µs      │ 163.3 µs      │ 166 µs        │ 100     │ 100
│  │  ╰─ 10000000     18.72 ms      │ 21.54 ms      │ 19.07 ms      │ 19.14 ms      │ 100     │ 100
│  ╰─ f64                           │               │               │               │         │
│     ├─ 100000       182 µs        │ 346 µs        │ 183.9 µs      │ 187.9 µs      │ 100     │ 100
│     ╰─ 10000000     23.98 ms      │ 28.71 ms      │ 24.52 ms      │ 24.53 ms      │ 100     │ 100
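
As a rough check of the headline figure, the per-benchmark reduction can be recomputed from the median columns of the two tables above. This is only an illustrative sketch; the helper name `percent_reduction` is made up for this example and is not part of the benchmark harness.

```rust
/// Percentage reduction going from `before` to `after` (both in the same unit).
/// Hypothetical helper for illustration; not part of the Vortex code base.
fn percent_reduction(before: f64, after: f64) -> f64 {
    (before - after) / before * 100.0
}

/// Medians copied from the tables above: (label, before, after).
/// Units are µs for the 100000-element rows and ms for the 10000000-element rows.
const MEDIANS: [(&str, f64, f64); 4] = [
    ("f32 / 100000", 314.7, 163.3),
    ("f32 / 10000000", 21.71, 19.07),
    ("f64 / 100000", 238.4, 183.9),
    ("f64 / 10000000", 29.49, 24.52),
];

/// Smallest and largest median reduction across the four benchmarks.
fn reduction_range() -> (f64, f64) {
    let mut min = f64::INFINITY;
    let mut max = f64::NEG_INFINITY;
    for &(_, before, after) in MEDIANS.iter() {
        let r = percent_reduction(before, after);
        min = min.min(r);
        max = max.max(r);
    }
    (min, max)
}
```

On these medians the reductions fall between roughly 12% and 48%, which is consistent with the 10-50% claim.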

@lwwmanning lwwmanning changed the title from "branchless ALP encode" to "faster ALP encode" Sep 25, 2024
@lwwmanning lwwmanning marked this pull request as ready for review September 25, 2024 14:35
@a10y a10y left a comment

nice!

@robert3005 robert3005 left a comment

one small nit

@lwwmanning lwwmanning enabled auto-merge (squash) September 25, 2024 15:04
@lwwmanning lwwmanning merged commit a7fd730 into develop Sep 25, 2024
5 checks passed
@lwwmanning lwwmanning deleted the wm/branchless-alp branch September 25, 2024 15:17
}

// if there are no patches, we are done
if chunk_patch_count == 0 {

Need to handle the edge case of 2 chunks where chunk 0 is all patches and chunk 1 has 0 patches, in which case the patched values in chunk 0 won't get filled.

lwwmanning added a commit that referenced this pull request Sep 26, 2024
Realized that there's an unhandled edge case in #924, [commented
here](https://github.com/spiraldb/vortex/pull/924/files#r1776099681)

Essentially, on develop, if we have two chunks and the first chunk is
all patches and the second chunk has 0 patches, then the patched values
won't get filled in the encoded array. Not the end of the world (they're
presumably full of integer approximations that don't round-trip), but if
it's a case of outlier large values that are getting patched, then the
encoded values will end up bitpacking poorly.

This PR fixes that.
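
To make the edge case concrete, here is a minimal sketch of chunked encoding in which patched slots are overwritten with a fill value. It is not the code from this PR or from `alp.rs`: `encode_one`, `encode_chunked`, and the choice of the first round-tripping value as the fill are all assumptions made for illustration. The point it shows is that the fill step has to run even when the current chunk has zero patches, because an earlier all-patches chunk may still have slots waiting for a fill value.

```rust
/// Hypothetical stand-in for ALP's per-value encode: scale by 10^exp, round,
/// and report whether decoding gives back exactly the input.
fn encode_one(value: f64, exp: i32) -> (i64, bool) {
    let scale = 10f64.powi(exp);
    let rounded = (value * scale).round();
    (rounded as i64, rounded / scale == value)
}

/// Encode `values` chunk by chunk (assumes `chunk_size > 0`).
/// Returns the encoded integers, with patched slots overwritten by a fill
/// value where one is available, plus the indices of the patched values.
fn encode_chunked(values: &[f64], chunk_size: usize, exp: i32) -> (Vec<i64>, Vec<usize>) {
    let mut encoded: Vec<i64> = Vec::with_capacity(values.len());
    let mut patch_indices: Vec<usize> = Vec::new();
    // Patched slots that still hold a lossy approximation.
    let mut unfilled: Vec<usize> = Vec::new();
    // First encoded value that round-trips; reused as the fill value.
    let mut fill: Option<i64> = None;

    for chunk in values.chunks(chunk_size) {
        for &v in chunk {
            let (e, round_trips) = encode_one(v, exp);
            let idx = encoded.len();
            encoded.push(e);
            if round_trips {
                fill = fill.or(Some(e));
            } else {
                patch_indices.push(idx);
                unfilled.push(idx);
            }
        }
        // Deliberately NOT skipped when this chunk has zero patches:
        // slots from an earlier all-patches chunk may still be pending.
        if let Some(f) = fill {
            for idx in unfilled.drain(..) {
                encoded[idx] = f;
            }
        }
    }
    // If no value ever round-trips, the approximations are left in place.
    (encoded, patch_indices)
}
```

For example, with a chunk size of 2 and exponent 1, the input `[PI, E, 1.5, 2.5]` puts both patches in chunk 0 and none in chunk 1. The sketch still overwrites slots 0 and 1 with the fill value 15 taken from chunk 1, so the encoded array is `[15, 15, 15, 25]` rather than retaining the approximations 31 and 27.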
danking added a commit that referenced this pull request Feb 3, 2025
This PR trims invalid values from the patches and makes the patches
validity either AllValid (for nullable arrays) or NonNullable.

This microbenchmark doesn't reveal any clear improvements or
degradations; the differences look mostly like noise. In theory, this change should
make decompression a bit faster because validity is in one place, but my
primary goal here is to make the ALP array simpler: validity is in one
place, the encoded array.
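
A rough sketch of what trimming invalid values from the patches could look like, assuming patches are stored as parallel index and value slices and validity as a plain boolean mask. The function name and this representation are hypothetical and chosen for illustration; the actual Vortex types for patches and validity are different.

```rust
/// Keep only the patches whose position is valid (non-null) in the array.
/// `validity[i] == true` means element `i` is valid.
/// Hypothetical helper; assumes every index in `indices` is within `validity`.
fn trim_invalid_patches(
    indices: &[usize],
    values: &[f64],
    validity: &[bool],
) -> (Vec<usize>, Vec<f64>) {
    let mut kept_indices = Vec::new();
    let mut kept_values = Vec::new();
    for (&i, &v) in indices.iter().zip(values) {
        // A patch at a null position carries no information, so drop it.
        if validity[i] {
            kept_indices.push(i);
            kept_values.push(v);
        }
    }
    (kept_indices, kept_values)
}
```

After trimming, every remaining patch sits at a valid position, so the patches need no validity mask of their own and nullness can be tracked solely on the encoded array.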

### Benchmarks on latest commit: 

- PR: 7fb595b
- develop: 0a18498

Each parameter tuple is (number of elements, fraction patched, fraction valid).

The ratio column is PR median divided by develop median. Any ratio greater than 1.1 or less than 0.9 is marked with `***`.

```
alp_compress                    │ PR median     │ develop median │ ratio
├─ compress_alp                 │               │                │
│  ├─ f32                       │               │                │
│  │  ├─ (100000, 0.0, 0.25)    │ 160.4 µs      │ 159.6 µs       │ 1.0050
│  │  ├─ (100000, 0.0, 0.95)    │ 145.9 µs      │ 143.8 µs       │ 1.0146
│  │  ├─ (100000, 0.0, 1.0)     │ 137.0 µs      │ 135.5 µs       │ 1.0110
│  │  ├─ (100000, 0.01, 0.25)   │ 227.7 µs      │ 230.7 µs       │ 0.9869
│  │  ├─ (100000, 0.01, 0.95)   │ 227.9 µs      │ 227.2 µs       │ 1.0030
│  │  ├─ (100000, 0.01, 1.0)    │ 226.6 µs      │ 227.5 µs       │ 0.9960
│  │  ├─ (100000, 0.1, 0.25)    │ 238.3 µs      │ 248.9 µs       │ 0.9574
│  │  ├─ (100000, 0.1, 0.95)    │ 238.2 µs      │ 269.8 µs       │ 0.8828  ***
│  │  ├─ (100000, 0.1, 1.0)     │ 230.6 µs      │ 231.9 µs       │ 0.9943
│  │  ├─ (10000000, 0.0, 0.25)  │ 14.17 ms      │ 13.77 ms       │ 1.0290
│  │  ├─ (10000000, 0.0, 0.95)  │ 14.16 ms      │ 13.8 ms        │ 1.0260
│  │  ├─ (10000000, 0.0, 1.0)   │ 14.0 ms       │ 12.47 ms       │ 1.1226  ***
│  │  ├─ (10000000, 0.01, 0.25) │ 22.29 ms      │ 23.13 ms       │ 0.9636
│  │  ├─ (10000000, 0.01, 0.95) │ 22.26 ms      │ 23.78 ms       │ 0.9360
│  │  ├─ (10000000, 0.01, 1.0)  │ 22.19 ms      │ 21.79 ms       │ 1.0183
│  │  ├─ (10000000, 0.1, 0.25)  │ 23.31 ms      │ 27.72 ms       │ 0.8409  ***
│  │  ├─ (10000000, 0.1, 0.95)  │ 23.4 ms       │ 27.47 ms       │ 0.8518  ***
│  │  ╰─ (10000000, 0.1, 1.0)   │ 22.99 ms      │ 22.31 ms       │ 1.0304
│  ╰─ f64                       │               │                │
│     ├─ (100000, 0.0, 0.25)    │ 165.2 µs      │ 165.4 µs       │ 0.9987
│     ├─ (100000, 0.0, 0.95)    │ 166.1 µs      │ 163.4 µs       │ 1.0165
│     ├─ (100000, 0.0, 1.0)     │ 164.7 µs      │ 179.9 µs       │ 0.9155
│     ├─ (100000, 0.01, 0.25)   │ 269.7 µs      │ 259.1 µs       │ 1.0409
│     ├─ (100000, 0.01, 0.95)   │ 270.5 µs      │ 259.6 µs       │ 1.0419
│     ├─ (100000, 0.01, 1.0)    │ 268.9 µs      │ 270.6 µs       │ 0.9937
│     ├─ (100000, 0.1, 0.25)    │ 281.7 µs      │ 281.3 µs       │ 1.0014
│     ├─ (100000, 0.1, 0.95)    │ 279.1 µs      │ 315.3 µs       │ 0.8851  ***
│     ├─ (100000, 0.1, 1.0)     │ 273.0 µs      │ 275.7 µs       │ 0.9902
│     ├─ (10000000, 0.0, 0.25)  │ 16.16 ms      │ 15.86 ms       │ 1.0189
│     ├─ (10000000, 0.0, 0.95)  │ 16.19 ms      │ 15.75 ms       │ 1.0279
│     ├─ (10000000, 0.0, 1.0)   │ 16.2 ms       │ 15.83 ms       │ 1.0233
│     ├─ (10000000, 0.01, 0.25) │ 25.29 ms      │ 25.77 ms       │ 0.9813
│     ├─ (10000000, 0.01, 0.95) │ 25.74 ms      │ 25.94 ms       │ 0.9922
│     ├─ (10000000, 0.01, 1.0)  │ 25.54 ms      │ 25.32 ms       │ 1.0086
│     ├─ (10000000, 0.1, 0.25)  │ 26.89 ms      │ 30.73 ms       │ 0.8750  ***
│     ├─ (10000000, 0.1, 0.95)  │ 27.05 ms      │ 30.53 ms       │ 0.8860  ***
│     ╰─ (10000000, 0.1, 1.0)   │ 26.22 ms      │ 25.98 ms       │ 1.0092
├─ decompress_alp               │               │                │
│  ├─ f32                       │               │                │
│  │  ├─ (100000, 0.0, 0.25)    │ 12.24 µs      │ 12.33 µs       │ 0.9927
│  │  ├─ (100000, 0.0, 0.95)    │ 12.24 µs      │ 12.16 µs       │ 1.0065
│  │  ├─ (100000, 0.0, 1.0)     │ 12.2 µs       │ 12.16 µs       │ 1.0032
│  │  ├─ (100000, 0.01, 0.25)   │ 15.12 µs      │ 14.04 µs       │ 1.0769
│  │  ├─ (100000, 0.01, 0.95)   │ 14.95 µs      │ 14.81 µs       │ 1.0094
│  │  ├─ (100000, 0.01, 1.0)    │ 13.43 µs      │ 13.24 µs       │ 1.0143
│  │  ├─ (100000, 0.1, 0.25)    │ 26.08 µs      │ 17.41 µs       │ 1.4979  ***
│  │  ├─ (100000, 0.1, 0.95)    │ 25.87 µs      │ 25.04 µs       │ 1.0331
│  │  ├─ (100000, 0.1, 1.0)     │ 19.33 µs      │ 21.08 µs       │ 0.9169
│  │  ├─ (10000000, 0.0, 0.25)  │ 2.067 ms      │ 2.057 ms       │ 1.0048
│  │  ├─ (10000000, 0.0, 0.95)  │ 2.068 ms      │ 2.055 ms       │ 1.0063
│  │  ├─ (10000000, 0.0, 1.0)   │ 2.07 ms       │ 1.261 ms       │ 1.6415  ***
│  │  ├─ (10000000, 0.01, 0.25) │ 1.51 ms       │ 2.113 ms       │ 0.7146  ***
│  │  ├─ (10000000, 0.01, 0.95) │ 1.477 ms      │ 2.621 ms       │ 0.5635  ***
│  │  ├─ (10000000, 0.01, 1.0)  │ 1.35 ms       │ 1.346 ms       │ 1.0029
│  │  ├─ (10000000, 0.1, 0.25)  │ 3.765 ms      │ 2.58 ms        │ 1.4593  ***
│  │  ├─ (10000000, 0.1, 0.95)  │ 2.784 ms      │ 3.28 ms        │ 0.8487  ***
│  │  ╰─ (10000000, 0.1, 1.0)   │ 1.764 ms      │ 1.754 ms       │ 1.0057
│  ╰─ f64                       │               │                │
│     ├─ (100000, 0.0, 0.25)    │ 23.33 µs      │ 23.45 µs       │ 0.9948
│     ├─ (100000, 0.0, 0.95)    │ 23.41 µs      │ 23.33 µs       │ 1.0034
│     ├─ (100000, 0.0, 1.0)     │ 23.33 µs      │ 23.49 µs       │ 0.9931
│     ├─ (100000, 0.01, 0.25)   │ 25.58 µs      │ 24.66 µs       │ 1.0373
│     ├─ (100000, 0.01, 0.95)   │ 25.58 µs      │ 25.79 µs       │ 0.9918
│     ├─ (100000, 0.01, 1.0)    │ 24.2 µs       │ 24.62 µs       │ 0.9829
│     ├─ (100000, 0.1, 0.25)    │ 39.83 µs      │ 27.87 µs       │ 1.4291  ***
│     ├─ (100000, 0.1, 0.95)    │ 39.7 µs       │ 39.56 µs       │ 1.0035
│     ├─ (100000, 0.1, 1.0)     │ 34.43 µs      │ 31.66 µs       │ 1.0874
│     ├─ (10000000, 0.0, 0.25)  │ 4.246 ms      │ 4.239 ms       │ 1.0016
│     ├─ (10000000, 0.0, 0.95)  │ 4.227 ms      │ 4.292 ms       │ 0.9848
│     ├─ (10000000, 0.0, 1.0)   │ 4.227 ms      │ 4.246 ms       │ 0.9955
│     ├─ (10000000, 0.01, 0.25) │ 4.696 ms      │ 4.356 ms       │ 1.0780
│     ├─ (10000000, 0.01, 0.95) │ 4.933 ms      │ 4.637 ms       │ 1.0638
│     ├─ (10000000, 0.01, 1.0)  │ 4.538 ms      │ 4.545 ms       │ 0.9984
│     ├─ (10000000, 0.1, 0.25)  │ 7.23 ms       │ 5.304 ms       │ 1.3631  ***
│     ├─ (10000000, 0.1, 0.95)  │ 6.227 ms      │ 5.913 ms       │ 1.0531
│     ╰─ (10000000, 0.1, 1.0)   │ 5.207 ms      │ 5.29 ms        │ 0.9843
```
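
The ratio column and the `***` marker in the table above follow a simple rule, sketched here. The function name is made up for this example, and because the displayed medians are rounded, a recomputed ratio may differ from the table in the last digit.

```rust
/// Ratio of PR median to develop median, plus whether the row would be
/// flagged with `***` (ratio above 1.1 or below 0.9).
/// Hypothetical helper for illustration; both medians must use the same unit.
fn ratio_and_flag(pr_median: f64, develop_median: f64) -> (f64, bool) {
    let ratio = pr_median / develop_median;
    (ratio, ratio > 1.1 || ratio < 0.9)
}
```

For instance, the f32 compress row `(100000, 0.1, 0.95)` with medians 238.2 µs and 269.8 µs gives a ratio of about 0.883 and is flagged, while `(100000, 0.0, 0.25)` with 160.4 µs and 159.6 µs gives about 1.005 and is not.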

### Benchmarks before reverting to develop's chunking code
<details>

[1] Seems like this PR is about the same except for compressing really
large f64 arrays. The PR that introduced chunking, #924, reported
substantially larger reductions (~5ms of 29ms) in time than this
increase of ~1ms (of 17ms).
```
alp_compress               │ PR median     │ PR mean   │ develop median │ develop mean │
├─ compress_alp            │               │           │                │              │
│  ├─ f32                  │               │           │                │              │
│  │  ├─ (100000, 0.25)    │ 136.4 µs      │ 137.9 µs  │ 143 µs         │ 145.9 µs     │
│  │  ├─ (100000, 0.95)    │ 136.3 µs      │ 137.1 µs  │ 133.1 µs       │ 134.3 µs     │
│  │  ├─ (100000, 1.0)     │ 136 µs        │ 137.3 µs  │ 133.6 µs       │ 134.6 µs     │
│  │  ├─ (10000000, 0.25)  │ 13.54 ms      │ 13.67 ms  │ 13.74 ms       │ 13.84 ms     │
│  │  ├─ (10000000, 0.95)  │ 13.54 ms      │ 13.64 ms  │ 13.49 ms       │ 13.59 ms     │
│  │  ╰─ (10000000, 1.0)   │ 13.47 ms      │ 13.57 ms  │ 13.58 ms       │ 13.73 ms     │
│  ╰─ f64                  │               │           │                │              │
│     ├─ (100000, 0.25)    │ 152.5 µs      │ 153.9 µs  │ 166.1 µs       │ 167.2 µs     │
│     ├─ (100000, 0.95)    │ 152.5 µs      │ 154.3 µs  │ 166.4 µs       │ 167 µs       │
│     ├─ (100000, 1.0)     │ 151.5 µs      │ 153 µs    │ 166.2 µs       │ 166.9 µs     │
│     ├─ (10000000, 0.25)  │ 16.89 ms      │ 17 ms     │ 15.87 ms       │ 15.91 ms     │
│     ├─ (10000000, 0.95)  │ 16.96 ms      │ 17.19 ms  │ 16.14 ms       │ 16.12 ms     │
│     ╰─ (10000000, 1.0)   │ 16.93 ms      │ 16.99 ms  │ 16.15 ms       │ 16.18 ms     │
╰─ decompress_alp          │               │           │                │              │
   ├─ f32                  │               │           │                │              │
   │  ├─ (100000, 0.25)    │ 12.33 µs      │ 12.4 µs   │ 12.37 µs       │ 12.55 µs     │
   │  ├─ (100000, 0.95)    │ 11.99 µs      │ 12.01 µs  │ 12.45 µs       │ 12.58 µs     │
   │  ├─ (100000, 1.0)     │ 11.95 µs      │ 11.98 µs  │ 11.91 µs       │ 11.96 µs     │
   │  ├─ (10000000, 0.25)  │ 1.233 ms      │ 1.24 ms   │ 2.064 ms       │ 2.088 ms     │
   │  ├─ (10000000, 0.95)  │ 1.232 ms      │ 1.235 ms  │ 2.063 ms       │ 2.094 ms     │
   │  ╰─ (10000000, 1.0)   │ 1.233 ms      │ 1.236 ms  │ 2.061 ms       │ 2.088 ms     │
   ╰─ f64                  │               │           │                │              │
      ├─ (100000, 0.25)    │ 23.29 µs      │ 23.46 µs  │ 23.33 µs       │ 23.4 µs      │
      ├─ (100000, 0.95)    │ 22.87 µs      │ 22.92 µs  │ 22.99 µs       │ 23.06 µs     │
      ├─ (100000, 1.0)     │ 22.87 µs      │ 23 µs     │ 22.95 µs       │ 23 µs        │
      ├─ (10000000, 0.25)  │ 4.254 ms      │ 4.393 ms  │ 4.239 ms       │ 4.28 ms      │
      ├─ (10000000, 0.95)  │ 4.703 ms      │ 4.639 ms  │ 4.27 ms        │ 4.437 ms     │
      ╰─ (10000000, 1.0)   │ 4.479 ms      │ 4.58 ms   │ 4.684 ms       │ 4.618 ms     │
```

</details>
Successfully merging this pull request may close these issues.

ALP has a lot of branching
4 participants