Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add value_type to Column.from_vector and expected_value_type to Column.map and Column.zip #7637

Merged
merged 32 commits into from
Aug 31, 2023

Conversation

radeusgd
Copy link
Member

@radeusgd radeusgd commented Aug 22, 2023

Pull Request Description

Important Notes

Checklist

Please ensure that the following checklist has been satisfied before submitting the PR:

  • The documentation has been updated, if necessary.
  • Screenshots/screencasts have been attached, if there are any visual changes. For interactive or animated visual changes, a screencast is preferred.
  • All code follows the
    Scala,
    Java,
    and
    Rust
    style guides. In case you are using a language not listed above, follow the Rust style guide.
  • All code has been tested:
    • Unit tests have been written where possible.
    • If GUI codebase was changed, the GUI was tested when built using ./run ide build.

@radeusgd radeusgd self-assigned this Aug 22, 2023
@radeusgd radeusgd changed the base branch from develop to wip/radeusgd/5159-new-inmemory-value-types August 22, 2023 17:16
Base automatically changed from wip/radeusgd/5159-new-inmemory-value-types to develop August 22, 2023 18:10
@radeusgd radeusgd force-pushed the wip/radeusgd/6111-value-type-to-column-from_vector branch from 09469dd to 748e299 Compare August 23, 2023 11:25
@radeusgd radeusgd force-pushed the wip/radeusgd/6111-value-type-to-column-from_vector branch from 748e299 to b2f7c30 Compare August 23, 2023 17:53
@radeusgd radeusgd marked this pull request as ready for review August 23, 2023 17:54
@@ -15,6 +15,10 @@ public static TextType variableLengthWithLimit(long maxLength) {
}

public boolean fits(String string) {
if (string == null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is null a valid representation for an empty string? It seems like this should be filtered out earlier.

Copy link
Member Author

@radeusgd radeusgd Aug 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not an empty string, it's a missing string and it is valid (our columns contain nulls for missing values in a row).

h x = "_"+x.to_text
r4 = c1.map h expected_value_type=(Value_Type.Char variable_length=False size=3)
r4.should_fail_with Invalid_Value_Type
r4.catch.to_display_text . should_contain "Expected type Char (fixed length, size=3), but got a value _1 of type Char (fixed length, size=2)"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious -- why is .catch needed here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would likely work without it, but on principle I want to catch the dataflow error and lift it to value and inspect its display text, not the display text of the 'error value'. It's subtle and in case it is should_contain does not really matter. It would matter if I used should_equal - without catch I'd then get a Error: prefix.

@radeusgd
Copy link
Member Author

radeusgd commented Aug 28, 2023

As suggested by @JaroslavTulach I've added a benchmark comparing the performance of Column.from_vector depending on the expected_value_type.

The results were as follows:

expected_value_type Average iteration time [ms]
Integer Bits_64 33.186
Integer Bits_16 32.544
Float 40.593
Auto 104.734

We can see that specifying the data-type can be as much as 3x faster than relying on Auto inference.

Given past experiments in Exploratory_Benchmarks, I suspect a big chunk of that difference is the ability to skip the more expensive polyglot conversion code that is required in Auto to support date-time values (if we expect the type to be integer, we can discard any non-fitting values, so we no longer need the special handling). I assume that if we ever want to improve this, we could try adding some more 'speculation' having an Auto fast-path that does a bit less checks until we have to fall into the date-time support slower path.

The slight overhead of Float is probably the cost of converting every value into a floating point number.

Curiously the timings between 16-bit and 64-bit values are comparable (16-bit part seemed to even be faster in my run, but I think it may be due to warmup, the timings are very similar). Normally, I'd expect 16-bit to be slower, because they have the added check that ensures that the integer values fit in the 16-bit numeric range. Somehow, there is no noticeable difference.

I guess this may suggest there is room for optimization for the 64-bit case (it should be doing slightly less work, so it seems it should be faster). OR it may be that the JIT is very good and given all data in our benchmark is fitting the 16-bit limit, it may be pipelining the check very well. IF we want to investigate this further, we could want to add benchmarks where some values do not fit the target type to see how that affects the performance.

Raw results
Found 4 cases to execute
Benchmarking 'Column_from_vector_1000000.Integers_type_Integer_64_bit' with configuration: [warmup={2 iterations, 3 seconds each}, measurement={2 iterations, 3 seconds each}]
Warmup duration:    6083.6866 ms
Warmup invocations: 146
Warmup avg time:    41.274 ms
Measurement duration:    6013.2974 ms
Measurement invocations: 181
Measurement avg time:    33.186 ms
Benchmark 'Column_from_vector_1000000.Integers_type_Integer_64_bit' finished in 12108.774 ms
Benchmarking 'Column_from_vector_1000000.Integers_type_Integer_checked_16_bit' with configuration: [warmup={2 iterations, 3 seconds each}, measurement={2 iterations, 3 seconds each}]
Warmup duration:    6045.9561 ms
Warmup invocations: 99
Warmup avg time:    60.943 ms
Measurement duration:    6027.8112 ms
Measurement invocations: 185
Measurement avg time:    32.544 ms
Benchmark 'Column_from_vector_1000000.Integers_type_Integer_checked_16_bit' finished in 12076.45 ms
Benchmarking 'Column_from_vector_1000000.Integers_type_Float' with configuration: [warmup={2 iterations, 3 seconds each}, measurement={2 iterations, 3 seconds each}]
Warmup duration:    6011.7149 ms
Warmup invocations: 88
Warmup avg time:    68.246 ms
Measurement duration:    6010.6347 ms
Measurement invocations: 148
Measurement avg time:    40.593 ms
Benchmark 'Column_from_vector_1000000.Integers_type_Float' finished in 12025.853 ms
Benchmarking 'Column_from_vector_1000000.Integers_type_Auto' with configuration: [warmup={2 iterations, 3 seconds each}, measurement={2 iterations, 3 seconds each}]
Warmup duration:    6143.0352 ms
Warmup invocations: 35
Warmup avg time:    175.502 ms
Measurement duration:    6074.8422 ms
Measurement invocations: 58
Measurement avg time:    104.734 ms
Benchmark 'Column_from_vector_1000000.Integers_type_Auto' finished in 12219.448 ms

@JaroslavTulach
Copy link
Member

timings between 16-bit and 64-bit values are comparable

If the storage is long[] then there should be minimal difference. And there is. One would need IGV to find what is the cause.

We can see that specifying the data-type can be as much as 3x faster than relying on Auto inference.

The difference here is so huge that using VisualVM Polyglot Sampler could highlight where the time is spent.

Copy link
Member

@JaroslavTulach JaroslavTulach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Akirathan will be happy to see the benchmark code.

Copy link
Member

@jdunkerley jdunkerley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

@radeusgd radeusgd added CI: Ready to merge This PR is eligible for automatic merge CI: Clean build required CI runners will be cleaned before and after this PR is built. labels Aug 31, 2023
@mergify mergify bot merged commit 255b424 into develop Aug 31, 2023
24 checks passed
@mergify mergify bot deleted the wip/radeusgd/6111-value-type-to-column-from_vector branch August 31, 2023 13:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CI: Clean build required CI runners will be cleaned before and after this PR is built. CI: Ready to merge This PR is eligible for automatic merge
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add value_type support to from_vector, map and zip in in-memory Column.
4 participants