Add `value_type` to `Column.from_vector` and `expected_value_type` to `Column.map` and `Column.zip` #7637

radeusgd · 2023-08-22T17:15:47Z

Pull Request Description

- Closes #6111 (Add `value_type` support to `from_vector`, `map` and `zip` in in-memory `Column`).
- Aligns the semantics of handling `Mixed` columns.
  - Now, if an operation like `iif` or `fill_nothing` is given a `Mixed` column, the result is also `Mixed`, regardless of the `inferred_precise_value_type`.
- Enables a few old tests that were pending and can now run, since the value types work is advanced enough.

Important Notes

Checklist

Please ensure that the following checklist has been satisfied before submitting the PR:

The documentation has been updated, if necessary.
Screenshots/screencasts have been attached, if there are any visual changes. For interactive or animated visual changes, a screencast is preferred.
All code follows the Scala, Java, and Rust style guides. In case you are using a language not listed above, follow the Rust style guide.
All code has been tested:
- Unit tests have been written where possible.
- If GUI codebase was changed, the GUI was tested when built using ./run ide build.

GregoryTravis · 2023-08-23T18:12:12Z

std-bits/table/src/main/java/org/enso/table/data/column/storage/type/TextType.java

@@ -15,6 +15,10 @@ public static TextType variableLengthWithLimit(long maxLength) {
  }

  public boolean fits(String string) {
+    if (string == null) {


Is null a valid representation for an empty string? It seems like this should be filtered out earlier.

radeusgd · 2023-08-24

It's not an empty string; it's a missing string, and it is valid (our columns contain nulls for missing values in a row).
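The distinction between a missing value and an empty string can be illustrated with a minimal sketch. This is not the actual `TextType` code (the diff above is truncated): the class name `TextTypeSketch`, the `maxLength` field, the negative "no limit" sentinel, and the decision to return `true` for `null` are all assumptions made for illustration.

```java
/** Hypothetical stand-in for a text storage type; not the real Enso class. */
final class TextTypeSketch {
  // Assumed convention: a negative limit means "no length limit".
  private final long maxLength;

  TextTypeSketch(long maxLength) {
    this.maxLength = maxLength;
  }

  /** Checks whether a cell value can be stored in a column of this type. */
  boolean fits(String string) {
    if (string == null) {
      // null marks a missing value in the row, not an empty string.
      // Assumption: a missing value fits any text type.
      return true;
    }
    if (maxLength < 0) {
      return true;
    }
    // Simplification: length() counts UTF-16 code units.
    return string.length() <= maxLength;
  }
}
```

With this shape, a column builder can pass every cell through `fits` without filtering out missing values first.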

GregoryTravis · 2023-08-23T18:16:28Z

test/Table_Tests/src/Common_Table_Operations/Map_Spec.enso

+                h x = "_"+x.to_text
+                r4 = c1.map h expected_value_type=(Value_Type.Char variable_length=False size=3)
+                r4.should_fail_with Invalid_Value_Type
+                r4.catch.to_display_text . should_contain "Expected type Char (fixed length, size=3), but got a value _1 of type Char (fixed length, size=2)"


Curious -- why is .catch needed here?

radeusgd · 2023-08-24

It would likely work without it, but on principle I want to catch the dataflow error, lift it to a value, and inspect that value's display text rather than the display text of the 'error value'. The difference is subtle, and in this case, with `should_contain`, it does not really matter. It would matter if I used `should_equal`: without `catch` I would then get an `Error:` prefix.

test/Table_Tests/src/Common_Table_Operations/Missing_Values_Spec.enso

radeusgd · 2023-08-28T17:05:00Z

As suggested by @JaroslavTulach I've added a benchmark comparing the performance of Column.from_vector depending on the expected_value_type.

The results were as follows:

| `expected_value_type` | Average iteration time [ms] |
|---|---|
| Integer Bits_64 | 33.186 |
| Integer Bits_16 | 32.544 |
| Float | 40.593 |
| Auto | 104.734 |

We can see that specifying the value type can make `Column.from_vector` roughly 3x faster than relying on `Auto` inference (about 33 ms vs. about 105 ms per iteration for the integer types).
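The ratio behind the "3x" figure can be checked directly from the measured averages. The helper below is only an illustrative sketch; the class and method names are made up for this example and are not part of the benchmark code.

```java
/** Hypothetical helper for comparing average iteration times. */
final class SpeedupSketch {
  /** How many times faster the candidate is than the baseline. */
  static double speedup(double baselineMs, double candidateMs) {
    return baselineMs / candidateMs;
  }

  /** Speedups over Auto for Integer Bits_64, Integer Bits_16 and Float, in that order. */
  static double[] speedupsOverAuto() {
    double autoMs = 104.734;
    return new double[] {
      speedup(autoMs, 33.186),
      speedup(autoMs, 32.544),
      speedup(autoMs, 40.593)
    };
  }
}
```

For the numbers in the table this gives roughly 3.16x for `Integer Bits_64`, 3.22x for `Integer Bits_16` and 2.58x for `Float`.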

Given past experiments in Exploratory_Benchmarks, I suspect a big chunk of that difference comes from skipping the more expensive polyglot conversion code that `Auto` requires to support date-time values (if we expect the type to be integer, we can discard any non-fitting values, so we no longer need the special handling). If we ever want to improve this, we could try adding more 'speculation': an `Auto` fast path that does slightly fewer checks until we have to fall back to the slower path with date-time support.

The slight overhead of Float is probably the cost of converting every value into a floating point number.

Curiously, the timings for 16-bit and 64-bit values are comparable (the 16-bit case even seemed faster in my run, but I think that may be due to warmup; the timings are very similar). Normally, I'd expect 16-bit to be slower, because it has the added check that ensures the integer values fit in the 16-bit numeric range. Somehow, there is no noticeable difference.

I guess this may suggest there is room for optimization in the 64-bit case (it should be doing slightly less work, so it should be faster). Or it may be that the JIT is very good and, given that all data in our benchmark fits the 16-bit limit, it is pipelining the check very well. If we want to investigate this further, we could add benchmarks where some values do not fit the target type to see how that affects performance.
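To make the expected extra work concrete, here is a minimal sketch of a checked 16-bit build next to an unchecked 64-bit one. It assumes both cases store values in a `long[]`; the class and method names are hypothetical and the real Enso storage builders may be structured differently.

```java
/** Hypothetical sketch of the per-value range check; not the real Enso builder. */
final class Int16RangeCheckSketch {
  /** True when the value can be represented as a signed 16-bit integer. */
  static boolean fitsIn16Bits(long value) {
    return value >= Short.MIN_VALUE && value <= Short.MAX_VALUE;
  }

  /** Copies values into a long[] storage, failing on the first value that does not fit. */
  static long[] buildChecked16(long[] values) {
    long[] storage = new long[values.length];
    for (int i = 0; i < values.length; i++) {
      if (!fitsIn16Bits(values[i])) {
        throw new IllegalArgumentException(
            "Value " + values[i] + " at index " + i + " does not fit in 16 bits");
      }
      storage[i] = values[i];
    }
    return storage;
  }

  /** 64-bit case: a plain copy with no per-value check. */
  static long[] build64(long[] values) {
    return values.clone();
  }
}
```

When every input fits, the failure branch in `buildChecked16` is never taken, so the check is a pair of cheap comparisons per element. That would be consistent with the similar timings, though only a benchmark with non-fitting values or a look at the compiled code could confirm it.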

Raw results

Found 4 cases to execute
Benchmarking 'Column_from_vector_1000000.Integers_type_Integer_64_bit' with configuration: [warmup={2 iterations, 3 seconds each}, measurement={2 iterations, 3 seconds each}]
Warmup duration:    6083.6866 ms
Warmup invocations: 146
Warmup avg time:    41.274 ms
Measurement duration:    6013.2974 ms
Measurement invocations: 181
Measurement avg time:    33.186 ms
Benchmark 'Column_from_vector_1000000.Integers_type_Integer_64_bit' finished in 12108.774 ms
Benchmarking 'Column_from_vector_1000000.Integers_type_Integer_checked_16_bit' with configuration: [warmup={2 iterations, 3 seconds each}, measurement={2 iterations, 3 seconds each}]
Warmup duration:    6045.9561 ms
Warmup invocations: 99
Warmup avg time:    60.943 ms
Measurement duration:    6027.8112 ms
Measurement invocations: 185
Measurement avg time:    32.544 ms
Benchmark 'Column_from_vector_1000000.Integers_type_Integer_checked_16_bit' finished in 12076.45 ms
Benchmarking 'Column_from_vector_1000000.Integers_type_Float' with configuration: [warmup={2 iterations, 3 seconds each}, measurement={2 iterations, 3 seconds each}]
Warmup duration:    6011.7149 ms
Warmup invocations: 88
Warmup avg time:    68.246 ms
Measurement duration:    6010.6347 ms
Measurement invocations: 148
Measurement avg time:    40.593 ms
Benchmark 'Column_from_vector_1000000.Integers_type_Float' finished in 12025.853 ms
Benchmarking 'Column_from_vector_1000000.Integers_type_Auto' with configuration: [warmup={2 iterations, 3 seconds each}, measurement={2 iterations, 3 seconds each}]
Warmup duration:    6143.0352 ms
Warmup invocations: 35
Warmup avg time:    175.502 ms
Measurement duration:    6074.8422 ms
Measurement invocations: 58
Measurement avg time:    104.734 ms
Benchmark 'Column_from_vector_1000000.Integers_type_Auto' finished in 12219.448 ms

JaroslavTulach · 2023-08-29T06:22:08Z

> timings between 16-bit and 64-bit values are comparable

If the storage is long[] then there should be minimal difference. And there is. One would need IGV to find what is the cause.

> We can see that specifying the data-type can be as much as 3x faster than relying on Auto inference.

The difference here is so huge that using VisualVM Polyglot Sampler could highlight where the time is spent.

JaroslavTulach

@Akirathan will be happy to see the benchmark code.

jdunkerley

Looks good to me.

radeusgd self-assigned this Aug 22, 2023

radeusgd changed the base branch from develop to wip/radeusgd/5159-new-inmemory-value-types August 22, 2023 17:16

Base automatically changed from wip/radeusgd/5159-new-inmemory-value-types to develop August 22, 2023 18:10

radeusgd force-pushed the wip/radeusgd/6111-value-type-to-column-from_vector branch from 09469dd to 748e299 Compare August 23, 2023 11:25

radeusgd added 22 commits August 23, 2023 18:22

add value_type to Column.from_vector

022a7f5

add tests for specific value types

66262f6

change human oriented text representation of Char and Binary types

f029cad

update tests and docs

84d5b94

fixes

7e4da54

update docs, signatures and in-memory impl

6b6c3ee

WIP: tests for map, structure for zip

f06122a

enable old pending tests

e3780a5

try enabling another old test

60dcd19

WIP: map error tests

cfc2baa

zip tests

22334c9

fixes

b11a06b

fixes 2

8a7fa17

introduce ValueTypeMismatchException

04a4403

add Exception suffix to StorageTypeMismatch

18617b0

implement use_smallest for most_specific_value_type

e545cba

catch ValueTypeMismatchException in Column.from_vector

b9c0efc

fix

a3e667e

fixing typos in tests

751198b

handle Conversion_Failure in column creation

06db918

correcting small mistakes in tests - all new tests now passing

d318eb4

various fixes

b2f7c30

radeusgd force-pushed the wip/radeusgd/6111-value-type-to-column-from_vector branch from 748e299 to b2f7c30 Compare August 23, 2023 17:53

javafmt

eb15500

radeusgd marked this pull request as ready for review August 23, 2023 17:54

radeusgd requested a review from jdunkerley as a code owner August 23, 2023 17:54

radeusgd requested a review from GregoryTravis as a code owner August 23, 2023 17:54

GregoryTravis approved these changes Aug 23, 2023

radeusgd added 6 commits August 24, 2023 11:40

fix

716d642

CHANGELOG.md

3535831

amend semantics of iif and friends when encountering Mixed column

cdf1a9c

add one more test for Mixed fill_nothing

37ec3cb

Merge branch 'develop' into wip/radeusgd/6111-value-type-to-column-from_vector

acbed4a

Add benchmarks comparing performance of Column.from_vector with expected types

465781d

JaroslavTulach approved these changes Aug 29, 2023

enso-bot bot mentioned this pull request Aug 30, 2023

Investigate usage of Python libraries #7388

Closed

Merge branch 'develop' into wip/radeusgd/6111-value-type-to-column-from_vector

20a15c2

jdunkerley approved these changes Aug 31, 2023

radeusgd added the "CI: Ready to merge" and "CI: Clean build required" labels Aug 31, 2023

radeusgd added 2 commits August 31, 2023 10:48

retriggering CI due to https://discord.com/channels/401396655599124480/1146466401364738068/1146466401364738068

b3bcbfb

Merge branch 'develop' into wip/radeusgd/6111-value-type-to-column-from_vector

67a3dc3

mergify bot merged commit 255b424 into develop Aug 31, 2023
24 checks passed

mergify bot deleted the wip/radeusgd/6111-value-type-to-column-from_vector branch August 31, 2023 13:20
