
ARROW-62: Clarify null bitmap interpretation, indicate bit-endianness, add null count, remove non-nullable physical distinction #34

Closed
wants to merge 6 commits

Conversation

wesm
Member

@wesm wesm commented Mar 23, 2016

As the initial scribe for the Arrow format, I made a mistake in what the null bits mean (1 for not-null, 0 for null). I also addressed ARROW-56 (bit-numbering) here.

Database systems are split on this subject. PostgreSQL for example does it this way:

http://www.postgresql.org/docs/9.5/static/storage-page-layout.html

In this list of bits, a 1 bit indicates not-null, a 0 bit is a null. When the bitmap is not present, all columns are assumed not-null.
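The convention being adopted here (a set bit means not-null, with least-significant-bit numbering per ARROW-56) can be sketched as follows. This is a minimal illustration of the rule, not Arrow's actual implementation:

```python
def is_valid(bitmap: bytes, i: int) -> bool:
    # LSB bit-numbering: bit i lives in byte i // 8 at position i % 8.
    # A set bit (1) means the slot is not-null; 0 means null.
    return (bitmap[i // 8] >> (i % 8)) & 1 == 1

# 0b00000101 marks slots 0 and 2 as not-null and slot 1 as null.
bitmap = bytes([0b00000101])
```

When no bitmap is present, all slots are treated as not-null, matching the PostgreSQL behavior quoted above.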

Since the Drill implementation predates the Arrow project, I think it's safe to go with this.

This patch also includes ARROW-76 which adds a "null count" to the memory layout indicating the actual number of nulls in an array. This also strikes the "non-nullable" distinction from the memory layout as there is no semantic difference between arrays with null count 0 and a non-nullable array. Instead, users may choose to set nullable=false in the schema metadata and verify that Arrow memory conforms to the schema.
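The null count introduced by ARROW-76 is derivable from the bitmap; a hypothetical sketch of that relationship (assuming the 1 = not-null convention above, not Arrow's real API):

```python
def null_count(bitmap: bytes, length: int) -> int:
    # Count unset bits among the first `length` slots; each unset bit
    # is a null. Padding bits beyond `length` are ignored.
    valid = sum((bitmap[i // 8] >> (i % 8)) & 1 for i in range(length))
    return length - valid

# Slots 0, 1, 3 are not-null; slot 2 is null.
bitmap = bytes([0b00001011])
```

An array with `null_count == 0` is then semantically identical to a "non-nullable" array, which is why the physical distinction can be dropped.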

@@ -90,11 +90,27 @@ maximum of 2^31 - 1 elements. We choose a signed int32 for a couple reasons:
Any relative type can be nullable or non-nullable.

Nullable arrays have a contiguous memory buffer, known as the null bitmask,
Contributor
It might be better to frame this paragraph in terms of validity. Drill's documentation phrases it well, I think (it doesn't reference a null bitmask):

Nullable values are represented by a vector of bit values. Each bit in the vector corresponds to an element in the ValueVector. If the bit is not set, the value is NULL.

@emkornfield
Contributor

If you do decide to rephrase the bitmask as a validity vector, you should probably update the documentation in the Flatbuffers schema:

/// The number of buffers appended to this list depends on the schema. For
/// example, most primitive arrays will have 2 buffers, 1 for the null bitmap
/// and 1 for the values. For struct arrays, there will only be a single
/// buffer for the null bitmap

@wesm
Member Author

wesm commented Mar 23, 2016

I rephrased the language a little bit. I'll appeal to one of the other committers to review.

@wesm
Member Author

wesm commented Mar 23, 2016

Postgres uses the term "null bitmap"; if that seems reasonable, I will try to use it consistently in the code and format docs.

@emkornfield
Contributor

Thanks, I probably should have prefaced the above with IMHO. In my mind, I prefix the name of the bitmap with an "is_" and assume 1 means true.

Nullable arrays have a contiguous memory buffer, known as the null (or
validity) bitmap, whose length is large enough to have 1 bit for each array
slot.
Contributor

I would propose that the null bitmap always be a multiple of 8 bytes in length. This simplifies some code by avoiding having to manage partial-word conditions.

Member Author

I agree. There's also the SIMD question: with aligned allocations, if these buffers are word-aligned then there won't be concerns (someone with more expertise should opine).
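The padding proposal above can be made concrete: rounding the bitmap up to a multiple of 8 bytes (64 bits) means bitmap code never handles a partial trailing word. A small sketch of that size calculation (illustrative only; the helper name is made up):

```python
def padded_bitmap_bytes(length: int) -> int:
    # Round up to whole 64-bit words, then convert words to bytes,
    # so the bitmap is always a multiple of 8 bytes long.
    return ((length + 63) // 64) * 8
```

For example, a 1-slot array and a 64-slot array both get an 8-byte bitmap, while a 65-slot array gets 16 bytes.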

@wesm
Member Author

wesm commented Mar 23, 2016

@jacques-n I'm going to go ahead and expand the patch to account for ARROW-76, changes to be posted shortly. I'll await your +1 and further comments.

@wesm wesm changed the title ARROW-62: Clarify interpretation of set bits in null bitmaps, indicate bit-endianness ARROW-62: Clarify null bitmap interpretation, indicate bit-endianness, add null count, remove non-nullable physical distinction Mar 23, 2016
@wesm
Member Author

wesm commented Mar 25, 2016

Merging these format changes. Debate on these subjects may continue on the mailing list. Thank you

@asfgit asfgit closed this in c06b765 Mar 25, 2016
@wesm wesm deleted the ARROW-62 branch March 25, 2016 02:20
wesm added a commit to wesm/arrow that referenced this pull request Sep 2, 2018
Requires PARQUET-485 (apache#32)

The boolean Encoding::PLAIN code path was using RleDecoder, inconsistent with
other implementations of Parquet. This patch adds an implementation of plain
encoding and uses BitReader instead of RleDecoder to decode plain-encoded
boolean data. Unit tests to verify.

Also closes PR apache#12. Thanks to @edani for reporting.

Author: Wes McKinney <wes@cloudera.com>

Closes apache#34 from wesm/PARQUET-454 and squashes the following commits:

01cb5a7 [Wes McKinney] Use a seed in the data generation
0bf5d8a [Wes McKinney] Fix inconsistencies with boolean PLAIN encoding.
kou pushed a commit that referenced this pull request May 10, 2020
This PR enables tests for `ARROW_COMPUTE`, `ARROW_DATASET`, `ARROW_FILESYSTEM`, `ARROW_HDFS`, `ARROW_ORC`, and `ARROW_IPC` (default on). #7131 enabled a minimal set of tests as a starting point.

I confirmed that these tests pass locally with the current master. In the current TravisCI environment, we cannot see this result due to a lot of error messages in `arrow-utility-test`.

```
$ git log | head -1
commit ed5f534
% ctest
...
      Start  1: arrow-array-test
 1/51 Test  #1: arrow-array-test .....................   Passed    4.62 sec
      Start  2: arrow-buffer-test
 2/51 Test  #2: arrow-buffer-test ....................   Passed    0.14 sec
      Start  3: arrow-extension-type-test
 3/51 Test  #3: arrow-extension-type-test ............   Passed    0.12 sec
      Start  4: arrow-misc-test
 4/51 Test  #4: arrow-misc-test ......................   Passed    0.14 sec
      Start  5: arrow-public-api-test
 5/51 Test  #5: arrow-public-api-test ................   Passed    0.12 sec
      Start  6: arrow-scalar-test
 6/51 Test  #6: arrow-scalar-test ....................   Passed    0.13 sec
      Start  7: arrow-type-test
 7/51 Test  #7: arrow-type-test ......................   Passed    0.14 sec
      Start  8: arrow-table-test
 8/51 Test  #8: arrow-table-test .....................   Passed    0.13 sec
      Start  9: arrow-tensor-test
 9/51 Test  #9: arrow-tensor-test ....................   Passed    0.13 sec
      Start 10: arrow-sparse-tensor-test
10/51 Test #10: arrow-sparse-tensor-test .............   Passed    0.16 sec
      Start 11: arrow-stl-test
11/51 Test #11: arrow-stl-test .......................   Passed    0.12 sec
      Start 12: arrow-concatenate-test
12/51 Test #12: arrow-concatenate-test ...............   Passed    0.53 sec
      Start 13: arrow-diff-test
13/51 Test #13: arrow-diff-test ......................   Passed    1.45 sec
      Start 14: arrow-c-bridge-test
14/51 Test #14: arrow-c-bridge-test ..................   Passed    0.18 sec
      Start 15: arrow-io-buffered-test
15/51 Test #15: arrow-io-buffered-test ...............   Passed    0.20 sec
      Start 16: arrow-io-compressed-test
16/51 Test #16: arrow-io-compressed-test .............   Passed    3.48 sec
      Start 17: arrow-io-file-test
17/51 Test #17: arrow-io-file-test ...................   Passed    0.74 sec
      Start 18: arrow-io-hdfs-test
18/51 Test #18: arrow-io-hdfs-test ...................   Passed    0.12 sec
      Start 19: arrow-io-memory-test
19/51 Test #19: arrow-io-memory-test .................   Passed    2.77 sec
      Start 20: arrow-utility-test
20/51 Test #20: arrow-utility-test ...................***Failed    5.65 sec
      Start 21: arrow-threading-utility-test
21/51 Test #21: arrow-threading-utility-test .........   Passed    1.34 sec
      Start 22: arrow-compute-compute-test
22/51 Test #22: arrow-compute-compute-test ...........   Passed    0.13 sec
      Start 23: arrow-compute-boolean-test
23/51 Test #23: arrow-compute-boolean-test ...........   Passed    0.15 sec
      Start 24: arrow-compute-cast-test
24/51 Test #24: arrow-compute-cast-test ..............   Passed    0.22 sec
      Start 25: arrow-compute-hash-test
25/51 Test #25: arrow-compute-hash-test ..............   Passed    2.61 sec
      Start 26: arrow-compute-isin-test
26/51 Test #26: arrow-compute-isin-test ..............   Passed    0.81 sec
      Start 27: arrow-compute-match-test
27/51 Test #27: arrow-compute-match-test .............   Passed    0.40 sec
      Start 28: arrow-compute-sort-to-indices-test
28/51 Test #28: arrow-compute-sort-to-indices-test ...   Passed    3.33 sec
      Start 29: arrow-compute-nth-to-indices-test
29/51 Test #29: arrow-compute-nth-to-indices-test ....   Passed    1.51 sec
      Start 30: arrow-compute-util-internal-test
30/51 Test #30: arrow-compute-util-internal-test .....   Passed    0.13 sec
      Start 31: arrow-compute-add-test
31/51 Test #31: arrow-compute-add-test ...............   Passed    0.12 sec
      Start 32: arrow-compute-aggregate-test
32/51 Test #32: arrow-compute-aggregate-test .........   Passed   14.70 sec
      Start 33: arrow-compute-compare-test
33/51 Test #33: arrow-compute-compare-test ...........   Passed    7.96 sec
      Start 34: arrow-compute-take-test
34/51 Test #34: arrow-compute-take-test ..............   Passed    4.80 sec
      Start 35: arrow-compute-filter-test
35/51 Test #35: arrow-compute-filter-test ............   Passed    8.23 sec
      Start 36: arrow-dataset-dataset-test
36/51 Test #36: arrow-dataset-dataset-test ...........   Passed    0.25 sec
      Start 37: arrow-dataset-discovery-test
37/51 Test #37: arrow-dataset-discovery-test .........   Passed    0.13 sec
      Start 38: arrow-dataset-file-ipc-test
38/51 Test #38: arrow-dataset-file-ipc-test ..........   Passed    0.21 sec
      Start 39: arrow-dataset-file-test
39/51 Test #39: arrow-dataset-file-test ..............   Passed    0.12 sec
      Start 40: arrow-dataset-filter-test
40/51 Test #40: arrow-dataset-filter-test ............   Passed    0.16 sec
      Start 41: arrow-dataset-partition-test
41/51 Test #41: arrow-dataset-partition-test .........   Passed    0.13 sec
      Start 42: arrow-dataset-scanner-test
42/51 Test #42: arrow-dataset-scanner-test ...........   Passed    0.20 sec
      Start 43: arrow-filesystem-test
43/51 Test #43: arrow-filesystem-test ................   Passed    1.62 sec
      Start 44: arrow-hdfs-test
44/51 Test #44: arrow-hdfs-test ......................   Passed    0.13 sec
      Start 45: arrow-feather-test
45/51 Test #45: arrow-feather-test ...................   Passed    0.91 sec
      Start 46: arrow-ipc-read-write-test
46/51 Test #46: arrow-ipc-read-write-test ............   Passed    5.77 sec
      Start 47: arrow-ipc-json-simple-test
47/51 Test #47: arrow-ipc-json-simple-test ...........   Passed    0.16 sec
      Start 48: arrow-ipc-json-test
48/51 Test #48: arrow-ipc-json-test ..................   Passed    0.27 sec
      Start 49: arrow-json-integration-test
49/51 Test #49: arrow-json-integration-test ..........   Passed    0.13 sec
      Start 50: arrow-json-test
50/51 Test #50: arrow-json-test ......................   Passed    0.26 sec
      Start 51: arrow-orc-adapter-test
51/51 Test #51: arrow-orc-adapter-test ...............   Passed    1.92 sec

98% tests passed, 1 tests failed out of 51

Label Time Summary:
arrow-tests      =  27.38 sec (27 tests)
arrow_compute    =  45.11 sec (14 tests)
arrow_dataset    =   1.21 sec (7 tests)
arrow_ipc        =   6.20 sec (3 tests)
unittest         =  79.91 sec (51 tests)

Total Test time (real) =  79.99 sec

The following tests FAILED:
	 20 - arrow-utility-test (Failed)
Errors while running CTest
```

Closes #7142 from kiszk/ARROW-8754

Authored-by: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
zhztheplayer added a commit to zhztheplayer/arrow-1 that referenced this pull request Oct 8, 2021
…he same time (apache#34)

* Revert "Add AutoBufferLedger (apache#31)"

This reverts commit e48da37.

* Commit 1

* Commit 2

* Fix config builder visibility in Scala

* Commit 2 Fixup

* Commit 3

* Commit 3 Fixup