ARROW-638: [C++] Complex Number Support via ExtensionTypes #10565

Closed

Conversation

@sjperkins (Contributor) commented Jun 21, 2021

xref https://issues.apache.org/jira/browse/ARROW-638

#10452 contains the earlier PR implementing Complex Numbers as first-class types.

sjperkins and others added 19 commits June 21, 2021 15:33
Pass ubuntu version as a docker build variable instead of a container runtime environment variable.

Closes apache#10532 from kszucs/post-docs-ubuntu-version

Authored-by: Krisztián Szűcs <szucs.krisztian@gmail.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
…rray

Closes apache#10199 from amol-/ARROW-12431

Authored-by: Alessandro Molina <amol@turbogears.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Closes apache#10489 from cyb70289/13003-unaligned-access

Authored-by: Yibo Cai <yibo.cai@arm.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
…reader

This adds an OptionalParallelForAsync which lets us have per-row-group parallelism without nested parallelism in the async Parquet reader. This also uses TransferAlways, taking care of ARROW-12916. `enable_parallel_column_conversion` is kept as it still affects the threaded scanner.

Closes apache#10482 from lidavidm/arrow-12597

Authored-by: David Li <li.davidm96@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
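(Editorial illustration: a minimal Python sketch of the "optional parallel for" idea described in this commit; the real `OptionalParallelForAsync` is C++ and asynchronous, so the names and structure below are assumptions for illustration only.)

```python
from concurrent.futures import ThreadPoolExecutor

def optional_parallel_for(use_threads, tasks, max_workers=4):
    # Run per-row-group tasks on a thread pool when parallelism is enabled,
    # and inline otherwise, so one pool is never nested inside another.
    if use_threads:
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            return list(pool.map(lambda task: task(), tasks))
    return [task() for task in tasks]

# Usage: eight independent "row group" tasks, run in parallel.
results = optional_parallel_for(True, [lambda i=i: i * i for i in range(8)])
assert results == [i * i for i in range(8)]
```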
…rquet INT96 timestamp

Adds functionality in the C++ code to allow users to define the Arrow timestamp unit when reading Parquet INT96 types. This avoids the overflow bug when converting INT96 values whose dates are out of bounds for Arrow's nanosecond timestamp.

See added test: `TestArrowReadWrite.DownsampleDeprecatedInt96` which demonstrates use and expected results.

Main discussion of changes in [JIRA Issue ARROW-12096](https://issues.apache.org/jira/browse/ARROW-12096).

Closes apache#10461 from isichei/ARROW-12096

Lead-authored-by: Karik Isichei <karik.isichei@gmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
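(Editorial illustration: the option described above is exposed in Python in later pyarrow releases as `coerce_int96_timestamp_unit`; the file name below is hypothetical and the snippet is a sketch, not part of this commit.)

```python
import pyarrow.parquet as pq

# Read legacy INT96 timestamps as milliseconds instead of nanoseconds,
# avoiding overflow for dates outside Arrow's nanosecond timestamp range.
table = pq.read_table("legacy_int96.parquet",
                      coerce_int96_timestamp_unit="ms")
```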
Before change:

```
Direct leak of 65536 byte(s) in 1 object(s) allocated from:
    #0 0x522f09 in
    #1 0x7f28ae5826f4 in
    #2 0x7f28ae57fa5d in
    #3 0x7f28ae58cb0f in
    #4 0x7f28ae58bda0 in
    ...
```

After change:
```
Direct leak of 65536 byte(s) in 1 object(s) allocated from:
    #0 0x522f09 in posix_memalign (/build/cpp/debug/arrow-dataset-file-csv-test+0x522f09)
    #1 0x7f28ae5826f4 in arrow::(anonymous namespace)::SystemAllocator::AllocateAligned(long, unsigned char**) /arrow/cpp/src/arrow/memory_pool.cc:213:24
    #2 0x7f28ae57fa5d in arrow::BaseMemoryPoolImpl<arrow::(anonymous namespace)::SystemAllocator>::Allocate(long, unsigned char**) /arrow/cpp/src/arrow/memory_pool.cc:405:5
    #3 0x7f28ae58cb0f in arrow::PoolBuffer::Reserve(long) /arrow/cpp/src/arrow/memory_pool.cc:717:9
    #4 0x7f28ae58bda0 in arrow::PoolBuffer::Resize(long, bool) /arrow/cpp/src/arrow/memory_pool.cc:741:7
    ...
```

Closes apache#10498 from westonpace/feature/ARROW-13027--c-fix-asan-stack-traces-in-ci

Authored-by: Weston Pace <weston.pace@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
This is a documentation-only PR that adds an additional note for users compiling C++ extensions using the shared libraries bundled with the Python package. Adding this note on the toolchain will help resolve otherwise confusing segfaults.

Before (toolchain) change:

- segfault when running the minimal cpp example

After (toolchain) change:

- no segfault when running the minimal cpp example

Please see the linked JIRA for more details.

Closes apache#10535 from kratsg/docs/pythonBindingExtensions

Lead-authored-by: Giordon Stark <kratsg@gmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
`R=4.1 archery docker run ubuntu-docs`

Closes apache#10534 from kszucs/forward-r-arg-to-docs-build

Authored-by: Krisztián Szűcs <szucs.krisztian@gmail.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
…benchmark_filter'

```
$ archery benchmark list
Traceback (most recent call last):
  File "/Users/diana/envs/arrow/bin/archery", line 33, in <module>
    sys.exit(load_entry_point('archery', 'console_scripts', 'archery')())
  File "/Users/diana/envs/arrow/lib/python3.9/site-packages/click/core.py", line 1137, in __call__
    return self.main(*args, **kwargs)
  File "/Users/diana/envs/arrow/lib/python3.9/site-packages/click/core.py", line 1062, in main
    rv = self.invoke(ctx)
  File "/Users/diana/envs/arrow/lib/python3.9/site-packages/click/core.py", line 1668, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/diana/envs/arrow/lib/python3.9/site-packages/click/core.py", line 1668, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/diana/envs/arrow/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/diana/envs/arrow/lib/python3.9/site-packages/click/core.py", line 763, in invoke
    return __callback(*args, **kwargs)
  File "/Users/diana/envs/arrow/lib/python3.9/site-packages/click/decorators.py", line 26, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/Users/diana/workspace/arrow/dev/archery/archery/cli.py", line 430, in benchmark_list
    conf = CppBenchmarkRunner.default_configuration(
  File "/Users/diana/workspace/arrow/dev/archery/archery/benchmark/runner.py", line 118, in default_configuration
    return CppConfiguration(
TypeError: __init__() got an unexpected keyword argument 'benchmark_filter'
```

Closes apache#10528 from dianaclarke/ARROW-13073

Authored-by: Diana Clarke <diana.joan.clarke@gmail.com>
Signed-off-by: Yibo Cai <yibo.cai@arm.com>
Closes apache#10533 from kou/glib-dataset-factory

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
See JIRA

Closes apache#10512 from westonpace/feature/ARROW-13036--doc-mention-recommended-file-extension-s-for-ar

Authored-by: Weston Pace <weston.pace@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Closes apache#10505 from n3world/ARROW-12995-Validate_csv_opts

Authored-by: Nate Clark <nate@neworld.us>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Recent fsspec versions have started raising FileExistsError if the target directory already exists. Ignore the error, as create_dir() is supposed to succeed in that case.

Closes apache#10540 from pitrou/ARROW-13090-fsspec-create-dir

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
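(Editorial illustration: a minimal sketch of the behaviour described above, written against fsspec's public `makedirs`/`mkdir` API; the actual patch lives in pyarrow's fsspec wrapper and may differ.)

```python
def create_dir(fs, path, recursive=True):
    # fs is any fsspec filesystem; ignore FileExistsError because
    # create_dir() is supposed to succeed if the directory already exists.
    try:
        if recursive:
            fs.makedirs(path)
        else:
            fs.mkdir(path)
    except FileExistsError:
        pass
```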
@pitrou (Member) commented Jul 23, 2021

Perhaps @jorisvandenbossche can answer the previous message, when he's back from vacation :-)

@jorisvandenbossche (Member) commented Aug 2, 2021

> failing because there's no `__arrow_ext_class__` attribute, which seems to be required by the Python layer

That might have been an oversight in the PR that introduced it (I don't think we currently have python tests using an extension type that is purely implemented in C++, and not using the (Py)ExtensionType subclassing mechanisms, although we should probably have such tests).
Can you check whether moving the `__arrow_ext_class__` base definition at

```python
def __arrow_ext_class__(self):
    """Return an extension array class to be used for building or
    deserializing arrays with this extension type.

    This method should return a subclass of the ExtensionArray class. By
    default, if not specialized in the extension implementation, an
    extension type array will be a built-in ExtensionArray instance.
    """
    return ExtensionArray
```

from the ExtensionType class to its parent class BaseExtensionType fixes the problem?
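(Editorial illustration: a plain-Python sketch of the suggested move; the real definitions are Cython classes in pyarrow's `types.pxi`, and the stub classes below are placeholders, not pyarrow's actual implementation.)

```python
class DataType:            # stand-in for pyarrow's DataType
    pass

class ExtensionArray:      # stand-in for pyarrow's ExtensionArray
    pass

class BaseExtensionType(DataType):
    def __arrow_ext_class__(self):
        # Moving the default here gives every extension type the attribute,
        # including types implemented purely in C++.
        return ExtensionArray

class ExtensionType(BaseExtensionType):
    # Inherits the default; Python-defined subclasses may still override it.
    pass
```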

@jmdeschenes commented Dec 21, 2021

Hello,

There is an issue with the approach:

In `array.pxi`:

```cython
cdef class ExtensionArray(Array):
    """
    Concrete class for Arrow extension arrays.
    """

    @property
    def storage(self):
        cdef:
            CExtensionArray* ext_array = <CExtensionArray*>(self.ap)

        return pyarrow_wrap_array(ext_array.storage())

    # ... lines skipped ...

    def to_numpy(self, **kwargs):
        """
        Convert extension array to a numpy ndarray.

        See Also
        --------
        Array.to_numpy
        """
        return self.storage.to_numpy(**kwargs)
```

In `table.pxi`:

```cython
    def to_numpy(self):
        """
        Return a NumPy copy of this array (experimental).

        Returns
        -------
        array : numpy.ndarray
        """
        cdef:
            PyObject* out
            PandasOptions c_options
            object values

        if self.type.id == _Type_EXTENSION:
            storage_array = chunked_array(
                [chunk.storage for chunk in self.iterchunks()],
                type=self.type.storage_type
        # ... snippet truncated ...
```
Both of these "strip" the Extension type sent to the CPP code. As such, the CPP code never knows that it is dealing with an extension.

If this is to be kept, `fixed_size_list` would need to convert into a proper 2D numpy array (that could have several benefits; it could be done only for primitive types at the start).

@jorisvandenbossche Do you think that is something that could be acceptable?

Otherwise, letting the CPP code handle the extension type could be another option.

@sjperkins Are you still working on this PR? Is there something I can help you with?

@sjperkins (Contributor, Author)

Thanks for checking in @jmdeschenes. I'm still working on it, but I haven't been able to give it attention recently. IIRC there are some issues I need to work through.

  1. Most standard NumPy data types are first-class Arrow types, and the existing code handles conversion between them. Complex-number Arrow types are Extension Types, so their handling is a bit trickier; I haven't yet found a clean solution with the existing code or with newer code I've written (see the sketch after this list).
  2. I was very unsure about exposing C++ Extension Types in Python, mostly because working in the CPython layer is slower and more opaque. It seems you've pointed out the reverse issue: Python Extension Types are stripped out during conversion to the C++ layer.
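(Editorial illustration of point 1: primitive NumPy dtypes convert directly because they map to first-class Arrow types, while complex dtypes raise, as observed with recent pyarrow releases; the exact error type is an assumption.)

```python
import numpy as np
import pyarrow as pa

pa.array(np.array([1.0, 2.0]))        # float64 maps to first-class pa.float64()

try:
    pa.array(np.array([1.0 + 2.0j]))  # complex128 has no first-class Arrow type
except Exception as exc:              # pyarrow raises ArrowNotImplementedError here
    print(type(exc).__name__, exc)
```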

Any advice/help on the above 2 issues would be welcome. Note that I will likely only get another chance to look at this next year.

@jmdeschenes

Thanks for the update.

With some changes, I was able to get your branch to work and output a complex array. It is still preliminary. Let me know if you want to see it.

I think the best way forward is to let the CPP handle the extension types.

  1. Don't strip the Extension information on the Cython side.
  2. CPP could handle future extension types, each with its own potential Writer. Instead of stripping the Extension information on the Cython side, let the CPP handle it (e.g. if the extension is known, use its Writer; if it isn't, use the storage). This will require some rewriting on the CPP side, as the extension type is used in the conversion code, but I don't understand when it applies. To me this looks like the least intrusive option.

`fixed_size_list` arrays generate a numpy array of pyarrow arrays. If `fixed_size_list` were to yield a 2D numpy array (which is directly mappable), there would be little need for the extension side to be handled at all. I guess the current behaviour was chosen to ensure that the columns could be read into pandas. When calling `to_numpy` I would expect to get a 2D array.
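(Editorial illustration: a user-code sketch of the 2D conversion suggested above; `to_2d` is a hypothetical helper and assumes a primitive value type with no nulls.)

```python
import numpy as np
import pyarrow as pa

def to_2d(arr: pa.FixedSizeListArray) -> np.ndarray:
    width = arr.type.list_size            # fixed width of each list
    flat = arr.flatten().to_numpy(zero_copy_only=False)
    return flat.reshape(-1, width)        # one row per list element

fsl = pa.FixedSizeListArray.from_arrays(pa.array([1.0, 2.0, 3.0, 4.0]), 2)
print(to_2d(fsl))  # [[1. 2.] [3. 4.]]
```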

As a separate issue: when trying to handle complex arrays from the Python side using only extension arrays, it mostly works, but the extension arrays don't play nicely with ChunkedArrays:

```python
import pyarrow as pa
import pyarrow.parquet as pq
import numpy as np

class ComplexArray(pa.ExtensionArray):
    def to_numpy(self):
        array = self.storage.flatten().to_numpy(zero_copy_only=True).view('complex128')
        return array

class ComplexType(pa.ExtensionType):
    def __init__(self):
        pa.ExtensionType.__init__(self, pa.list_(pa.float64(), 2), "arrow.complex128")

    def __arrow_ext_serialize__(self):
        # since we don't have a parameterized type, we don't need extra
        # metadata to be deserialized
        return b''

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        # return an instance of this subclass given the serialized
        # metadata.
        return ComplexType()

    def __arrow_ext_class__(self):
        return ComplexArray

complex128 = ComplexType()
pa.register_extension_type(complex128)

array = np.array([1.0+1j, 2.0+2.0j, 3.0+3.0j]).view('float64')
float_array = pa.array(array, type=pa.float64())
list_array = pa.FixedSizeListArray.from_arrays(float_array, 2)
pa_array = ComplexArray.from_storage(complex128, list_array)
# This will yield the correct Numpy Array
assert (pa_array.to_numpy() == array).all()
```

It works, but we get into issues when writing/reading Parquet:

```python
table = pa.Table.from_arrays([pa_array], names=['cpl'])
pq.write_table(table, "complex_test.parquet")
cpl_df = pq.read_table("complex_test.parquet")
# This line will create a numpy array of pyarrow arrays
cpl_df['cpl'].to_numpy()
# If you have a single chunk, this will work
assert (cpl_df['cpl'].chunks[0].to_numpy() == array).all()

# Otherwise the solution is more involved
start_length = 0
end_length = 0
max_length = cpl_df['cpl'].length()
final_numpy = np.zeros(shape=(max_length), dtype=np.complex128)
for chunk in cpl_df['cpl'].iterchunks():
    temp_numpy = chunk.to_numpy()
    end_length += temp_numpy.shape[0]
    final_numpy[start_length:end_length] = temp_numpy
    start_length = end_length
assert (final_numpy == array).all()
```
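(Editorial note: a more compact variant of the chunk loop above, assuming every chunk converts cleanly via `to_numpy`.)

```python
final_numpy = np.concatenate(
    [chunk.to_numpy() for chunk in cpl_df['cpl'].iterchunks()]
)
assert (final_numpy == array).all()
```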

@sjperkins (Contributor, Author)

> With some changes, I was able to get your branch to work and output a complex array. It is still preliminary. Let me know if you want to see it.

Thanks, yes I would like to take a look.

> I think the best way forward is to let the CPP handle the extension types.

> 1. Don't strip the Extension information on the Cython side.
>
> 2. CPP could handle future extension types, each with its own potential `Writer`. Instead of stripping the Extension information on the Cython side, let the CPP handle it (e.g. if the extension is known, use its Writer; if it isn't, use the storage). This will require some rewriting on the CPP side, as the extension type is used in the conversion code, but I don't understand when it applies. To me this looks like the least intrusive option.

Yes, I agree that this is the way forward. Mirroring what you've described back to you, this will necessitate API changes to the C++ Extension Framework to support Custom Writers?

> `fixed_size_list` arrays generate a numpy array of pyarrow arrays. If `fixed_size_list` were to yield a 2D numpy array (which is directly mappable), there would be little need for the extension side to be handled at all. I guess the current behaviour was chosen to ensure that the columns could be read into pandas. When calling `to_numpy` I would expect to get a 2D array.
>
> As a separate issue: when trying to handle complex arrays from the Python side using only extension arrays, it mostly works, but the extension arrays don't play nicely with ChunkedArrays:

> [code examples quoted from the previous comment]

It does work quite nicely in Python. In fact, I implemented something similar in a repository I'm working on: https://github.com/ska-sa/dask-ms/blob/55c987e5f00c24a82c16363f681a966e45590e06/daskms/experimental/arrow/extension_types.py#L145-L148. IIRC I ran into similar issues with chunked Extension Arrays.

Do you have an idea as to how this should be handled in the C++ layer?

@sjperkins (Contributor, Author)

> With some changes, I was able to get your branch to work and output a complex array. It is still preliminary. Let me know if you want to see it.

Hi @jmdeschenes, I'll be devoting some time to this PR over the next few weeks. Would it be possible to see your branch?

@pitrou (Member) commented Oct 5, 2022

For the record, we now have an official procedure for standardizing extension types. If you would like to resurrect this, you should post a proposal (or start a discussion about such a proposal) on the mailing list.

@amol- (Member) commented Mar 30, 2023

Closing because it has been untouched for a while; in case it's still relevant, feel free to reopen and move it forward 👍

@amol- closed this Mar 30, 2023
@sjperkins (Contributor, Author)

This PR is very out of date; a new PR supporting this functionality will be required.
