ARROW-638: [C++] Complex Number Support via ExtensionTypes #10565

Closed

Conversation

@sjperkins (Contributor) commented Jun 21, 2021

xref https://issues.apache.org/jira/browse/ARROW-638

#10452 contains the earlier PR implementing Complex Numbers as first-class types.

sjperkins and others added 19 commits June 21, 2021 15:33
Pass ubuntu version as a docker build variable instead of a container runtime environment variable.

Closes apache#10532 from kszucs/post-docs-ubuntu-version

Authored-by: Krisztián Szűcs <szucs.krisztian@gmail.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
…rray

Closes apache#10199 from amol-/ARROW-12431

Authored-by: Alessandro Molina <amol@turbogears.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Closes apache#10489 from cyb70289/13003-unaligned-access

Authored-by: Yibo Cai <yibo.cai@arm.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
…reader

This adds an OptionalParallelForAsync which lets us have per-row-group parallelism without nested parallelism in the async Parquet reader. This also uses TransferAlways, taking care of ARROW-12916. `enable_parallel_column_conversion` is kept as it still affects the threaded scanner.

Closes apache#10482 from lidavidm/arrow-12597

Authored-by: David Li <li.davidm96@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
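(Editorial illustration: a minimal Python sketch of the "optional parallel for" idea described in this commit; the real `OptionalParallelForAsync` is C++ and asynchronous, so the names and structure below are assumptions for illustration only.)

```python
from concurrent.futures import ThreadPoolExecutor

def optional_parallel_for(use_threads, tasks, max_workers=4):
    # Run per-row-group tasks on a thread pool when parallelism is enabled,
    # and inline otherwise, so one pool is never nested inside another.
    if use_threads:
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            return list(pool.map(lambda task: task(), tasks))
    return [task() for task in tasks]

# Usage: eight independent "row group" tasks, run in parallel.
results = optional_parallel_for(True, [lambda i=i: i * i for i in range(8)])
assert results == [i * i for i in range(8)]
```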
…rquet INT96 timestamp

Adds functionality in the C++ code to allow users to define the Arrow timestamp unit when reading Parquet INT96 types. This avoids the overflow bug when converting INT96 values whose dates are out of bounds for Arrow's nanosecond timestamp.

See added test: `TestArrowReadWrite.DownsampleDeprecatedInt96` which demonstrates use and expected results.

Main discussion of changes in [JIRA Issue ARROW-12096](https://issues.apache.org/jira/browse/ARROW-12096).

Closes apache#10461 from isichei/ARROW-12096

Lead-authored-by: Karik Isichei <karik.isichei@gmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
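(Editorial illustration: the option described above is exposed in Python in later pyarrow releases as `coerce_int96_timestamp_unit`; the file name below is hypothetical and the snippet is a sketch, not part of this commit.)

```python
import pyarrow.parquet as pq

# Read legacy INT96 timestamps as milliseconds instead of nanoseconds,
# avoiding overflow for dates outside Arrow's nanosecond timestamp range.
table = pq.read_table("legacy_int96.parquet",
                      coerce_int96_timestamp_unit="ms")
```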
Before change:

```
Direct leak of 65536 byte(s) in 1 object(s) allocated from:
    #0 0x522f09 in
    #1 0x7f28ae5826f4 in
    #2 0x7f28ae57fa5d in
    #3 0x7f28ae58cb0f in
    #4 0x7f28ae58bda0 in
    ...
```

After change:
```
Direct leak of 65536 byte(s) in 1 object(s) allocated from:
    #0 0x522f09 in posix_memalign (/build/cpp/debug/arrow-dataset-file-csv-test+0x522f09)
    #1 0x7f28ae5826f4 in arrow::(anonymous namespace)::SystemAllocator::AllocateAligned(long, unsigned char**) /arrow/cpp/src/arrow/memory_pool.cc:213:24
    #2 0x7f28ae57fa5d in arrow::BaseMemoryPoolImpl<arrow::(anonymous namespace)::SystemAllocator>::Allocate(long, unsigned char**) /arrow/cpp/src/arrow/memory_pool.cc:405:5
    #3 0x7f28ae58cb0f in arrow::PoolBuffer::Reserve(long) /arrow/cpp/src/arrow/memory_pool.cc:717:9
    #4 0x7f28ae58bda0 in arrow::PoolBuffer::Resize(long, bool) /arrow/cpp/src/arrow/memory_pool.cc:741:7
    ...
```

Closes apache#10498 from westonpace/feature/ARROW-13027--c-fix-asan-stack-traces-in-ci

Authored-by: Weston Pace <weston.pace@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
This is a documentation-only PR that adds an additional note for users compiling C++ extensions using the shared libraries bundled with the Python package. Adding this note on the toolchain will help resolve otherwise confusing segfaults.

Before (toolchain) change:

- segfault when running the minimal cpp example

After (toolchain) change:

- no segfault when running the minimal cpp example

Please see the linked JIRA for more details.

Closes apache#10535 from kratsg/docs/pythonBindingExtensions

Lead-authored-by: Giordon Stark <kratsg@gmail.com>
Co-authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Antoine Pitrou <antoine@python.org>
`R=4.1 archery docker run ubuntu-docs`

Closes apache#10534 from kszucs/forward-r-arg-to-docs-build

Authored-by: Krisztián Szűcs <szucs.krisztian@gmail.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
…benchmark_filter'

```
$ archery benchmark list
Traceback (most recent call last):
  File "/Users/diana/envs/arrow/bin/archery", line 33, in <module>
    sys.exit(load_entry_point('archery', 'console_scripts', 'archery')())
  File "/Users/diana/envs/arrow/lib/python3.9/site-packages/click/core.py", line 1137, in __call__
    return self.main(*args, **kwargs)
  File "/Users/diana/envs/arrow/lib/python3.9/site-packages/click/core.py", line 1062, in main
    rv = self.invoke(ctx)
  File "/Users/diana/envs/arrow/lib/python3.9/site-packages/click/core.py", line 1668, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/diana/envs/arrow/lib/python3.9/site-packages/click/core.py", line 1668, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/diana/envs/arrow/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/diana/envs/arrow/lib/python3.9/site-packages/click/core.py", line 763, in invoke
    return __callback(*args, **kwargs)
  File "/Users/diana/envs/arrow/lib/python3.9/site-packages/click/decorators.py", line 26, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/Users/diana/workspace/arrow/dev/archery/archery/cli.py", line 430, in benchmark_list
    conf = CppBenchmarkRunner.default_configuration(
  File "/Users/diana/workspace/arrow/dev/archery/archery/benchmark/runner.py", line 118, in default_configuration
    return CppConfiguration(
TypeError: __init__() got an unexpected keyword argument 'benchmark_filter'
```

Closes apache#10528 from dianaclarke/ARROW-13073

Authored-by: Diana Clarke <diana.joan.clarke@gmail.com>
Signed-off-by: Yibo Cai <yibo.cai@arm.com>
Closes apache#10533 from kou/glib-dataset-factory

Authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Sutou Kouhei <kou@clear-code.com>
See JIRA

Closes apache#10512 from westonpace/feature/ARROW-13036--doc-mention-recommended-file-extension-s-for-ar

Authored-by: Weston Pace <weston.pace@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Closes apache#10505 from n3world/ARROW-12995-Validate_csv_opts

Authored-by: Nate Clark <nate@neworld.us>
Signed-off-by: Antoine Pitrou <antoine@python.org>
Recent fsspec versions have started raising FileExistsError if the target directory already exists. Ignore the error, as create_dir() is supposed to succeed in that case.

Closes apache#10540 from pitrou/ARROW-13090-fsspec-create-dir

Authored-by: Antoine Pitrou <antoine@python.org>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
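(Editorial illustration: a minimal sketch of the behaviour described above, written against fsspec's public `makedirs`/`mkdir` API; the actual patch lives in pyarrow's fsspec wrapper and may differ.)

```python
def create_dir(fs, path, recursive=True):
    # fs is any fsspec filesystem; ignore FileExistsError because
    # create_dir() is supposed to succeed if the directory already exists.
    try:
        if recursive:
            fs.makedirs(path)
        else:
            fs.mkdir(path)
    except FileExistsError:
        pass
```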
@pitrou (Member) commented Jul 23, 2021

Perhaps @jorisvandenbossche can answer the previous message, when he's back from vacation :-)

@jorisvandenbossche (Member) commented Aug 2, 2021

> failing because there's no `__arrow_ext_class__` attribute, which seems to be required by the Python layer

That might have been an oversight in the PR that introduced it (I don't think we currently have python tests using an extension type that is purely implemented in C++, and not using the (Py)ExtensionType subclassing mechanisms, although we should probably have such tests).
Can you check whether moving the `__arrow_ext_class__` base definition at

```python
def __arrow_ext_class__(self):
    """Return an extension array class to be used for building or
    deserializing arrays with this extension type.

    This method should return a subclass of the ExtensionArray class. By
    default, if not specialized in the extension implementation, an
    extension type array will be a built-in ExtensionArray instance.
    """
    return ExtensionArray
```

from the ExtensionType class to its parent class BaseExtensionType fixes the problem?
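(Editorial illustration: a plain-Python sketch of the suggested move; the real definitions are Cython classes in pyarrow's `types.pxi`, and the stub classes below are placeholders, not pyarrow's actual implementation.)

```python
class DataType:            # stand-in for pyarrow's DataType
    pass

class ExtensionArray:      # stand-in for pyarrow's ExtensionArray
    pass

class BaseExtensionType(DataType):
    def __arrow_ext_class__(self):
        # Moving the default here gives every extension type the attribute,
        # including types implemented purely in C++.
        return ExtensionArray

class ExtensionType(BaseExtensionType):
    # Inherits the default; Python-defined subclasses may still override it.
    pass
```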

@jmdeschenes commented Dec 21, 2021

Hello,

There is an issue with the approach:

In `array.pxi`:

```cython
cdef class ExtensionArray(Array):
    """
    Concrete class for Arrow extension arrays.
    """

    @property
    def storage(self):
        cdef:
            CExtensionArray* ext_array = <CExtensionArray*>(self.ap)

        return pyarrow_wrap_array(ext_array.storage())

    # ... lines skipped ...

    def to_numpy(self, **kwargs):
        """
        Convert extension array to a numpy ndarray.

        See Also
        --------
        Array.to_numpy
        """
        return self.storage.to_numpy(**kwargs)
```

In `table.pxi`:

```cython
    def to_numpy(self):
        """
        Return a NumPy copy of this array (experimental).

        Returns
        -------
        array : numpy.ndarray
        """
        cdef:
            PyObject* out
            PandasOptions c_options
            object values

        if self.type.id == _Type_EXTENSION:
            storage_array = chunked_array(
                [chunk.storage for chunk in self.iterchunks()],
                type=self.type.storage_type
        # ... snippet truncated ...
```
Both of these "strip" the Extension type sent to the CPP code. As such, the CPP code never knows that it is dealing with an extension.

If this is to be kept, `fixed_size_list` would need to convert into a proper 2D numpy array (that could have several benefits; it could be done only for primitive types at the start).

@jorisvandenbossche Do you think that is something that could be acceptable?

Otherwise, letting the CPP code handle the extension type could be another option.

@sjperkins Are you still working on this PR? Is there something I can help you with?

@sjperkins (Contributor, Author)

Thanks for checking in @jmdeschenes. I'm still working on it, but I haven't been able to give it attention recently. IIRC there are some issues I need to work through.

  1. Most standard NumPy data types are first-class Arrow types, and the existing code handles conversion between them. Complex-number Arrow types are Extension Types, so their handling is a bit trickier; I haven't yet found a clean solution with the existing code or with newer code I've written (see the sketch after this list).
  2. I was very unsure about exposing C++ Extension Types in Python, mostly because working in the CPython layer is slower and more opaque. It seems you've pointed out the reverse issue: Python Extension Types are stripped out during conversion to the C++ layer.
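(Editorial illustration of point 1: primitive NumPy dtypes convert directly because they map to first-class Arrow types, while complex dtypes raise, as observed with recent pyarrow releases; the exact error type is an assumption.)

```python
import numpy as np
import pyarrow as pa

pa.array(np.array([1.0, 2.0]))        # float64 maps to first-class pa.float64()

try:
    pa.array(np.array([1.0 + 2.0j]))  # complex128 has no first-class Arrow type
except Exception as exc:              # pyarrow raises ArrowNotImplementedError here
    print(type(exc).__name__, exc)
```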

Any advice/help on the above 2 issues would be welcome. Note that I will likely only get another chance to look at this next year.

@jmdeschenes

Thanks for the update.

With some changes, I was able to get your branch to work and output a complex array. It is still preliminary. Let me know if you want to see it.

I think the best way forward is to let the CPP handle the extension types.

  1. Don't strip the Extension information on the Cython side.
  2. CPP could handle future extension types, each with its own potential Writer. Instead of stripping the Extension information on the Cython side, let the CPP handle it (e.g. if the extension is known, use its Writer; if it isn't, use the storage). This will require some rewriting on the CPP side, as the extension type is used in the conversion code, but I don't understand when it applies. To me this looks like the least intrusive option.

`fixed_size_list` arrays generate a numpy array of pyarrow arrays. If `fixed_size_list` were to yield a 2D numpy array (which is directly mappable), there would be little need for the extension side to be handled at all. I guess the current behaviour was chosen to ensure that the columns could be read into pandas. When calling `to_numpy` I would expect to get a 2D array.
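(Editorial illustration: a user-code sketch of the 2D conversion suggested above; `to_2d` is a hypothetical helper and assumes a primitive value type with no nulls.)

```python
import numpy as np
import pyarrow as pa

def to_2d(arr: pa.FixedSizeListArray) -> np.ndarray:
    width = arr.type.list_size            # fixed width of each list
    flat = arr.flatten().to_numpy(zero_copy_only=False)
    return flat.reshape(-1, width)        # one row per list element

fsl = pa.FixedSizeListArray.from_arrays(pa.array([1.0, 2.0, 3.0, 4.0]), 2)
print(to_2d(fsl))  # [[1. 2.] [3. 4.]]
```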

As a separate issue: when trying to handle complex arrays from the Python side using only extension arrays, it mostly works, but the extension arrays don't play nicely with ChunkedArrays:

```python
import pyarrow as pa
import pyarrow.parquet as pq
import numpy as np

class ComplexArray(pa.ExtensionArray):
    def to_numpy(self):
        array = self.storage.flatten().to_numpy(zero_copy_only=True).view('complex128')
        return array

class ComplexType(pa.ExtensionType):
    def __init__(self):
        pa.ExtensionType.__init__(self, pa.list_(pa.float64(), 2), "arrow.complex128")

    def __arrow_ext_serialize__(self):
        # since we don't have a parameterized type, we don't need extra
        # metadata to be deserialized
        return b''

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        # return an instance of this subclass given the serialized
        # metadata.
        return ComplexType()

    def __arrow_ext_class__(self):
        return ComplexArray

complex128 = ComplexType()
pa.register_extension_type(complex128)

array = np.array([1.0+1j, 2.0+2.0j, 3.0+3.0j]).view('float64')
float_array = pa.array(array, type=pa.float64())
list_array = pa.FixedSizeListArray.from_arrays(float_array, 2)
pa_array = ComplexArray.from_storage(complex128, list_array)
# This will yield the correct Numpy Array
assert (pa_array.to_numpy() == array).all()
```

It works, but we get into issues when writing/reading Parquet:

```python
table = pa.Table.from_arrays([pa_array], names=['cpl'])
pq.write_table(table, "complex_test.parquet")
cpl_df = pq.read_table("complex_test.parquet")
# This line will create a numpy array of pyarrow arrays
cpl_df['cpl'].to_numpy()
# If you have a single chunk, this will work
assert (cpl_df['cpl'].chunks[0].to_numpy() == array).all()

# Otherwise the solution is more involved
start_length = 0
end_length = 0
max_length = cpl_df['cpl'].length()
final_numpy = np.zeros(shape=(max_length), dtype=np.complex128)
for chunk in cpl_df['cpl'].iterchunks():
    temp_numpy = chunk.to_numpy()
    end_length += temp_numpy.shape[0]
    final_numpy[start_length:end_length] = temp_numpy
    start_length = end_length
assert (final_numpy == array).all()
```
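(Editorial note: a more compact variant of the chunk loop above, assuming every chunk converts cleanly via `to_numpy`.)

```python
final_numpy = np.concatenate(
    [chunk.to_numpy() for chunk in cpl_df['cpl'].iterchunks()]
)
assert (final_numpy == array).all()
```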

@sjperkins (Contributor, Author)

> With some changes, I was able to get your branch to work and output a complex array. It is still preliminary. Let me know if you want to see it.

Thanks, yes I would like to take a look.

> I think the best way forward is to let the CPP handle the extension types.

> 1. Don't strip the Extension information on the Cython side.
>
> 2. CPP could handle future extension types, each with its own potential `Writer`. Instead of stripping the Extension information on the Cython side, let the CPP handle it (e.g. if the extension is known, use its Writer; if it isn't, use the storage). This will require some rewriting on the CPP side, as the extension type is used in the conversion code, but I don't understand when it applies. To me this looks like the least intrusive option.

Yes, I agree that this is the way forward. Mirroring what you've described back to you, this will necessitate API changes to the C++ Extension Framework to support Custom Writers?

> `fixed_size_list` arrays generate a numpy array of pyarrow arrays. If `fixed_size_list` were to yield a 2D numpy array (which is directly mappable), there would be little need for the extension side to be handled at all. I guess the current behaviour was chosen to ensure that the columns could be read into pandas. When calling `to_numpy` I would expect to get a 2D array.
>
> As a separate issue: when trying to handle complex arrays from the Python side using only extension arrays, it mostly works, but the extension arrays don't play nicely with ChunkedArrays:

> [code examples quoted from the previous comment]

It does work quite nicely in Python. In fact, I implemented something similar in a repository I'm working on: https://github.com/ska-sa/dask-ms/blob/55c987e5f00c24a82c16363f681a966e45590e06/daskms/experimental/arrow/extension_types.py#L145-L148. IIRC I ran into similar issues with chunked Extension Arrays.

Do you have an idea as to how this should be handled in the C++ layer?

@sjperkins (Contributor, Author)

> With some changes, I was able to get your branch to work and output a complex array. It is still preliminary. Let me know if you want to see it.

Hi @jmdeschenes, I'll be devoting some time to this PR over the next few weeks. Would it be possible to see your branch?

@pitrou (Member) commented Oct 5, 2022

For the record, we now have an official procedure for standardizing extension types. If you would like to resurrect this, you should post a proposal (or start a discussion about such a proposal) on the mailing list.

@amol- (Member) commented Mar 30, 2023

Closing because it has been untouched for a while; in case it's still relevant, feel free to reopen and move it forward 👍

@amol- closed this Mar 30, 2023
@sjperkins (Contributor, Author)

This PR is very out of date; a new PR supporting this functionality will be required.
