ARROW-638: [C++] Complex Number Support via ExtensionTypes #10565
Conversation
Perhaps @jorisvandenbossche can answer the previous message, when he's back from vacation :-)
That might have been an oversight in the PR that introduced it. I don't think we currently have Python tests using an extension type that is purely implemented in C++ (i.e. not using the (Py)ExtensionType subclassing mechanisms), although we should probably have such tests. See `arrow/python/pyarrow/types.pxi`, lines 776 to 784 at c51e4a1.
Hello, there is an issue with the approach. In `array.pxi`:

```python
cdef class ExtensionArray(Array):
    """
    Concrete class for Arrow extension arrays.
    """

    @property
    def storage(self):
        cdef:
            CExtensionArray* ext_array = <CExtensionArray*>(self.ap)

        return pyarrow_wrap_array(ext_array.storage())

    #
    ## LINES SKIPPED
    #

    def to_numpy(self, **kwargs):
        """
        Convert extension array to a numpy ndarray.

        See Also
        --------
        Array.to_numpy
        """
        return self.storage.to_numpy(**kwargs)
```

And in `table.pxi`:

```python
    def to_numpy(self):
        """
        Return a NumPy copy of this array (experimental).

        Returns
        -------
        array : numpy.ndarray
        """
        cdef:
            PyObject* out
            PandasOptions c_options
            object values

        if self.type.id == _Type_EXTENSION:
            storage_array = chunked_array(
                [chunk.storage for chunk in self.iterchunks()],
                type=self.type.storage_type
            )
            # ... (snippet truncated)
```

Both of these "strip" the extension type before it is sent to the C++ code. As such, the C++ code never knows that it is dealing with an extension. If this approach is to be kept, `fixed_size_list` would need to convert into a proper 2D numpy array (that could have several benefits, and it could be done only for primitive types at the start). @jorisvandenbossche, do you think that is something that could be acceptable? Otherwise, letting the C++ code handle the extension type could be another option.

@sjperkins Are you still working on this PR? Is there something I can help you with?
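The conversion discussed here can be illustrated in plain NumPy, independent of Arrow: the flat storage of a `fixed_size_list<float64>[2]` with no nulls is memory-compatible with both a 2D `(n, 2)` array and `complex128`, so both conversions are zero-copy views. A minimal sketch (not part of the PR, for illustration only):

```python
import numpy as np

# Interleaved (re, im) float64 pairs: the flat storage layout of a
# fixed_size_list<float64>[2] array with no nulls.
interleaved = np.array([1.0, 1.0, 2.0, 2.0, 3.0, 3.0])

# Zero-copy view as complex128: each consecutive pair becomes one element.
as_complex = interleaved.view(np.complex128)

# Zero-copy reshape into the "proper 2D numpy array" form, one row per list.
as_2d = interleaved.reshape(-1, 2)
```

Both `as_complex` and `as_2d` share memory with the original buffer, so no data is copied in either direction.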
Thanks for checking in @jmdeschenes. I'm still working on it, but I haven't been able to give it attention recently. IIRC there are some issues I need to work through.

Any advice/help on the above two issues would be welcome. Note that I will likely only get another chance to look at this next year.
Thanks for the update. With some changes, I was able to get your branch to work and output a complex array. It is still preliminary; let me know if you want to see it. I think the best way forward is to let the C++ side handle the extension types.

`fixed_size_list` arrays currently generate a numpy array of pyarrow arrays when converted. If `fixed_size_list` were to yield a 2D numpy array (which is directly mappable), there would be little need for the extension side to be handled at all. I guess this was done to ensure that the columns could be read into pandas.

As a separate issue, when trying to handle complex arrays from the Python side using only extension arrays: it mostly works, but the extension arrays don't play nice with ChunkedArrays.

```python
import pyarrow as pa
import pyarrow.parquet as pq
import numpy as np


class ComplexArray(pa.ExtensionArray):
    def to_numpy(self):
        array = self.storage.flatten().to_numpy(zero_copy_only=True).view('complex128')
        return array


class ComplexType(pa.ExtensionType):
    def __init__(self):
        pa.ExtensionType.__init__(self, pa.list_(pa.float64(), 2), "arrow.complex128")

    def __arrow_ext_serialize__(self):
        # since we don't have a parameterized type, we don't need extra
        # metadata to be deserialized
        return b''

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        # return an instance of this subclass given the serialized metadata
        return ComplexType()

    def __arrow_ext_class__(self):
        return ComplexArray


complex128 = ComplexType()
pa.register_extension_type(complex128)

array = np.array([1.0 + 1.0j, 2.0 + 2.0j, 3.0 + 3.0j])
float_array = pa.array(array.view('float64'), type=pa.float64())
list_array = pa.FixedSizeListArray.from_arrays(float_array, 2)
pa_array = ComplexArray.from_storage(complex128, list_array)

# This will yield the correct numpy array
assert (pa_array.to_numpy() == array).all()
```

It works, but we get into issues when writing to and reading from parquet:

```python
table = pa.Table.from_arrays([pa_array], names=['cpl'])
pq.write_table(table, "complex_test.parquet")
cpl_df = pq.read_table("complex_test.parquet")

# This line will create a numpy array of pyarrow arrays
cpl_df['cpl'].to_numpy()

# If you have a single chunk, this will work
assert (cpl_df['cpl'].chunks[0].to_numpy() == array).all()

# Otherwise the solution is more involved
start_length = 0
end_length = 0
max_length = cpl_df['cpl'].length()
final_numpy = np.zeros(shape=(max_length,), dtype=np.complex128)
for chunk in cpl_df['cpl'].iterchunks():
    temp_numpy = chunk.to_numpy()
    end_length += temp_numpy.shape[0]
    final_numpy[start_length:end_length] = temp_numpy
    start_length = end_length

assert (final_numpy == array).all()
```
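As a side note, the manual start/end index bookkeeping in the per-chunk loop above can be expressed more compactly with `np.concatenate`. A sketch of a generic helper (`concat_chunks` is a hypothetical name, not part of pyarrow); it only assumes each chunk can be turned into an ndarray, so for pyarrow chunks one would map `chunk.to_numpy()` instead of `np.asarray`:

```python
import numpy as np


def concat_chunks(chunks, dtype=np.complex128):
    # Convert each chunk to a NumPy array and concatenate the results.
    # np.concatenate allocates the output once and copies each chunk in
    # turn, matching the manual loop's behavior.
    parts = [np.asarray(chunk) for chunk in chunks]
    if not parts:
        return np.empty(0, dtype=dtype)
    return np.concatenate(parts)
```

For example, `concat_chunks([np.array([1 + 1j]), np.array([2 + 2j, 3 + 3j])])` yields the same flattened result as the loop above.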
Thanks, yes, I would like to take a look.

Yes, I agree that this is the way forward. Mirroring what you've described back to you: this will necessitate API changes to the C++ extension framework to support custom writers?

It does work quite nicely in Python. In fact, I implemented something similar in a repository I'm working on: https://github.com/ska-sa/dask-ms/blob/55c987e5f00c24a82c16363f681a966e45590e06/daskms/experimental/arrow/extension_types.py#L145-L148. IIRC I ran into similar issues with chunked extension arrays. Do you have an idea as to how this should be handled in the C++ layer?
Hi @jmdeschenes, I'll be devoting some time to this PR over the next few weeks. Would it be possible to see your branch?
For the record, we now have an official procedure for standardizing extension types. If you would like to resurrect this, you should post a proposal (or start a discussion about such a proposal) on the ML.
Closing because it has been untouched for a while; in case it's still relevant, feel free to reopen and move it forward 👍
This PR is very out of date; a new PR supporting this functionality will be required.
xref https://issues.apache.org/jira/browse/ARROW-638
#10452 contains the PR implementing Complex Numbers as first-class types.