Option::unwrap() panic when loading Parquet files with non-matching schemas as a list of files #13436

Closed
alf239 opened this issue Jan 4, 2024 · 11 comments · Fixed by #17321
Labels: A-io-parquet (Area: reading/writing Parquet files), A-panic (Area: code that results in panic exceptions), accepted (Ready for implementation), bug (Something isn't working), P-low (Priority: low), python (Related to Python Polars)

alf239 commented Jan 4, 2024

Checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl

df1 = pl.DataFrame({
    'some-id': [1, 2, 3],
    'client': ['Alice Anders', 'Bob Baker', 'Charlie Chaplin'],
})
df2 = pl.DataFrame({
    'some-id': [4],
})

df1.write_parquet("/tmp/df1.parquet")
df2.write_parquet("/tmp/df2.parquet")

pl.read_parquet(["/tmp/df1.parquet", "/tmp/df2.parquet"], columns="some-id")

Log output

thread '<unnamed>' panicked at /home/runner/work/polars/polars/crates/polars-parquet/src/arrow/read/deserialize/mod.rs:141:31:
called `Option::unwrap()` on a `None` value
stack backtrace:
   0: rust_begin_unwind
             at /rustc/d6d7a93866f2ffcfb51828b8859bdad760b54ce0/library/std/src/panicking.rs:645:5
   1: core::panicking::panic_fmt
             at /rustc/d6d7a93866f2ffcfb51828b8859bdad760b54ce0/library/core/src/panicking.rs:72:14
   2: core::panicking::panic
             at /rustc/d6d7a93866f2ffcfb51828b8859bdad760b54ce0/library/core/src/panicking.rs:144:5
   3: polars_io::parquet::read_impl::column_idx_to_series
   4: rayon::iter::plumbing::bridge_producer_consumer::helper
   5: rayon_core::join::join_context::{{closure}}
   6: rayon::iter::plumbing::bridge_producer_consumer::helper
   7: rayon_core::join::join_context::{{closure}}
   8: rayon::iter::plumbing::bridge_producer_consumer::helper
   9: rayon_core::join::join_context::{{closure}}
  10: rayon::iter::plumbing::bridge_producer_consumer::helper
  11: rayon_core::join::join_context::{{closure}}
  12: rayon::iter::plumbing::bridge_producer_consumer::helper
  13: rayon_core::join::join_context::{{closure}}
  14: rayon::iter::plumbing::bridge_producer_consumer::helper
  15: rayon_core::join::join_context::{{closure}}
  16: rayon::iter::plumbing::bridge_producer_consumer::helper
  17: <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute
  18: rayon_core::registry::WorkerThread::wait_until_cold
  19: rayon_core::join::join_context::{{closure}}
  20: rayon::iter::plumbing::bridge_producer_consumer::helper
  21: rayon_core::join::join_context::{{closure}}
  22: rayon::iter::plumbing::bridge_producer_consumer::helper
  23: rayon_core::join::join_context::{{closure}}
  24: rayon::iter::plumbing::bridge_producer_consumer::helper
  25: <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute
  26: rayon_core::registry::WorkerThread::wait_until_cold
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
[Three more worker threads panicked at the same location (mod.rs:141:31); their near-identical rayon backtraces are elided for brevity.]
---------------------------------------------------------------------------
PanicException                            Traceback (most recent call last)
Cell In[3], line 2
      1 start = time.time()
----> 2 data = pl.read_parquet(['/mnt/long/path/1/some_data.parquet',
      3  '/mnt/long/path/2/some_data.parquet']
      4                       , columns=['SomeId']
      5                       )
      6 end = time.time()
      7 print(end - start)

File /home/Project/venv/lib/python3.9/site-packages/polars/io/parquet/functions.py:183, in read_parquet(source, columns, n_rows, row_count_name, row_count_offset, parallel, use_statistics, hive_partitioning, rechunk, low_memory, storage_options, retries, use_pyarrow, pyarrow_options, memory_map)
    180         columns = [lf.columns[i] for i in columns]
    181     lf = lf.select(columns)
--> 183 return lf.collect(no_optimization=True)

File /home/Project/venv/lib/python3.9/site-packages/polars/lazyframe/frame.py:1749, in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, slice_pushdown, comm_subplan_elim, comm_subexpr_elim, no_optimization, streaming, background, _eager)
   1746 if background:
   1747     return InProcessQuery(ldf.collect_concurrently())
-> 1749 return wrap_df(ldf.collect())

PanicException: called `Option::unwrap()` on a `None` value

Issue description

A simple

data = pl.read_parquet(['/mnt/long/path/1/some_data.parquet',  '/mnt/long/path/2/some_data.parquet']
                      , columns=['SomeId']
                      )

script errors out with the message/backtrace above.

At the same time, two separate reads,

for f in ['/mnt/long/path/1/some_data.parquet',  '/mnt/long/path/2/some_data.parquet']:
    print(pl.read_parquet(f, columns=['SomeId']))

succeed, each outputting a one-row (and one-column) dataframe.

Not every pair triggers this; I have some 7,000 files, and this is just the smallest range exhibiting the behaviour. It's not the only such pair, but I didn't isolate further.

Apologies, I know it's not exactly a good report, but hopefully it'll point towards something interesting and the world will be better. And Happy New Year!
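Until this is fixed, a workaround consistent with the per-file reads above is to load each file separately and combine the results with a schema-tolerant concatenation. A minimal sketch, assuming pl.concat's "diagonal" strategy (which fills columns missing from any input with nulls); the paths are stand-ins for the real files:

import polars as pl

files = ['/mnt/long/path/1/some_data.parquet', '/mnt/long/path/2/some_data.parquet']

# Read each file on its own (which succeeds, per the report above), then
# concatenate diagonally so any schema differences are unified with nulls.
frames = [pl.read_parquet(f, columns=['SomeId']) for f in files]
data = pl.concat(frames, how='diagonal')
print(data)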

Expected behavior

Polars loads a dataframe with 2 rows

Installed versions

--------Version info---------
Polars:               0.20.3
Index type:           UInt32
Platform:             Linux-5.15.0-76-generic-x86_64-with-glibc2.31
Python:               3.9.18 (main, Aug 25 2023, 13:20:04) 
[GCC 9.4.0]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          3.0.0
connectorx:           <not installed>
deltalake:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           3.8.1
numpy:                1.26.1
openpyxl:             <not installed>
pandas:               2.0.3
pyarrow:              14.0.1
pydantic:             2.5.2
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
alf239 added the bug (Something isn't working) and python (Related to Python Polars) labels on Jan 4, 2024
@mcrumiller (Contributor)

I'm trying to hit the failing code path and having trouble without a minimal repro. Can you provide some of the expected datatypes of the files that are failing? How many columns do they have?

alf239 (Author) commented Jan 4, 2024

Ah, good point, let me see...

For the example above, the schemas are,

OrderedDict([('SomeId', Int32)])
OrderedDict([('SomeId', Int32)])

but then probably the problem is due to the files having different schemas, one having 2 extra columns — I hoped that explicitly specifying the columns would work around that problem (and in 0.19.5 that worked when using globs), but apparently not here.
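A quick way to confirm such a mismatch is to compare the per-file schemas directly. A minimal sketch, assuming pl.read_parquet_schema (which inspects only the file metadata, not the data); the paths are stand-ins:

import polars as pl

paths = ['/mnt/long/path/1/some_data.parquet', '/mnt/long/path/2/some_data.parquet']

# Map each path to its {column name: dtype} schema.
schemas = {p: pl.read_parquet_schema(p) for p in paths}
for p, schema in schemas.items():
    print(p, schema)

# Columns that are not present in every file point at the mismatch.
column_sets = [set(s) for s in schemas.values()]
print('not in every file:', set.union(*column_sets) - set.intersection(*column_sets))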

alf239 (Author) commented Jan 4, 2024

One file has 3 extra String columns scattered, all past the SomeId; let me see if I can get an example isolated.

@mcrumiller (Contributor)

> For the example above, the schemas are,
>
> OrderedDict([('SomeId', Int32)])
> OrderedDict([('SomeId', Int32)])
>
> but then probably the problem is due to the files having different schemas

Hang on, those schemas look identical, how are they different?

alf239 changed the title from "[No repro] Option::unwrap() panic when loading some Parquet files as a list of files" to "Option::unwrap() panic when loading Parquet files with non-matching schemas as a list of files" on Jan 4, 2024
alf239 (Author) commented Jan 4, 2024

Yep, here you go:

import polars as pl

df1 = pl.DataFrame({
    'some-id': [1, 2, 3],
    'client': ['Alice Anders', 'Bob Baker', 'Charlie Chaplin'],
})
df2 = pl.DataFrame({
    'some-id': [4],
})

df1.write_parquet("/tmp/df1.parquet")
df2.write_parquet("/tmp/df2.parquet")

pl.read_parquet(["/tmp/df1.parquet", "/tmp/df2.parquet"], columns="some-id")

adjusting the description, too

alf239 (Author) commented Jan 4, 2024

It might look unreasonable, but then Polars 0.19.5 could do

import polars as pl

df1 = pl.DataFrame({
    'some-id': [1, 2, 3],
    'client': ['Alice Anders', 'Bob Baker', 'Charlie Chaplin'],
})
df2 = pl.DataFrame({
    'some-id': [4],
})

df1.write_parquet("/tmp/df1.parquet")
df2.write_parquet("/tmp/df2.parquet")

pl.read_parquet("/tmp/df*.parquet", columns="some-id")

successfully

(I do remember https://xkcd.com/1172/)

@mcrumiller (Contributor)

I've got it. FYI you're not using the pyarrow path, so for the time being try use_pyarrow=True, which may be a bit faster and doesn't trigger this error. If you don't have pyarrow installed, run pip install pyarrow.
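Applied to the reproducible example above, the suggestion amounts to the following (a sketch; per this thread, the pyarrow-backed path does not hit the panicking native reader):

import polars as pl

data = pl.read_parquet(
    ["/tmp/df1.parquet", "/tmp/df2.parquet"],
    columns=["some-id"],
    use_pyarrow=True,  # route the read through pyarrow instead of the native reader
)
print(data)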

alf239 (Author) commented Jan 4, 2024

In 0.20.3 the glob example panics

alf239 (Author) commented Jan 4, 2024

use_pyarrow=True solves my problem, thank you! It doesn't support globs though (again, nothing blocking me here, just filling in the details)
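Since the pyarrow path takes explicit paths rather than glob patterns, the glob can be expanded in Python first. A minimal sketch using the standard-library glob module:

import glob

import polars as pl

# Expand the pattern manually, then hand the explicit file list to the
# pyarrow-backed reader, which does not accept globs itself.
files = sorted(glob.glob("/tmp/df*.parquet"))
data = pl.read_parquet(files, columns=["some-id"], use_pyarrow=True)
print(data)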

@mcrumiller (Contributor)

I've opened a more general issue at #13438 that will address this problem.

stinodego added the needs triage (Awaiting prioritization by a maintainer) label on Jan 13, 2024
stinodego added the A-io-parquet (Area: reading/writing Parquet files) label on Jan 21, 2024
stinodego added the A-panic (Area: code that results in panic exceptions) label on Jun 17, 2024
nameexhaustion added the accepted (Ready for implementation) and P-low (Priority: low) labels and removed the needs triage label on Jun 19, 2024
@nameexhaustion (Collaborator)

I want to get to this eventually, given the recently introduced support for reading from directories.
