Panic when mismatching types between glob files #17311

Closed
2 tasks done
failable opened this issue Jun 30, 2024 · 8 comments
Labels
bug, needs triage, python

Comments

@failable

failable commented Jun 30, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl

pl.scan_parquet("data/*.parquet").select(["Genericname", "Diagnosis"]).drop_nulls().collect()

Log output

(med-data) user@macos:~/git/med-data $ POLARS_VERBOSE=1 rp
Python 3.10.11 (main, May  7 2023, 18:32:37) [Clang 16.0.3 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import polars as pl
>>> pl.scan_parquet("data/*.parquet").select(["Genericname", "Diagnosis"]).drop_nulls().collect()
thread 'thread 'polars-4polars-0' panicked at ' panicked at crates/polars-parquet/src/arrow/read/statistics/mod.rs/rustc/ab14f944afe4234db378ced3801e637eae6c0f30/library/core/src/ops/function.rs::376250::435:
:
called `Option::unwrap()` on a `None` valueExpected Statistics to be BinaryStatistics, found PrimitiveStatistics<i64> instead

stack backtrace:
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
   0:        0x1114238c7 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h5b162cab46f344a5
   1:        0x10ec8239b - core::fmt::write::h4a73583a3886d3b0
   2:        0x1113f4d9e - std::io::Write::write_fmt::h8846f8d604484bad
   3:        0x1114279d1 - std::sys_common::backtrace::print::h7eceb11702f657b6
   4:        0x111427269 - std::panicking::default_hook::{{closure}}::he179a4d2e5ce811d
   5:        0x111428f93 - std::panicking::rust_panic_with_hook::hbfe888ce2af6ee0d
   6:        0x111427cda - std::panicking::begin_panic_handler::{{closure}}::h2461a6874e053e43
   7:        0x111427c69 - std::sys_common::backtrace::__rust_end_short_backtrace::h1c49106eba8c7b96
   8:        0x111427c56 - _rust_begin_unwind
   9:        0x1115e3812 - core::panicking::panic_fmt::ha4b3f782c24c0530
  10:        0x1115e38e4 - core::panicking::panic::hb3e838924bd2f646
  11:        0x1115e3ca8 - core::option::unwrap_failed::h8fd98a81a93ecfe7
  12:        0x110973982 - polars_parquet::arrow::read::statistics::push::h9d0d5787bd19f3b8
  13:        0x10fb59d44 - polars_io::parquet::read::predicates::read_this_row_group::haa1c0f7f42e29dac
  14:        0x10fb5c39d - polars_io::parquet::read::read_impl::rg_to_dfs::h372c4af67b7d657a
  15:        0x10ffbe335 - rayon::iter::plumbing::bridge_producer_consumer::helper::h1e5b6564c8c35e3b
  16:        0x10ffc0227 - <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute::hb75d5f7e45293dab
  17:        0x111931120 - rayon_core::registry::WorkerThread::wait_until_cold::hfdbee4fa4f1f2f01
  18:        0x1111ce84f - std::sys_common::backtrace::__rust_begin_short_backtrace::h2512fac84c638486
  19:        0x1111ce63c - core::ops::function::FnOnce::call_once{{vtable.shim}}::h47bee49a5d47d169
  20:        0x11142c24b - std::sys::pal::unix::thread::Thread::new::thread_start::h176c25cd13ced921
  21:     0x7ff803c5d18b - __pthread_start
stack backtrace:
   0:        0x1114238c7 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h5b162cab46f344a5
   1:        0x10ec8239b - core::fmt::write::h4a73583a3886d3b0
   2:        0x1113f4d9e - std::io::Write::write_fmt::h8846f8d604484bad
   3:        0x1114279d1 - std::sys_common::backtrace::print::h7eceb11702f657b6
   4:        0x111427269 - std::panicking::default_hook::{{closure}}::he179a4d2e5ce811d
   5:        0x111428f93 - std::panicking::rust_panic_with_hook::hbfe888ce2af6ee0d
   6:        0x111427d12 - std::panicking::begin_panic_handler::{{closure}}::h2461a6874e053e43
   7:        0x111427c69 - std::sys_common::backtrace::__rust_end_short_backtrace::h1c49106eba8c7b96
   8:        0x111427c56 - _rust_begin_unwind
   9:        0x1115e3812 - core::panicking::panic_fmt::ha4b3f782c24c0530
  10:        0x1109767ce - core::ops::function::FnOnce::call_once::h5067212405562c9f
  11:        0x110972ffe - polars_parquet::arrow::read::statistics::push::h9d0d5787bd19f3b8
  12:        0x10fb59d44 - polars_io::parquet::read::predicates::read_this_row_group::haa1c0f7f42e29dac
  13:        0x10fb5c39d - polars_io::parquet::read::read_impl::rg_to_dfs::h372c4af67b7d657a
  14:        0x10ffbe335 - rayon::iter::plumbing::bridge_producer_consumer::helper::h1e5b6564c8c35e3b
  15:        0x10ffc0227 - <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute::hb75d5f7e45293dab
  16:        0x111931120 - rayon_core::registry::WorkerThread::wait_until_cold::hfdbee4fa4f1f2f01
  17:        0x1111ce84f - std::sys_common::backtrace::__rust_begin_short_backtrace::h2512fac84c638486
  18:        0x1111ce63c - core::ops::function::FnOnce::call_once{{vtable.shim}}::h47bee49a5d47d169
  19:        0x11142c24b - std::sys::pal::unix::thread::Thread::new::thread_start::h176c25cd13ced921
  20:     0x7ff803c5d18b - __pthread_start
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/user/git/med-data/.venv/lib/python3.10/site-packages/polars/lazyframe/frame.py", line 1967, in collect
    return wrap_df(ldf.collect(callback))
pyo3_runtime.PanicException: Expected Statistics to be BinaryStatistics, found PrimitiveStatistics<i64> instead
>>>

Issue description

The issue does not exist if I remove the drop_nulls part, e.g.

pl.scan_parquet("data/*.parquet").select(["Genericname", "Diagnosis"]).collect()

The issue also does not occur if I replace the glob with any single parquet file, e.g.

>>> pl.scan_parquet("data/2021-03-01-2021-09-05.parquet").select(["Genericname", "Diagnosis"]).drop_nulls().collect()
parquet file must be read, statistics not sufficient for predicate.
parquet file must be read, statistics not sufficient for predicate.
shape: (994_078, 2)
┌─────────────────────────────────┬──────────────────────────────┐
│ Genericname                     ┆ Diagnosis                    │
│ ---                             ┆ ---                          │
│ str                             ┆ str                          │
╞═════════════════════════════════╪══════════════════════════════╡
│ 灯盏生脉胶囊                    ┆ 类风湿性关节炎;心绞痛;银屑病 │
│ 头孢克洛分散片                  ┆ 皮肤感染;皮肤裂伤            │
│ 阿司匹林肠溶片;甲硝唑片;牙痛停  ┆ 牙周炎                       │
│ 滴丸                            ┆                              │
│ 奥硝唑分散片;头孢泊肟酯胶囊     ┆ 阑尾炎                       │
│ 玻璃酸钠滴眼液;肠胃宁片         ┆ 干眼症;泄泻病                │
│ …                               ┆ …                            │
│ 达格列净片;复方酮康唑发用洗剂   ┆ 糖尿病;头皮糠疹              │
│ 六神丸;维生素A软胶囊            ┆ 痤疮;咽炎                    │
│ 甲钴胺片;腰痛宁胶囊;依托考昔片  ┆ 腰椎病                       │
│ 急支糖浆;盐酸氨溴索糖浆         ┆ 上呼吸道感染                 │
│ 桂枝茯苓丸(浓缩水丸);血府逐瘀颗 ┆ 闭经;血瘀证                  │
│ 粒                              ┆                              │
└─────────────────────────────────┴──────────────────────────────┘
>>>

Expected behavior

No panics.

Installed versions

--------Version info---------
Polars: 0.20.31
Index type: UInt32
Platform: macOS-14.5-x86_64-i386-64bit
Python: 3.10.11 (main, May 7 2023, 18:32:37) [Clang 16.0.3 ]

----Optional dependencies----
adbc_driver_manager:
cloudpickle:
connectorx:
deltalake:
fastexcel:
fsspec: 2024.6.1
gevent:
hvplot:
matplotlib:
nest_asyncio:
numpy: 2.0.0
openpyxl: 3.1.5
pandas: 2.2.2
pyarrow: 16.1.0
pydantic:
pyiceberg:
pyxlsb:
sqlalchemy:
torch:
xlsx2csv:
xlsxwriter:

@failable failable added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Jun 30, 2024
@ritchie46
Member

@coastalwhite we don't have a repro, but we do have a panic on statistics unwrap. Maybe you know what it is?

@coastalwhite
Collaborator

It is difficult to see because the output of two threads is interleaved, but I think there are two panics here.

  • An unwrap of an Option::None at crates/polars-parquet/src/arrow/read/statistics/mod.rs:376.
  • A failed expect_as_binary, which I suspect is in crates/polars-parquet/src/arrow/read/statistics/mod.rs, somewhere between lines 527 and 532.

I don't see an immediate problem, but since the problem only happens when globbing there might be a schema mismatch?

@failable
Author

failable commented Jul 1, 2024

Hello, there are 3 files in total. Not sure if this information helps.

user@macos:~/git/med-data $ ll data/*-*-*-*-*-*.parquet
-rw-r--r-- 1 user staff 41M Oct 20  2021 data/2020-02-04-2020-11-01.parquet
-rw-r--r-- 1 user staff 35M Oct 20  2021 data/2020-11-01-2021-03-01.parquet
-rw-r--r-- 1 user staff 59M Oct 20  2021 data/2021-03-01-2021-09-05.parquet

user@macos:~/git/med-data $ rp
Python 3.10.11 (main, May  7 2023, 18:32:37) [Clang 16.0.3 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import polars as pl
>>> pl.scan_parquet("data/2020-02-04-2020-11-01.parquet").select(["Genericname", "Diagnosis"]).drop_nulls().collect()
shape: (789_880, 2)
┌─────────────────────────────────┬──────────────────────────┐
│ Genericname                     ┆ Diagnosis                │
│ ---                             ┆ ---                      │
│ str                             ┆ str                      │
╞═════════════════════════════════╪══════════════════════════╡
│ 磷酸奥司他韦颗粒                ┆ 预防性抗流行性感冒治疗   │
│ 硝苯地平控释片                  ┆ 原发性高血压             │
│ 富马酸替诺福韦二吡呋酯片        ┆ 慢性乙型肝炎             │
│ 苯磺酸氨氯地平片                ┆ 原发性高血压             │
│ 布地奈德福莫特罗粉吸入剂        ┆ 哮喘                     │
│ …                               ┆ …                        │
│ 金匮肾气丸;尿感宁颗粒           ┆ 尿路感染;肾气不足证      │
│ 地奈德乳膏;非洛地平缓释片;复方  ┆ 高血压病;气滞血瘀证;湿疹 │
│ 丹参滴丸                        ┆                          │
│ 牛黄解毒片;蒲地蓝消炎口服液;头  ┆ 牙龈炎                   │
│ 孢呋辛酯胶囊                    ┆                          │
│ 地奈德乳膏;替米沙坦片           ┆ 高血压;脂溢性皮炎        │
│ 头孢氨苄片                      ┆ 毛囊炎;中耳炎            │
└─────────────────────────────────┴──────────────────────────┘

>>> pl.scan_parquet("data/2020-11-01-2021-03-01.parquet").select(["Genericname", "Diagnosis"]).drop_nulls().collect()
shape: (601_951, 2)
┌─────────────────────────────────┬───────────────────────────────┐
│ Genericname                     ┆ Diagnosis                     │
│ ---                             ┆ ---                           │
│ str                             ┆ str                           │
╞═════════════════════════════════╪═══════════════════════════════╡
│ 复方酮康唑软膏;鲜竹沥           ┆ 皮肤真菌感染;上呼吸道感染     │
│ 甲钴胺分散片;双氯芬酸钠缓释胶囊 ┆ 腰椎间盘突出                  │
│ 氨溴特罗口服溶液;孟鲁司特钠片   ┆ 上呼吸道感染;上呼吸道过敏反应 │
│ 护肝片;心脑康胶囊               ┆ 肝气郁结证;瘀血阻络证         │
│ 奥硝唑片;双氯芬酸钠缓释胶囊;头  ┆ 慢性牙周炎                    │
│ 孢克洛分散片                    ┆                               │
│ …                               ┆ …                             │
│ 埃索美拉唑镁肠溶片;玻璃酸钠滴眼 ┆ 干眼症;十二指肠溃疡           │
│ 液                              ┆                               │
│ 丹黄祛瘀胶囊;散结镇痛胶囊       ┆ 血瘀证;子宫内膜异位症         │
│ 陈香露白露片                    ┆ 慢性胃炎;特指急性胃炎         │
│ 玉龙油                          ┆ 关节炎;痛风                   │
│ 罗红霉素片;清热散结片           ┆ 口腔溃疡;皮肤感染             │
└─────────────────────────────────┴───────────────────────────────┘

>>> pl.scan_parquet("data/2021-03-01-2021-09-05.parquet").select(["Genericname", "Diagnosis"]).drop_nulls().collect()
shape: (994_078, 2)
┌─────────────────────────────────┬──────────────────────────────┐
│ Genericname                     ┆ Diagnosis                    │
│ ---                             ┆ ---                          │
│ str                             ┆ str                          │
╞═════════════════════════════════╪══════════════════════════════╡
│ 灯盏生脉胶囊                    ┆ 类风湿性关节炎;心绞痛;银屑病 │
│ 头孢克洛分散片                  ┆ 皮肤感染;皮肤裂伤            │
│ 阿司匹林肠溶片;甲硝唑片;牙痛停  ┆ 牙周炎                       │
│ 滴丸                            ┆                              │
│ 奥硝唑分散片;头孢泊肟酯胶囊     ┆ 阑尾炎                       │
│ 玻璃酸钠滴眼液;肠胃宁片         ┆ 干眼症;泄泻病                │
│ …                               ┆ …                            │
│ 达格列净片;复方酮康唑发用洗剂   ┆ 糖尿病;头皮糠疹              │
│ 六神丸;维生素A软胶囊            ┆ 痤疮;咽炎                    │
│ 甲钴胺片;腰痛宁胶囊;依托考昔片  ┆ 腰椎病                       │
│ 急支糖浆;盐酸氨溴索糖浆         ┆ 上呼吸道感染                 │
│ 桂枝茯苓丸(浓缩水丸);血府逐瘀颗 ┆ 闭经;血瘀证                  │
│ 粒                              ┆                              │
└─────────────────────────────────┴──────────────────────────────┘

>>> pl.scan_parquet("data/2020-02-04-2020-11-01.parquet").select(["Genericname", "Diagnosis"]).describe()
shape: (9, 3)
┌────────────┬───────────────────────┬─────────────────────────────────┐
│ statistic  ┆ Genericname           ┆ Diagnosis                       │
│ ---        ┆ ---                   ┆ ---                             │
│ str        ┆ str                   ┆ str                             │
╞════════════╪═══════════════════════╪═════════════════════════════════╡
│ count      ┆ 789880                ┆ 789880                          │
│ null_count ┆ 0                     ┆ 0                               │
│ mean       ┆ null                  ┆ null                            │
│ std        ┆ null                  ┆ null                            │
│ min        ┆  ;特非那定片&#x0D     ┆     肠炎  ;上呼吸道感染         │
│ 25%        ┆ null                  ┆ null                            │
│ 50%        ┆ null                  ┆ null                            │
│ 75%        ┆ null                  ┆ null                            │
│ max        ┆ (畅迪5号)粉尘螨滴剂 ┆ A族高甘油三脂血症;高血压病;脑  │
│            ┆                       ┆ 梗死后遗症                      │
└────────────┴───────────────────────┴─────────────────────────────────┘

>>> pl.scan_parquet("data/2020-02-04-2020-11-01.parquet").describe()
shape: (9, 17)
┌────────────┬───────────────┬──────────────────────┬──────────────────────┬───┬─────────┬───────────┬───────────┬──────────────────────┐
│ statistic  ┆ Id            ┆ Genericname          ┆ Diagnosis            ┆ … ┆ Checker ┆ CheckTime ┆ Confirmer ┆ ConfirmTime          │
│ ---        ┆ ---           ┆ ---                  ┆ ---                  ┆   ┆ ---     ┆ ---       ┆ ---       ┆ ---                  │
│ str        ┆ f64           ┆ str                  ┆ str                  ┆   ┆ str     ┆ str       ┆ str       ┆ str                  │
╞════════════╪═══════════════╪══════════════════════╪══════════════════════╪═══╪═════════╪═══════════╪═══════════╪══════════════════════╡
│ count      ┆ 789880.0      ┆ 789880               ┆ 789880               ┆ … ┆ 601427  ┆ 601427    ┆ 601427    ┆ 601427               │
│ null_count ┆ 0.0           ┆ 0                    ┆ 0                    ┆ … ┆ 188453  ┆ 188453    ┆ 188453    ┆ 188453               │
│ mean       ┆ 419382.573516 ┆ null                 ┆ null                 ┆ … ┆ null    ┆ null      ┆ null      ┆ null                 │
│ std        ┆ 234739.505744 ┆ null                 ┆ null                 ┆ … ┆ null    ┆ null      ┆ null      ┆ null                 │
│ min        ┆ 1.0           ┆  ;特非那定片&#x0D    ┆ 肠炎  ;上呼吸道感染  ┆ … ┆ 何黎敏  ┆ 2020/10/1 ┆ 何黎敏    ┆ 2020/10/1 15:00:13   │
│            ┆               ┆                      ┆                      ┆   ┆         ┆ 14:29:43  ┆           ┆                      │
│ 25%        ┆ 217296.0      ┆ null                 ┆ null                 ┆ … ┆ null    ┆ null      ┆ null      ┆ null                 │
│ 50%        ┆ 421387.0      ┆ null                 ┆ null                 ┆ … ┆ null    ┆ null      ┆ null      ┆ null                 │
│ 75%        ┆ 622312.0      ┆ null                 ┆ null                 ┆ … ┆ null    ┆ null      ┆ null      ┆ null                 │
│ max        ┆ 823498.0      ┆ (畅迪5号)粉尘螨滴  ┆ A族高甘油三脂血症;  ┆ … ┆ 黄羡    ┆ 2020/9/30 ┆ 黄羡      ┆ 2020/9/30 15:11:42   │
│            ┆               ┆ 剂                   ┆ 高血压病;脑梗死后遗  ┆   ┆         ┆ 15:22:40  ┆           ┆                      │
│            ┆               ┆                      ┆ 症                   ┆   ┆         ┆           ┆           ┆                      │
└────────────┴───────────────┴──────────────────────┴──────────────────────┴───┴─────────┴───────────┴───────────┴──────────────────────┘

>>> pl.scan_parquet("data/2020-02-04-2020-11-01.parquet").columns
['Id', 'Genericname', 'Diagnosis', 'InquiryId', 'CreateTime', 'UpdateTime', 'InqCount', 'Level', 'UpdateBy', 'Creater', 'Platform', 'Remark', 'Checker', 'CheckTime', 'Confirmer', 'ConfirmTime']

>>> pl.scan_parquet("data/2020-11-01-2021-03-01.parquet").columns
['Id', 'Genericname', 'Diagnosis', 'InquiryId', 'CreateTime', 'UpdateTime', 'InqCount', 'Level', 'UpdateBy', 'Creater', 'Platform', 'Remark', 'Checker', 'CheckTime', 'Confirmer', 'ConfirmTime']

>>> pl.scan_parquet("data/2021-03-01-2021-09-05.parquet").columns
['Id', 'Genericname', 'Diagnosis', 'InquiryId', 'CreateTime', 'UpdateTime', 'InqCount', 'Level', 'UpdateBy', 'Creater', 'Platform', 'Remark', 'Checker', 'CheckTime', 'Confirmer', 'ConfirmTime']

>>> pl.scan_parquet("data/*-*-*-*-*-*.parquet").columns
['Id', 'Genericname', 'Diagnosis', 'InquiryId', 'CreateTime', 'UpdateTime', 'InqCount', 'Level', 'UpdateBy', 'Creater', 'Platform', 'Remark', 'Checker', 'CheckTime', 'Confirmer', 'ConfirmTime']

>>> pl.scan_parquet("data/*-*-*-*-*-*.parquet").drop_nulls().columns
['Id', 'Genericname', 'Diagnosis', 'InquiryId', 'CreateTime', 'UpdateTime', 'InqCount', 'Level', 'UpdateBy', 'Creater', 'Platform', 'Remark', 'Checker', 'CheckTime', 'Confirmer', 'ConfirmTime']

>>> pl.scan_parquet("data/*-*-*-*-*-*.parquet").drop_nulls().collect().columns
thread 'polars-1' panicked at /rustc/ab14f944afe4234db378ced3801e637eae6c0f30/library/core/src/ops/function.rs:250:5:
Expected Statistics to be BinaryStatistics, found PrimitiveStatistics<i64> instead
stack backtrace:
   0:        0x1113238c7 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h5b162cab46f344a5
   1:        0x10eb8239b - core::fmt::write::h4a73583a3886d3b0
   2:        0x1112f4d9e - std::io::Write::write_fmt::h8846f8d604484bad
   3:        0x1113279d1 - std::sys_common::backtrace::print::h7eceb11702f657b6
   4:        0x111327269 - std::panicking::default_hook::{{closure}}::he179a4d2e5ce811d
   5:        0x111328f93 - std::panicking::rust_panic_with_hook::hbfe888ce2af6ee0d
   6:        0x111327d12 - std::panicking::begin_panic_handler::{{closure}}::h2461a6874e053e43
   7:        0x111327c69 - std::sys_common::backtrace::__rust_end_short_backtrace::h1c49106eba8c7b96
   8:        0x111327c56 - _rust_begin_unwind
   9:        0x1114e3812 - core::panicking::panic_fmt::ha4b3f782c24c0530
  10:        0x1108767ce - core::ops::function::FnOnce::call_once::h5067212405562c9f
  11:        0x110872ffe - polars_parquet::arrow::read::statistics::push::h9d0d5787bd19f3b8
  12:        0x10fa59d44 - polars_io::parquet::read::predicates::read_this_row_group::haa1c0f7f42e29dac
  13:        0x10fa5c39d - polars_io::parquet::read::read_impl::rg_to_dfs::h372c4af67b7d657a
  14:        0x10febe335 - rayon::iter::plumbing::bridge_producer_consumer::helper::h1e5b6564c8c35e3b
  15:        0x10febf682 - rayon_core::join::join_context::{{closure}}::hdb785e885a11ecf5
  16:        0x10febec58 - rayon::iter::plumbing::bridge_producer_consumer::helper::h1e5b6564c8c35e3b
  17:        0x10fec0227 - <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute::hb75d5f7e45293dab
  18:        0x111831120 - rayon_core::registry::WorkerThread::wait_until_cold::hfdbee4fa4f1f2f01
  19:        0x1110ce84f - std::sys_common::backtrace::__rust_begin_short_backtrace::h2512fac84c638486
  20:        0x1110ce63c - core::ops::function::FnOnce::call_once{{vtable.shim}}::h47bee49a5d47d169
  21:        0x11132c24b - std::sys::pal::unix::thread::Thread::new::thread_start::h176c25cd13ced921
  22:     0x7ff803c5d18b - __pthread_start
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/user/git/med-data/.venv/lib/python3.10/site-packages/polars/lazyframe/frame.py", line 1967, in collect
    return wrap_df(ldf.collect(callback))
pyo3_runtime.PanicException: Expected Statistics to be BinaryStatistics, found PrimitiveStatistics<i64> instead
>>> 

@coastalwhite
Collaborator

One thing I notice here is that there are columns that are typed as strings but contain numbers. Could it maybe be that one of the files has the same column but with different types?
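
A minimal sketch of how this could be checked: read only the schema of each file and report the columns whose dtype is not the same in every file. `pl.read_parquet_schema` is an existing Polars function; the helper names `find_dtype_conflicts` and `read_schemas` are made up for illustration, and the file paths in the usage comment are assumed to be the three files listed earlier in this thread.

```python
from collections import defaultdict


def find_dtype_conflicts(schemas):
    """Return {column: {dtype: [files]}} for columns whose dtype differs.

    `schemas` maps a file path to its {column name: dtype} mapping.
    Dtypes are compared by their string form, so plain strings work too.
    """
    seen = defaultdict(lambda: defaultdict(list))
    for path, schema in schemas.items():
        for column, dtype in schema.items():
            seen[column][str(dtype)].append(path)
    # Keep only the columns that appear with more than one dtype.
    return {
        column: dict(by_dtype)
        for column, by_dtype in seen.items()
        if len(by_dtype) > 1
    }


def read_schemas(paths):
    """Hypothetical helper: read the schema of each parquet file with Polars."""
    import polars as pl  # imported lazily so the function above has no dependencies

    return {path: pl.read_parquet_schema(path) for path in paths}


# Usage (assumes the three files from this thread exist locally):
# files = [
#     "data/2020-02-04-2020-11-01.parquet",
#     "data/2020-11-01-2021-03-01.parquet",
#     "data/2021-03-01-2021-09-05.parquet",
# ]
# print(find_dtype_conflicts(read_schemas(files)))
```

An empty result would mean every file agrees on the dtypes of its columns; any entry names a column that the glob cannot read with a single schema.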

@failable
Author

failable commented Jul 1, 2024

That seems to be the issue.

>>> files = ["data/2020-02-04-2020-11-01.parquet", "data/2020-11-01-2021-03-01.parquet", "data/2021-03-01-2021-09-05.parquet"]

>>> pl.scan_parquet("data/*-*-*-*-*-*.parquet").columns
['Id', 'Genericname', 'Diagnosis', 'InquiryId', 'CreateTime', 'UpdateTime', 'InqCount', 'Level', 'UpdateBy', 'Creater', 'Platform', 'Remark', 'Checker', 'CheckTime', 'Confirmer', 'ConfirmTime']

>>> pl.scan_parquet("data/*-*-*-*-*-*.parquet").collect().row(0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/user/git/med-data/.venv/lib/python3.10/site-packages/polars/lazyframe/frame.py", line 1967, in collect
    return wrap_df(ldf.collect(callback))
polars.exceptions.ComputeError: not implemented: reading parquet type Int64 to Utf8View still not implemented

>>> pl.scan_parquet("data/*-*-*-*-*-*.parquet").collect()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/user/git/med-data/.venv/lib/python3.10/site-packages/polars/lazyframe/frame.py", line 1967, in collect
    return wrap_df(ldf.collect(callback))
polars.exceptions.ComputeError: not implemented: reading parquet type Int64 to Utf8View still not implemented

>>> import pandas as pd
>>> for f in files:
...     df = pd.read_parquet(f)
...     print(df.info())
... 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 789880 entries, 0 to 789879
Data columns (total 16 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   Id           789880 non-null  int64 
 1   Genericname  789880 non-null  object
 2   Diagnosis    789880 non-null  object
 3   InquiryId    789880 non-null  object
 4   CreateTime   789880 non-null  object
 5   UpdateTime   9165 non-null    object
 6   InqCount     789880 non-null  int64 
 7   Level        789880 non-null  int64 
 8   UpdateBy     254 non-null     object
 9   Creater      789880 non-null  object
 10  Platform     712729 non-null  object
 11  Remark       15070 non-null   object
 12  Checker      601427 non-null  object
 13  CheckTime    601427 non-null  object
 14  Confirmer    601427 non-null  object
 15  ConfirmTime  601427 non-null  object
dtypes: int64(3), object(13)
memory usage: 96.4+ MB
None

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 601952 entries, 0 to 601951
Data columns (total 16 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   Id           601952 non-null  int64 
 1   Genericname  601952 non-null  object
 2   Diagnosis    601951 non-null  object
 3   InquiryId    601952 non-null  int64 <----------------------------------------- DIFFERENCE
 4   CreateTime   601952 non-null  object
 5   UpdateTime   1857 non-null    object
 6   InqCount     601952 non-null  int64 
 7   Level        601952 non-null  int64 
 8   UpdateBy     91 non-null      object
 9   Creater      601952 non-null  object
 10  Platform     599108 non-null  object
 11  Remark       8939 non-null    object
 12  Checker      599108 non-null  object
 13  CheckTime    599108 non-null  object
 14  Confirmer    599108 non-null  object
 15  ConfirmTime  599108 non-null  object
dtypes: int64(4), object(12)
memory usage: 73.5+ MB
None

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 994078 entries, 0 to 994077
Data columns (total 16 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   Id           994078 non-null  int64 
 1   Genericname  994078 non-null  object
 2   Diagnosis    994078 non-null  object
 3   InquiryId    994078 non-null  object
 4   CreateTime   994078 non-null  object
 5   UpdateTime   2329 non-null    object
 6   InqCount     994078 non-null  int64 
 7   Level        994078 non-null  int64 
 8   UpdateBy     8 non-null       object
 9   Creater      994078 non-null  object
 10  Platform     981809 non-null  object
 11  Remark       2654 non-null    object
 12  Checker      981809 non-null  object
 13  CheckTime    981809 non-null  object
 14  Confirmer    981809 non-null  object
 15  ConfirmTime  981809 non-null  object
dtypes: int64(3), object(13)
memory usage: 121.3+ MB
None

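Given that InquiryId is int64 in the second file and a string column in the other two, a possible workaround (a sketch only, not verified against 0.20.31) is to scan each file separately, cast the mismatching columns to the dtype of a reference file, and concatenate the lazy frames instead of globbing. `pl.read_parquet_schema`, `pl.scan_parquet`, `cast`, and `pl.concat` are existing Polars calls; `columns_to_cast` and `scan_with_unified_schema` are hypothetical helper names, and whether this avoids the statistics panic is an assumption.

```python
def columns_to_cast(schema, target_schema):
    """Return the columns in `schema` whose dtype differs from `target_schema`."""
    return [
        name
        for name, dtype in schema.items()
        if name in target_schema and str(dtype) != str(target_schema[name])
    ]


def scan_with_unified_schema(paths, reference_path):
    """Hypothetical helper: scan each file and cast it to the reference schema."""
    import polars as pl  # imported lazily so columns_to_cast stays dependency-free

    target = pl.read_parquet_schema(reference_path)
    frames = []
    for path in paths:
        to_cast = columns_to_cast(pl.read_parquet_schema(path), target)
        lf = pl.scan_parquet(path)
        if to_cast:
            # Cast only the mismatching columns, e.g. InquiryId from Int64 to String.
            lf = lf.with_columns([pl.col(name).cast(target[name]) for name in to_cast])
        frames.append(lf)
    return pl.concat(frames)


# Usage (assumes `files` is the list defined above):
# lf = scan_with_unified_schema(files, reference_path=files[0])
# lf.select(["Genericname", "Diagnosis"]).drop_nulls().collect()
```

Because each file is scanned on its own, its row-group statistics are interpreted with that file's own schema rather than the schema of the first file in the glob.
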
>>> pl.scan_parquet([files[0], files[2]]).collect()
shape: (1_783_958, 16)
┌─────────┬───────────────────┬───────────────────┬───────────────────┬───┬──────────────┬───────────┬──────────────┬───────────────────┐
│ Id      ┆ Genericname       ┆ Diagnosis         ┆ InquiryId         ┆ … ┆ Checker      ┆ CheckTime ┆ Confirmer    ┆ ConfirmTime       │
│ ---     ┆ ---               ┆ ---               ┆ ---               ┆   ┆ ---          ┆ ---       ┆ ---          ┆ ---               │
│ i64     ┆ str               ┆ str               ┆ str               ┆   ┆ str          ┆ str       ┆ str          ┆ str               │
╞═════════╪═══════════════════╪═══════════════════╪═══════════════════╪═══╪══════════════╪═══════════╪══════════════╪═══════════════════╡
│ 1       ┆ 磷酸奥司他韦颗粒  ┆ 预防性抗流行性感  ┆ 0                 ┆ … ┆ null         ┆ null      ┆ null         ┆ null              │
│         ┆                   ┆ 冒治疗            ┆                   ┆   ┆              ┆           ┆              ┆                   │
│ 2       ┆ 硝苯地平控释片    ┆ 原发性高血压      ┆ 0                 ┆ … ┆ null         ┆ null      ┆ null         ┆ null              │
│ 4       ┆ 富马酸替诺福韦二  ┆ 慢性乙型肝炎      ┆ 0                 ┆ … ┆ null         ┆ null      ┆ null         ┆ null              │
│         ┆ 吡呋酯片          ┆                   ┆                   ┆   ┆              ┆           ┆              ┆                   │
│ 5       ┆ 苯磺酸氨氯地平片  ┆ 原发性高血压      ┆ 0                 ┆ … ┆ null         ┆ null      ┆ null         ┆ null              │
│ 6       ┆ 布地奈德福莫特罗  ┆ 哮喘              ┆ 0                 ┆ … ┆ null         ┆ null      ┆ null         ┆ null              │
│         ┆ 粉吸入剂          ┆                   ┆                   ┆   ┆              ┆           ┆              ┆                   │
│ …       ┆ …                 ┆ …                 ┆ …                 ┆ … ┆ …            ┆ …         ┆ …            ┆ …                 │
│ 2428039 ┆ 达格列净片;复方酮 ┆ 糖尿病;头皮糠疹   ┆ 14346925769166766 ┆ … ┆ 智能审方判断 ┆ 2021/9/6  ┆ 智能审方判断 ┆ 2021/9/6 9:39:23  │
│         ┆ 康唑发用洗剂      ┆                   ┆ 96                ┆   ┆              ┆ 9:39:23   ┆              ┆                   │
│ 2428040 ┆ 六神丸;维生素A软  ┆ 痤疮;咽炎         ┆ 3862545784759552  ┆ … ┆ 智能审方判断 ┆ 2021/9/6  ┆ 智能审方判断 ┆ 2021/9/6 9:39:28  │
│         ┆ 胶囊              ┆                   ┆                   ┆   ┆              ┆ 9:39:28   ┆              ┆                   │
│ 2428041 ┆ 甲钴胺片;腰痛宁胶 ┆ 腰椎病            ┆ 3862545878415360  ┆ … ┆ 智能审方判断 ┆ 2021/9/6  ┆ 智能审方判断 ┆ 2021/9/6 9:39:35  │
│         ┆ 囊;依托考昔片     ┆                   ┆                   ┆   ┆              ┆ 9:39:35   ┆              ┆                   │
│ 2428042 ┆ 急支糖浆;盐酸氨溴 ┆ 上呼吸道感染      ┆ 4347993676906752  ┆ … ┆ 智能审方判断 ┆ 2021/9/6  ┆ 智能审方判断 ┆ 2021/9/6 9:39:36  │
│         ┆ 索糖浆            ┆                   ┆                   ┆   ┆              ┆ 9:39:36   ┆              ┆                   │
│ 2428043 ┆ 桂枝茯苓丸(浓缩水 ┆ 闭经;血瘀证       ┆ 14346923868894577 ┆ … ┆ 智能审方判断 ┆ 2021/9/6  ┆ 智能审方判断 ┆ 2021/9/6 9:39:41  │
│         ┆ 丸);血府逐瘀颗粒  ┆                   ┆ 53                ┆   ┆              ┆ 9:39:41   ┆              ┆                   │
└─────────┴───────────────────┴───────────────────┴───────────────────┴───┴──────────────┴───────────┴──────────────┴───────────────────┘

>>> pl.scan_parquet([files[0], files[2]]).drop_nulls().collect()
shape: (48, 16)
┌────────┬───────────────────┬───────────────────┬───────────────────┬───┬──────────────┬────────────┬──────────────┬───────────────────┐
│ Id     ┆ Genericname       ┆ Diagnosis         ┆ InquiryId         ┆ … ┆ Checker      ┆ CheckTime  ┆ Confirmer    ┆ ConfirmTime       │
│ ---    ┆ ---               ┆ ---               ┆ ---               ┆   ┆ ---          ┆ ---        ┆ ---          ┆ ---               │
│ i64    ┆ str               ┆ str               ┆ str               ┆   ┆ str          ┆ str        ┆ str          ┆ str               │
╞════════╪═══════════════════╪═══════════════════╪═══════════════════╪═══╪══════════════╪════════════╪══════════════╪═══════════════════╡
│ 210034 ┆ 利拉鲁肽注射液;缬 ┆ 缺血性脑血管病;糖 ┆ 3692776318482176  ┆ … ┆ 陈佩斯       ┆ 2020/5/15  ┆ 韩丽琴       ┆ 2020/5/15         │
│        ┆ 沙坦氢氯噻嗪胶囊; ┆ 尿病;原发性高血压 ┆                   ┆   ┆              ┆ 4:04:59    ┆              ┆ 19:26:42          │
│        ┆ 银杏叶提取物片    ┆                   ┆                   ┆   ┆              ┆            ┆              ┆                   │
│ 210734 ┆ 阿奇霉素分散片;枸 ┆ 高血压病;男性勃起 ┆ 3692783911425792  ┆ … ┆ 陈佩斯       ┆ 2020/5/15  ┆ 唐明嵩       ┆ 2020/5/15         │
│        ┆ 橼酸西地那非片;马 ┆ 障碍;软组织感染   ┆                   ┆   ┆              ┆ 2:57:29    ┆              ┆ 20:35:13          │
│        ┆ 来酸依那普利片;双 ┆                   ┆                   ┆   ┆              ┆            ┆              ┆                   │
│        ┆ 氯芬酸…           ┆                   ┆                   ┆   ┆              ┆            ┆              ┆                   │
│ 211476 ┆ 复方酮康唑发用洗  ┆ 肺动脉高压;甲状腺 ┆ 3692745775624705  ┆ … ┆ 陈佩斯       ┆ 2020/5/15  ┆ 唐明嵩       ┆ 2020/5/15         │
│        ┆ 剂;枸橼酸西地那非 ┆ 功能减退症;心绞痛 ┆                   ┆   ┆              ┆ 4:59:10    ┆              ┆ 22:01:27          │
│        ┆ 片;通脉颗粒;左甲  ┆ ;脂溢性皮炎       ┆                   ┆   ┆              ┆            ┆              ┆                   │
│        ┆ 状腺素钠…         ┆                   ┆                   ┆   ┆              ┆            ┆              ┆                   │
│ 214903 ┆ 苯磺酸左氨氯地平  ┆ 高血压病;男性勃起 ┆ 12607392863288279 ┆ … ┆ 唐明嵩       ┆ 2020/5/16  ┆ 吴雪静       ┆ 2020/5/16         │
│        ┆ 片;枸橼酸西地那非 ┆ 障碍              ┆ 22                ┆   ┆              ┆ 16:17:25   ┆              ┆ 17:19:47          │
│        ┆ 片                ┆                   ┆                   ┆   ┆              ┆            ┆              ┆                   │
│ 220492 ┆ 富马酸比索洛尔片; ┆ 不稳定性心绞痛;男 ┆ 3693396635109120  ┆ … ┆ 陈祉羽       ┆ 2020/5/17  ┆ 苏锡茵       ┆ 2020/5/18 8:05:50 │
│        ┆ 枸橼酸西地那非片  ┆ 性勃起障碍        ┆                   ┆   ┆              ┆ 14:54:51   ┆              ┆                   │
│ …      ┆ …                 ┆ …                 ┆ …                 ┆ … ┆ …            ┆ …          ┆ …            ┆ …                 │
│ 743981 ┆ 阿司匹林肠溶片;胱 ┆ 头皮糠疹;脱发     ┆ 13166628469894104 ┆ … ┆ 智能审方判断 ┆ 2020/10/15 ┆ 智能审方判断 ┆ 2020/10/15        │
│        ┆ 氨酸片            ┆                   ┆ 02                ┆   ┆              ┆ 16:57:01   ┆              ┆ 16:57:01          │
│ 745923 ┆ 坎地沙坦酯片;马来 ┆ 高血压病          ┆ 3724199280330496  ┆ … ┆ 翁庸徳       ┆ 2020/10/10 ┆ 苏锡茵       ┆ 2020/10/15        │
│        ┆ 酸依那普利片      ┆                   ┆                   ┆   ┆              ┆ 18:19:03   ┆              ┆ 22:40:58          │
│ 751981 ┆ 酚酞片;牛黄解毒片 ┆ 便秘病;热毒证     ┆ 12906196614272082 ┆ … ┆ 黄羡         ┆ 2020/10/12 ┆ 翁庸徳       ┆ 2020/10/16        │
│        ┆                   ┆                   ┆ 79                ┆   ┆              ┆ 9:25:59    ┆              ┆ 22:29:34          │
│ 809815 ┆ 地特胰岛素注射液; ┆ 1型糖尿病;高血压  ┆ 3713215386539776  ┆ … ┆ 翁庸徳       ┆ 2020/10/28 ┆ 苏锡茵       ┆ 2020/10/29        │
│        ┆ 厄贝沙坦片;罗红霉 ┆ 病;支气管炎       ┆                   ┆   ┆              ┆ 16:21:42   ┆              ┆ 11:17:13          │
│        ┆ 素氨溴索片        ┆                   ┆                   ┆   ┆              ┆            ┆              ┆                   │
│ 811401 ┆ 非诺贝特胶囊;门冬 ┆ 1型糖尿病;高脂血  ┆ 3732088374658816  ┆ … ┆ 翁庸徳       ┆ 2020/10/28 ┆ 苏锡茵       ┆ 2020/10/29        │
│        ┆ 胰岛素注射液      ┆ 症                ┆                   ┆   ┆              ┆ 16:35:08   ┆              ┆ 17:47:00          │
└────────┴───────────────────┴───────────────────┴───────────────────┴───┴──────────────┴────────────┴──────────────┴───────────────────┘

>>> pl.scan_parquet([files[0], files[1]]).drop_nulls().collect()
thread 'polars-1' panicked at /rustc/ab14f944afe4234db378ced3801e637eae6c0f30/library/core/src/ops/function.rs:250:5:
Expected Statistics to be BinaryStatistics, found PrimitiveStatistics<i64> instead
stack backtrace:
   0:        0x1113238c7 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h5b162cab46f344a5
   1:        0x10eb8239b - core::fmt::write::h4a73583a3886d3b0
   2:        0x1112f4d9e - std::io::Write::write_fmt::h8846f8d604484bad
   3:        0x1113279d1 - std::sys_common::backtrace::print::h7eceb11702f657b6
   4:        0x111327269 - std::panicking::default_hook::{{closure}}::he179a4d2e5ce811d
   5:        0x111328f93 - std::panicking::rust_panic_with_hook::hbfe888ce2af6ee0d
   6:        0x111327d12 - std::panicking::begin_panic_handler::{{closure}}::h2461a6874e053e43
   7:        0x111327c69 - std::sys_common::backtrace::__rust_end_short_backtrace::h1c49106eba8c7b96
   8:        0x111327c56 - _rust_begin_unwind
   9:        0x1114e3812 - core::panicking::panic_fmt::ha4b3f782c24c0530
  10:        0x1108767ce - core::ops::function::FnOnce::call_once::h5067212405562c9f
  11:        0x110872ffe - polars_parquet::arrow::read::statistics::push::h9d0d5787bd19f3b8
  12:        0x10fa59d44 - polars_io::parquet::read::predicates::read_this_row_group::haa1c0f7f42e29dac
  13:        0x10fa5c39d - polars_io::parquet::read::read_impl::rg_to_dfs::h372c4af67b7d657a
  14:        0x10febe335 - rayon::iter::plumbing::bridge_producer_consumer::helper::h1e5b6564c8c35e3b
  15:        0x10fec0227 - <rayon_core::job::StackJob<L,F,R> as rayon_core::job::Job>::execute::hb75d5f7e45293dab
  16:        0x111831120 - rayon_core::registry::WorkerThread::wait_until_cold::hfdbee4fa4f1f2f01
  17:        0x1110ce84f - std::sys_common::backtrace::__rust_begin_short_backtrace::h2512fac84c638486
  18:        0x1110ce63c - core::ops::function::FnOnce::call_once{{vtable.shim}}::h47bee49a5d47d169
  19:        0x11132c24b - std::sys::pal::unix::thread::Thread::new::thread_start::h176c25cd13ced921
  20:     0x7ff803c5d18b - __pthread_start
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/user/git/med-data/.venv/lib/python3.10/site-packages/polars/lazyframe/frame.py", line 1967, in collect
    return wrap_df(ldf.collect(callback))
pyo3_runtime.PanicException: Expected Statistics to be BinaryStatistics, found PrimitiveStatistics<i64> instead

But does that mean that even when specific columns are selected, the schema of all columns is still checked?

pl.scan_parquet("data/*.parquet").select(["Genericname", "Diagnosis"]).drop_nulls().collect()

Is the error `polars.exceptions.ComputeError: not implemented: reading parquet type Int64 to Utf8View still not implemented` related? Both error messages make it hard for me to locate the problem.
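One way to locate which file and column disagree is to compare the per-file schemas before scanning the glob. The sketch below is only an illustration, not something proposed in this thread: `find_dtype_conflicts` and `load_schemas` are hypothetical helper names, and the `data` directory is assumed from the glob pattern used above. `pl.read_parquet_schema` reads a file's schema without loading its data.

```python
from collections import defaultdict
from pathlib import Path


def find_dtype_conflicts(schemas):
    """Return {column: {dtype_name: [files]}} for columns whose dtype differs between files.

    `schemas` maps a file name to a {column: dtype} mapping.
    """
    seen = defaultdict(lambda: defaultdict(list))
    for file, schema in schemas.items():
        for column, dtype in schema.items():
            # Compare dtypes by their string form, e.g. "String" vs "Int64".
            seen[column][str(dtype)].append(file)
    return {
        column: dict(by_dtype)
        for column, by_dtype in seen.items()
        if len(by_dtype) > 1
    }


def load_schemas(directory="data"):
    """Read the schema of every *.parquet file in `directory` (hypothetical helper)."""
    import polars as pl  # imported lazily so the comparison helper has no dependency

    return {
        str(path): pl.read_parquet_schema(path)
        for path in sorted(Path(directory).glob("*.parquet"))
    }
```

Calling `find_dtype_conflicts(load_schemas("data"))` returns an empty dict when all files agree. Otherwise each entry shows which files store a given column under which dtype, which points directly at the file to fix or cast.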

@coastalwhite changed the title from "Panic when drop nulls" to "Panic when mismatching types between glob files" on Jul 1, 2024
@ritchie46
Member

@failable if this still occurs after #17321, can you open a new issue with a proper reproducible example? We cannot take action on this one.

@failable
Author

failable commented Jul 2, 2024

@ritchie46 Thanks, it seems the issue has been fixed now!


@failable
Author

failable commented Jul 2, 2024

When will we have a release? It took me an hour to build the main branch on my local machine.
