fix: Do not convert nparray into list before wrapping into pandas.Series #23

OlegWock · 2025-11-07T08:28:14Z

Constructing a pandas Series from a list of trino.types.NamedRowTuple objects fails, but works fine when the NumPy array is passed directly.

The underlying cause is the incorrect implementation of __getattr__ in NamedRowTuple, which leads NumPy/pandas to believe the object supports the array struct protocol:

- For normal objects: accessing __array_struct__ raises AttributeError, so hasattr() returns False.
- For NamedRowTuple: accessing __array_struct__ goes through __getattr__, which returns None instead of raising AttributeError, so hasattr() returns True.
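The hasattr() difference described above can be reproduced without trino or NumPy. The sketch below is only an illustration: `LeakyRow` and `StrictRow` are hypothetical classes written for this example, not the actual NamedRowTuple source. They assume a tuple subclass whose `__getattr__` resolves field names and, in the leaky variant, falls back to returning None.

```python
class LeakyRow(tuple):
    """Hypothetical row type whose __getattr__ never raises AttributeError."""

    def __new__(cls, values, names):
        row = super().__new__(cls, values)
        row._names = list(names)
        return row

    def __getattr__(self, name):
        # Only called when normal attribute lookup has already failed.
        if name in self._names:
            return self[self._names.index(name)]
        # Returning None here makes hasattr() report True for ANY name,
        # including protocol probes such as __array_struct__.
        return None


class StrictRow(LeakyRow):
    """Same hypothetical row type, but unknown names raise AttributeError."""

    def __getattr__(self, name):
        if name in self._names:
            return self[self._names.index(name)]
        raise AttributeError(name)


leaky = LeakyRow([1, "Alice"], ["id", "name"])
strict = StrictRow([1, "Alice"], ["id", "name"])

print(hasattr(leaky, "__array_struct__"))   # True
print(hasattr(strict, "__array_struct__"))  # False
```

A consumer that probes for `__array_struct__` with hasattr() or getattr() would see the leaky row as implementing the protocol and then find None instead of a valid value, which is consistent with the "invalid __array_struct__" error in the traceback below.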

Traceback

Traceback (most recent call last):
  File "/toolkit-cache/1.1.1/python3.10/kernel-libs/lib/python3.10/site-packages/deepnote_toolkit/dataframe_utils.py", line 78, in dataframe_formatter
    result = _describe_dataframe(native_df, browse_spec)
  File "/toolkit-cache/1.1.1/python3.10/kernel-libs/lib/python3.10/site-packages/deepnote_toolkit/dataframe_utils.py", line 122, in _describe_dataframe
    columns_with_stats = browse_result.processed_df.analyze_columns(
  File "/toolkit-cache/1.1.1/python3.10/kernel-libs/lib/python3.10/site-packages/deepnote_toolkit/ocelots/dataframe.py", line 337, in analyze_columns
    return self._implementation.analyze_columns(color_scale_column_names)
  File "/toolkit-cache/1.1.1/python3.10/kernel-libs/lib/python3.10/site-packages/deepnote_toolkit/ocelots/pandas/implementation.py", line 321, in analyze_columns
    return analyze_columns(self._df, color_scale_column_names)
  File "/toolkit-cache/1.1.1/python3.10/kernel-libs/lib/python3.10/site-packages/deepnote_toolkit/ocelots/pandas/analyze.py", line 164, in analyze_columns
    columns[i].stats.categories = _get_categories(np.array(column))
  File "/toolkit-cache/1.1.1/python3.10/kernel-libs/lib/python3.10/site-packages/deepnote_toolkit/ocelots/pandas/analyze.py", line 27, in _get_categories
    pandas_series = pd.Series(np_array.tolist())
  File "/root/venv/lib/python3.10/site-packages/pandas/core/series.py", line 512, in __init__
    data = sanitize_array(data, index, dtype, copy)
  File "/root/venv/lib/python3.10/site-packages/pandas/core/construction.py", line 653, in sanitize_array
    subarr = maybe_convert_platform(data)
  File "/root/venv/lib/python3.10/site-packages/pandas/core/dtypes/cast.py", line 126, in maybe_convert_platform
    arr = construct_1d_object_array_from_listlike(values)
  File "/root/venv/lib/python3.10/site-packages/pandas/core/dtypes/cast.py", line 1565, in construct_1d_object_array_from_listlike
    result[:] = values
ValueError: invalid __array_struct__

Summary by CodeRabbit

Release Notes

Bug Fixes
- Improved data type inference and missing value handling in column analysis functionality.
Tests
- Added comprehensive test coverage for complex data type handling in column analysis, including scenarios with multiple entries and missing values.

linear · 2025-11-07T08:28:17Z

BLU-5137 Column analysis fails on `trino.types.NamedRowTuple`

coderabbitai · 2025-11-07T08:28:24Z

📝 Walkthrough

The pull request changes _get_categories in the pandas analysis module to build the pandas Series directly from the NumPy array (pd.Series(np_array)) instead of from np_array.tolist(), which can alter dtype inference and NA handling; all other logic, exception handling, and return structure are unchanged. Tests were added to validate analyze_columns with Trino NamedRowTuple values, covering missing values, multiple/repeated entries, and category aggregation including an "others" bucket.
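As a rough sketch of the call shape described in the walkthrough, the snippet below wraps an object-dtype NumPy array in a Series directly, with no intermediate list. `build_series` is a hypothetical stand-in and not the real body of _get_categories. Plain tuples are used in place of NamedRowTuple so the example runs without trino installed, which means it shows the construction path only and does not reproduce the original failure.

```python
import numpy as np
import pandas as pd


def build_series(np_array: np.ndarray) -> pd.Series:
    """Hypothetical helper: wrap the array as-is instead of np_array.tolist()."""
    return pd.Series(np_array)


# Fill the object array element by element so each tuple stays a single cell.
values = np.empty(3, dtype=object)
values[0] = ("1", "Alice")
values[1] = None
values[2] = ("2", "Bob")

series = build_series(values)
print(series.dtype)         # object
print(series.isna().sum())  # 1
```

Because the array already has dtype=object, pandas does not need to rebuild an object array from a Python list, which is the step (construct_1d_object_array_from_listlike) where the traceback shows the error being raised.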

Pre-merge checks

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately describes the main change: removing the .tolist() conversion before passing the numpy array to pandas.Series, which directly addresses the fix.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

github-actions · 2025-11-07T08:28:46Z

📦 Python package built successfully!

Version: 1.1.1.dev4+e039330
Wheel: deepnote_toolkit-1.1.1.dev4+e039330-py3-none-any.whl

Install:

pip install "deepnote-toolkit @ https://deepnote-staging-runtime-artifactory.s3.amazonaws.com/deepnote-toolkit-packages/1.1.1.dev4%2Be039330/deepnote_toolkit-1.1.1.dev4%2Be039330-py3-none-any.whl"

codecov · 2025-11-07T08:30:00Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 75.39%. Comparing base (358138f) to head (cc011eb).
⚠️ Report is 1 commits behind head on main.
✅ All tests successful. No failed tests found.

Additional details and impacted files

@@            Coverage Diff             @@
##             main      #23      +/-   ##
==========================================
- Coverage   76.94%   75.39%   -1.55%     
==========================================
  Files          99       99              
  Lines        5512     5625     +113     
  Branches      753      784      +31     
==========================================
  Hits         4241     4241              
- Misses       1271     1384     +113

coderabbitai

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

Disabled knowledge base sources:

Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 4f1dfbd and cc011eb.

📒 Files selected for processing (1)

tests/unit/test_analyze_columns_pandas.py (2 hunks)

🧰 Additional context used

🧬 Code graph analysis (1)

tests/unit/test_analyze_columns_pandas.py (1)

deepnote_toolkit/ocelots/pandas/analyze.py (1)

analyze_columns (102-200)

🪛 Ruff (0.14.3)

tests/unit/test_analyze_columns_pandas.py

599-599: Use a regular assert instead of unittest-style assertEqual

Replace assertEqual(...) with assert ...

(PT009)

600-600: Use a regular assert instead of unittest-style assertEqual

Replace assertEqual(...) with assert ...

(PT009)

601-601: Use a regular assert instead of unittest-style assertEqual

Replace assertEqual(...) with assert ...

(PT009)

602-602: Use a regular assert instead of unittest-style assertIsNotNone

Replace assertIsNotNone(...) with assert ...

(PT009)

603-603: Use a regular assert instead of unittest-style assertIsNotNone

Replace assertIsNotNone(...) with assert ...

(PT009)

604-604: Use a regular assert instead of unittest-style assertIsInstance

Replace assertIsInstance(...) with assert ...

(PT009)

605-605: Use a regular assert instead of unittest-style assertGreater

Replace assertGreater(...) with assert ...

(PT009)

607-607: Use a regular assert instead of unittest-style assertIn

Replace assertIn(...) with assert ...

(PT009)

608-608: Use a regular assert instead of unittest-style assertIn

Replace assertIn(...) with assert ...

(PT009)

627-627: Use a regular assert instead of unittest-style assertEqual

Replace assertEqual(...) with assert ...

(PT009)

628-628: Use a regular assert instead of unittest-style assertIsNotNone

Replace assertIsNotNone(...) with assert ...

(PT009)

629-629: Use a regular assert instead of unittest-style assertIsNotNone

Replace assertIsNotNone(...) with assert ...

(PT009)

632-632: Use a regular assert instead of unittest-style assertIn

Replace assertIn(...) with assert ...

(PT009)

637-637: Use a regular assert instead of unittest-style assertEqual

Replace assertEqual(...) with assert ...

(PT009)

653-653: Use a regular assert instead of unittest-style assertEqual

Replace assertEqual(...) with assert ...

(PT009)

654-654: Use a regular assert instead of unittest-style assertIsNotNone

Replace assertIsNotNone(...) with assert ...

(PT009)

655-655: Use a regular assert instead of unittest-style assertIsNotNone

Replace assertIsNotNone(...) with assert ...

(PT009)

656-656: Use a regular assert instead of unittest-style assertGreaterEqual

Replace assertGreaterEqual(...) with assert ...

(PT009)

657-657: Use a regular assert instead of unittest-style assertLessEqual

Replace assertLessEqual(...) with assert ...

(PT009)

660-660: Use a regular assert instead of unittest-style assertTrue

Replace assertTrue(...) with assert ...

(PT009)
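For reference, the PT009 findings above all ask for the same mechanical rewrite. A minimal sketch follows, using a hypothetical `count_categories` helper that is not part of the toolkit; each unittest-style call is shown as a comment above its plain-assert equivalent.

```python
from collections import Counter


def count_categories(values):
    """Hypothetical helper: count non-missing values per category."""
    return Counter(v for v in values if v is not None)


def test_count_categories_plain_asserts():
    counts = count_categories(["a", "b", "a", None])

    # self.assertEqual(counts["a"], 2)
    assert counts["a"] == 2
    # self.assertIn("b", counts)
    assert "b" in counts
    # self.assertIsNotNone(counts.get("a"))
    assert counts.get("a") is not None
    # self.assertIsInstance(counts, dict)
    assert isinstance(counts, dict)
    # self.assertGreater(len(counts), 1)
    assert len(counts) > 1
```

Under pytest, plain assert statements get assertion rewriting, so a failure still reports both sides of the comparison.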

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)

GitHub Check: Build and push artifacts for Python 3.11
GitHub Check: Build and push artifacts for Python 3.9
GitHub Check: Build and push artifacts for Python 3.10
GitHub Check: Build and push artifacts for Python 3.13
GitHub Check: Build and push artifacts for Python 3.12
GitHub Check: Test - Python 3.9
GitHub Check: Test - Python 3.10
GitHub Check: Test - Python 3.11

🔇 Additional comments (3)

tests/unit/test_analyze_columns_pandas.py (3)

5-5: LGTM!

Import is necessary for the new Trino-specific tests.

610-637: LGTM!

Test properly validates missing value handling with NamedRowTuple objects.

639-660: LGTM!

Test correctly validates category aggregation with "others" bucket for many unique NamedRowTuple values.

deepnote-bot · 2025-11-07T08:40:31Z

🚀 Review App Deployment Started

📝 Description	🌐 Link / Info
🌍 Review application	ra-23
🔑 Sign-in URL	Click to sign-in
📊 Application logs	View logs
🔄 Actions	Click to redeploy
🚀 ArgoCD deployment	View deployment
⏰ Last deployed	2025-11-07 08:40:28 (UTC)
📜 Deployed commit	`c47eb74206aeb574189ed0bef14c2d7cbfaf190e`
🛠️ Toolkit version	`e039330`

OlegWock · 2025-11-07T08:57:25Z

You can test this with the following Trino SQL query:

SELECT CAST(ROW(1, 'Alice') AS ROW(id INTEGER, name VARCHAR)) AS user
UNION ALL
SELECT CAST(ROW(2, 'Bob')   AS ROW(id INTEGER, name VARCHAR)) AS user

Or by constructing the DataFrame manually:

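from trino.types import NamedRowTuple
import pandas as pd
import numpy as np

# Two rows matching ROW(id INTEGER, name VARCHAR) from the SQL query above
row1 = NamedRowTuple(values=[1, "Alice"], names=["id", "name"], types=["integer", "varchar"])
row2 = NamedRowTuple(values=[2, "Bob"], names=["id", "name"], types=["integer", "varchar"])

# Fill an object array element by element so each row stays a single cell
np_array = np.empty(2, dtype=object)
np_array[0] = row1
np_array[1] = row2
df = pd.DataFrame({"col1": np_array})

# Display the DataFrame in a notebook to trigger column analysis
df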

fix: Do not convert nparray into list before wrapping into pandas.Series

4f1dfbd

Format

cc011eb

coderabbitai bot requested changes Nov 7, 2025

coderabbitai bot approved these changes Nov 7, 2025

OlegWock marked this pull request as ready for review November 7, 2025 08:57

OlegWock requested a review from a team as a code owner November 7, 2025 08:57

m1so approved these changes Nov 7, 2025

m1so merged commit 663dce1 into main Nov 7, 2025
33 of 34 checks passed

m1so deleted the oleh/blu-5137-column-analysis-fails-on-trinotypesnamedrowtuple branch November 7, 2025 14:00

4 participants
