@OlegWock OlegWock commented Nov 7, 2025

Constructing a pandas Series from a list of trino.types.NamedRowTuple objects fails, but it works fine if we pass the NumPy array directly.

This is also caused by the incorrect implementation of __getattr__ in NamedRowTuple, which leads NumPy/pandas to believe the object supports the array struct protocol:

  1. For normal objects: accessing __array_struct__ raises AttributeError, so hasattr() returns False.
  2. For NamedRowTuple: accessing __array_struct__ goes through __getattr__, which returns None instead of raising AttributeError, so hasattr() returns True.
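
The difference between the two cases can be sketched without Trino installed. The classes below are hypothetical stand-ins, not the actual trino.types.NamedRowTuple source; they only assume a __getattr__ that falls through and therefore returns None implicitly:

```python
class StrictRow(tuple):
    """Hypothetical stand-in: unknown attributes raise AttributeError."""

    def __getattr__(self, name):
        raise AttributeError(name)


class LeakyRow(tuple):
    """Hypothetical stand-in: unknown attributes fall through and yield None."""

    def __getattr__(self, name):
        # No raise and no explicit return, so the lookup "succeeds" with None.
        pass


strict = StrictRow((1, "Alice"))
leaky = LeakyRow((1, "Alice"))

# hasattr() only treats AttributeError as "attribute is missing".
strict_has_struct = hasattr(strict, "__array_struct__")
leaky_has_struct = hasattr(leaky, "__array_struct__")
leaky_struct_value = getattr(leaky, "__array_struct__")

print(strict_has_struct)   # False
print(leaky_has_struct)    # True
print(leaky_struct_value)  # None
```

A consumer that probes for __array_struct__ would therefore see the attribute as present on the second object and then receive None, which is not a valid array struct.
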
Traceback
Traceback (most recent call last):
  File "/toolkit-cache/1.1.1/python3.10/kernel-libs/lib/python3.10/site-packages/deepnote_toolkit/dataframe_utils.py", line 78, in dataframe_formatter
    result = _describe_dataframe(native_df, browse_spec)
  File "/toolkit-cache/1.1.1/python3.10/kernel-libs/lib/python3.10/site-packages/deepnote_toolkit/dataframe_utils.py", line 122, in _describe_dataframe
    columns_with_stats = browse_result.processed_df.analyze_columns(
  File "/toolkit-cache/1.1.1/python3.10/kernel-libs/lib/python3.10/site-packages/deepnote_toolkit/ocelots/dataframe.py", line 337, in analyze_columns
    return self._implementation.analyze_columns(color_scale_column_names)
  File "/toolkit-cache/1.1.1/python3.10/kernel-libs/lib/python3.10/site-packages/deepnote_toolkit/ocelots/pandas/implementation.py", line 321, in analyze_columns
    return analyze_columns(self._df, color_scale_column_names)
  File "/toolkit-cache/1.1.1/python3.10/kernel-libs/lib/python3.10/site-packages/deepnote_toolkit/ocelots/pandas/analyze.py", line 164, in analyze_columns
    columns[i].stats.categories = _get_categories(np.array(column))
  File "/toolkit-cache/1.1.1/python3.10/kernel-libs/lib/python3.10/site-packages/deepnote_toolkit/ocelots/pandas/analyze.py", line 27, in _get_categories
    pandas_series = pd.Series(np_array.tolist())
  File "/root/venv/lib/python3.10/site-packages/pandas/core/series.py", line 512, in __init__
    data = sanitize_array(data, index, dtype, copy)
  File "/root/venv/lib/python3.10/site-packages/pandas/core/construction.py", line 653, in sanitize_array
    subarr = maybe_convert_platform(data)
  File "/root/venv/lib/python3.10/site-packages/pandas/core/dtypes/cast.py", line 126, in maybe_convert_platform
    arr = construct_1d_object_array_from_listlike(values)
  File "/root/venv/lib/python3.10/site-packages/pandas/core/dtypes/cast.py", line 1565, in construct_1d_object_array_from_listlike
    result[:] = values
ValueError: invalid __array_struct__

Summary by CodeRabbit

Release Notes

  • Bug Fixes

    • Improved data type inference and missing value handling in column analysis functionality.
  • Tests

    • Added comprehensive test coverage for complex data type handling in column analysis, including scenarios with multiple entries and missing values.

linear bot commented Nov 7, 2025

coderabbitai bot commented Nov 7, 2025

📝 Walkthrough

The pull request changes _get_categories in the pandas analysis module to build the pandas Series directly from the NumPy array (pd.Series(np_array)) instead of from np_array.tolist(), which can alter dtype inference and NA handling; all other logic, exception handling, and return structure are unchanged. Tests were added to validate analyze_columns with Trino NamedRowTuple values, covering missing values, multiple/repeated entries, and category aggregation including an "others" bucket.
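
As a rough illustration of how the two constructions can differ in dtype inference and NA handling, the sketch below uses plain integers and None rather than NamedRowTuple values. The sample data is made up for the example and is not taken from the PR's tests:

```python
import numpy as np
import pandas as pd

# Hypothetical column data: an object array with one missing value.
np_array = np.empty(3, dtype=object)
np_array[0] = 1
np_array[1] = None
np_array[2] = 2

# Old path: pandas re-infers the dtype from a plain Python list.
from_list = pd.Series(np_array.tolist())

# New path: the object dtype of the array is kept as-is.
from_array = pd.Series(np_array)

print(from_list.dtype)   # float64, the None became NaN
print(from_array.dtype)  # object, the None is preserved
```

Passing the array directly also means pandas does not have to rebuild an object array element by element from the list, which is the step that fails in the traceback above.
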

Pre-merge checks

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately describes the main change: removing the .tolist() conversion before passing the numpy array to pandas.Series, which directly addresses the fix. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

github-actions bot commented Nov 7, 2025

📦 Python package built successfully!

  • Version: 1.1.1.dev4+e039330
  • Wheel: deepnote_toolkit-1.1.1.dev4+e039330-py3-none-any.whl
  • Install:
    pip install "deepnote-toolkit @ https://deepnote-staging-runtime-artifactory.s3.amazonaws.com/deepnote-toolkit-packages/1.1.1.dev4%2Be039330/deepnote_toolkit-1.1.1.dev4%2Be039330-py3-none-any.whl"

codecov bot commented Nov 7, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 75.39%. Comparing base (358138f) to head (cc011eb).
⚠️ Report is 1 commits behind head on main.
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #23      +/-   ##
==========================================
- Coverage   76.94%   75.39%   -1.55%     
==========================================
  Files          99       99              
  Lines        5512     5625     +113     
  Branches      753      784      +31     
==========================================
  Hits         4241     4241              
- Misses       1271     1384     +113     


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4f1dfbd and cc011eb.

📒 Files selected for processing (1)
  • tests/unit/test_analyze_columns_pandas.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
tests/unit/test_analyze_columns_pandas.py (1)
deepnote_toolkit/ocelots/pandas/analyze.py (1)
  • analyze_columns (102-200)
🪛 Ruff (0.14.3)
tests/unit/test_analyze_columns_pandas.py

  • 599-599: Use a regular assert instead of unittest-style assertEqual. Replace assertEqual(...) with assert ... (PT009)
  • 600-600: Use a regular assert instead of unittest-style assertEqual. Replace assertEqual(...) with assert ... (PT009)
  • 601-601: Use a regular assert instead of unittest-style assertEqual. Replace assertEqual(...) with assert ... (PT009)
  • 602-602: Use a regular assert instead of unittest-style assertIsNotNone. Replace assertIsNotNone(...) with assert ... (PT009)
  • 603-603: Use a regular assert instead of unittest-style assertIsNotNone. Replace assertIsNotNone(...) with assert ... (PT009)
  • 604-604: Use a regular assert instead of unittest-style assertIsInstance. Replace assertIsInstance(...) with assert ... (PT009)
  • 605-605: Use a regular assert instead of unittest-style assertGreater. Replace assertGreater(...) with assert ... (PT009)
  • 607-607: Use a regular assert instead of unittest-style assertIn. Replace assertIn(...) with assert ... (PT009)
  • 608-608: Use a regular assert instead of unittest-style assertIn. Replace assertIn(...) with assert ... (PT009)
  • 627-627: Use a regular assert instead of unittest-style assertEqual. Replace assertEqual(...) with assert ... (PT009)
  • 628-628: Use a regular assert instead of unittest-style assertIsNotNone. Replace assertIsNotNone(...) with assert ... (PT009)
  • 629-629: Use a regular assert instead of unittest-style assertIsNotNone. Replace assertIsNotNone(...) with assert ... (PT009)
  • 632-632: Use a regular assert instead of unittest-style assertIn. Replace assertIn(...) with assert ... (PT009)
  • 637-637: Use a regular assert instead of unittest-style assertEqual. Replace assertEqual(...) with assert ... (PT009)
  • 653-653: Use a regular assert instead of unittest-style assertEqual. Replace assertEqual(...) with assert ... (PT009)
  • 654-654: Use a regular assert instead of unittest-style assertIsNotNone. Replace assertIsNotNone(...) with assert ... (PT009)
  • 655-655: Use a regular assert instead of unittest-style assertIsNotNone. Replace assertIsNotNone(...) with assert ... (PT009)
  • 656-656: Use a regular assert instead of unittest-style assertGreaterEqual. Replace assertGreaterEqual(...) with assert ... (PT009)
  • 657-657: Use a regular assert instead of unittest-style assertLessEqual. Replace assertLessEqual(...) with assert ... (PT009)
  • 660-660: Use a regular assert instead of unittest-style assertTrue. Replace assertTrue(...) with assert ... (PT009)
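
For readers unfamiliar with PT009, the rewrite Ruff asks for looks roughly like the sketch below. The summarize helper and both tests are hypothetical and are not the tests added in this PR; they only show the same checks written in each style:

```python
import unittest


def summarize(values):
    """Hypothetical helper: count non-missing values and list distinct ones."""
    present = [v for v in values if v is not None]
    return {"count": len(present), "unique": sorted(set(present))}


class SummarizeUnittestStyle(unittest.TestCase):
    # unittest-style assertions, the form PT009 flags.
    def test_summary(self):
        result = summarize(["a", None, "b", "a"])
        self.assertEqual(result["count"], 3)
        self.assertIsNotNone(result["unique"])
        self.assertIn("a", result["unique"])


def test_summary_plain_assert():
    # The same checks written as plain asserts, the form PT009 suggests.
    result = summarize(["a", None, "b", "a"])
    assert result["count"] == 3
    assert result["unique"] is not None
    assert "a" in result["unique"]
```

Both variants check the same conditions; the plain form relies on pytest's assertion rewriting for readable failure messages.
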

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
  • GitHub Check: Build and push artifacts for Python 3.11
  • GitHub Check: Build and push artifacts for Python 3.9
  • GitHub Check: Build and push artifacts for Python 3.10
  • GitHub Check: Build and push artifacts for Python 3.13
  • GitHub Check: Build and push artifacts for Python 3.12
  • GitHub Check: Test - Python 3.9
  • GitHub Check: Test - Python 3.10
  • GitHub Check: Test - Python 3.11
🔇 Additional comments (3)
tests/unit/test_analyze_columns_pandas.py (3)

5-5: LGTM!

Import is necessary for the new Trino-specific tests.


610-637: LGTM!

Test properly validates missing value handling with NamedRowTuple objects.


639-660: LGTM!

Test correctly validates category aggregation with "others" bucket for many unique NamedRowTuple values.

@deepnote-bot
🚀 Review App Deployment Started

📝 Description 🌐 Link / Info
🌍 Review application ra-23
🔑 Sign-in URL Click to sign-in
📊 Application logs View logs
🔄 Actions Click to redeploy
🚀 ArgoCD deployment View deployment
Last deployed 2025-11-07 08:40:28 (UTC)
📜 Deployed commit c47eb74206aeb574189ed0bef14c2d7cbfaf190e
🛠️ Toolkit version e039330

OlegWock commented Nov 7, 2025

You can test this with the following Trino SQL query:

SELECT CAST(ROW(1, 'Alice') AS ROW(id INTEGER, name VARCHAR)) AS user
UNION ALL
SELECT CAST(ROW(2, 'Bob')   AS ROW(id INTEGER, name VARCHAR)) AS user

Or by constructing the DataFrame manually:

from trino.types import NamedRowTuple
import pandas as pd
import numpy as np

row1 = NamedRowTuple(values=[1, "Alice"], names=["id", "name"], types=["integer", "varchar"])
row2 = NamedRowTuple(values=[2, "Bob"], names=["id", "name"], types=["integer", "varchar"])

np_array = np.empty(2, dtype=object)
np_array[0] = row1
np_array[1] = row2
df = pd.DataFrame({"col1": np_array})

df

@OlegWock OlegWock marked this pull request as ready for review November 7, 2025 08:57
@OlegWock OlegWock requested a review from a team as a code owner November 7, 2025 08:57
@m1so m1so merged commit 663dce1 into main Nov 7, 2025
33 of 34 checks passed
@m1so m1so deleted the oleh/blu-5137-column-analysis-fails-on-trinotypesnamedrowtuple branch November 7, 2025 14:00