486-binary-excel-connector #841

mborodii-prog · 2025-12-02T17:10:47Z

XLSB Connector specification

Overview

Implements a dedicated connector for reading Binary Excel (.xlsb) files. The .xlsb format generally produces smaller files and loads faster than the standard .xlsx format.

Usage Examples

Reading XLSB Files

import wrangles

# Basic XLSB read
df = wrangles.connectors.xlsb.read('data.xlsb')

# With column selection
df = wrangles.connectors.xlsb.read('data.xlsb', columns=['col1', 'col2'])

# With specific sheet
df = wrangles.connectors.xlsb.read('data.xlsb', sheet_name='Sheet1')

# With skiprows
df = wrangles.connectors.xlsb.read('data.xlsb', skiprows=5)
df = wrangles.connectors.xlsb.read('data.xlsb', skiprows=[0, 2, 5])

# With nrows
df = wrangles.connectors.xlsb.read('data.xlsb', nrows=5)

# With header
df = wrangles.connectors.xlsb.read('data.xlsb', header=0)

# Large File Processing
df = wrangles.connectors.xlsb.read('large_file.xlsb', chunksize=10000)

# Data Transformation
df = wrangles.connectors.xlsb.read('data.xlsb',   
    converters={'Price': 'lambda x: float(x.replace("$", ""))'}  
)

Using in Recipes

# Read XLSB
read:  
  - xlsb:  
      name: data.xlsb  

# Read XLSB with specific sheet
read:  
  - xlsb:  
      name: data.xlsb  
      sheet_name: Basic

# Read XLSB with columns list
read:  
  - xlsb:  
      name: tests/samples/data.xlsb  
      columns:  
        - Col1
        - Col2
        - Col3?

Data Pipeline Integration

read:  
  xlsb:  
    name: inventory_data.xlsb  
    chunksize: 10000  
    converters:  
      Product_Code: "lambda x: x[:3] + '-' + x[3:]"  
      Price: "lambda x: float(x.replace('$', ''))"  
      
wrangles:  
  - convert.case:  
      input: Product_Name  
      case: upper  
  - filter.where:  
      where: Price > 0  
  
write:  
  file:  
    name: cleaned_inventory.csv
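
The two converters in the recipe above are Python lambda expressions stored as strings. As a minimal sketch of what they compute, here they are written as plain lambdas and applied to sample values. The sample values are made up for illustration, and how the connector evaluates the strings is not shown in this description.

```python
# Converter expressions copied from the recipe above, written as plain lambdas.
product_code = lambda x: x[:3] + '-' + x[3:]   # insert a dash after the first 3 characters
price = lambda x: float(x.replace('$', ''))    # drop the dollar sign, then parse as float

# Illustrative inputs only; they do not come from a real inventory file.
print(product_code('ABC12345'))  # ABC-12345
print(price('$19.99'))           # 19.99
```

Note that the Price converter assumes every cell is a string. A numeric cell would raise an AttributeError on `.replace`, which is one reason the "converters with lambda functions" test scenario is worth covering.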

Explicit Connector Parameters

All parameters are explicitly documented and validated:

Read Parameters:

name - Name of the file to import (required)
columns - Subset of the columns to be read
nrows - Number of rows to read
sheet_name - Name or index of sheet to read. Default 0 (first sheet)
header - Row(s) to use as column names. Default 0
names - List of column names to use
index_col - Column(s) to use as row labels
dtype - Data type for columns
skiprows - Rows to skip at beginning
na_values - Additional strings to recognize as NA/NaN
keep_default_na - Whether to keep default NA values
na_filter - Detect missing value markers
parse_dates - Parse date columns
date_format - Format to use for parsing dates
thousands - Thousands separator
decimal - Decimal separator
comment - Comment character
skipfooter - Rows to skip at end
chunksize - Number of rows to read at a time for large files
max_memory_mb - Maximum memory usage in MB before switching to chunked mode

Notes
Read-only connector (writing .xlsb files not supported by available Python libraries)
Requires pyxlsb>=1.0.10 dependency
Automatically enables chunked mode for large files based on max_memory_mb parameter
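
The automatic switch to chunked mode could work roughly as sketched below. This is an assumption for illustration, not the connector's actual logic: the function names, the `expansion_factor` heuristic and its default value are all hypothetical.

```python
from itertools import islice


def needs_chunking(file_size_bytes, max_memory_mb, expansion_factor=10):
    """Hypothetical heuristic: estimate in-memory size from the on-disk size.

    expansion_factor is an assumed multiplier for how much a compressed
    binary workbook grows once loaded; the real connector may differ.
    """
    estimated_mb = file_size_bytes / (1024 * 1024) * expansion_factor
    return estimated_mb > max_memory_mb


def iter_chunks(rows, chunksize):
    """Yield lists of at most `chunksize` rows from any iterable of rows."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, chunksize))
        if not chunk:
            return
        yield chunk


# A 2 MB file with a 10 MB budget is estimated at 20 MB, so it is chunked.
if needs_chunking(2 * 1024 * 1024, max_memory_mb=10):
    for chunk in iter_chunks(range(25), chunksize=10):
        print(len(chunk))  # 10, 10, 5
```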

Potential Enhancements for XLSB Connector

multisheet processing
keeping formulas

lmolotii · 2025-12-04T13:53:54Z

Requested improvements styling

Please format the connector description. For an example, see: #827 Add implementation for the new JSON connector #835
When you create a PR, please apply the appropriate labels.

Unit tests

Please implement unit tests for the following scenarios:

Error Handling Tests (file not found, corrupted/invalid .xlsb file, invalid sheet name/index, non-existent columns in the columns parameter, empty file (0 rows, 0 columns), file with no headers (header=None case))
Edge Case Tests (reading multiple sheets (sheet_name=None or sheet_name=[0,1]), files with special characters in column names, files with duplicate column names, files with merged cells, files with only headers (0 data rows), files with formulas and calculated values)
Data Type Tests (mixed data types in columns, date/datetime columns with various formats, numeric precision (floats, decimals), boolean values (True/False, 1/0, Yes/No), empty cells vs NULL vs "NULL" string, large numbers (beyond int32 range), unicode and special characters)
Parameter Combination Tests (columns + usecols interaction (are they mutually exclusive?), skiprows + header interaction, nrows + skipfooter interaction, names parameter with the wrong number of columns, converters with invalid column names, index_col with multi-index)
Wildcard/Regex Column Selection (columns with wildcard patterns (e.g., "Col*"), columns with regex patterns (e.g., "regex:Col[0-9]+"), columns with negative patterns (e.g., "-Ignore*"), columns with optional patterns (e.g., "OptionalCol?"))
File Object vs Path Tests (reading from file_object (BytesIO), reading from different path types (relative, absolute), reading from file-like objects)
Integration Tests (using the xlsb connector within a recipe with wrangles, comparing results with a regular xlsx file (same data))
Specific pandas tests (true_values and false_values with custom values, converters with lambda functions, dtype as dict for per-column types, parse_dates as a list of columns, comment parameter filtering)
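
To make the wildcard/regex scenarios concrete, here is a minimal sketch of the selection semantics those tests would pin down. `resolve_columns` is a hypothetical stand-in, not the wrangles implementation, and the exact rules (full-match regex, exclusions applied last, optional names skipped silently) are assumptions.

```python
import fnmatch
import re


def resolve_columns(available, requested):
    """Hypothetical resolver for the column patterns listed above."""
    selected, excluded = [], set()
    for pattern in requested:
        optional = False
        if pattern.startswith("-"):
            # Negative pattern: remember matches and remove them at the end.
            excluded.update(
                c for c in available if fnmatch.fnmatchcase(c, pattern[1:])
            )
            continue
        if pattern.startswith("regex:"):
            rx = re.compile(pattern[len("regex:"):])
            matches = [c for c in available if rx.fullmatch(c)]
        elif pattern.endswith("?"):
            # Optional column: a missing name is not an error.
            optional = True
            matches = [c for c in available if c == pattern[:-1]]
        else:
            matches = [c for c in available if fnmatch.fnmatchcase(c, pattern)]
        if not matches and not optional:
            raise KeyError(f"No column matches {pattern!r}")
        selected.extend(c for c in matches if c not in selected)
    return [c for c in selected if c not in excluded]


available = ["Col1", "Col2", "Ignore_me", "Other"]
print(resolve_columns(available, ["Col*", "-Ignore*", "OptionalCol?"]))
# ['Col1', 'Col2']
```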

lmolotii

Please review comments in the review section to address the critical issues related to the connector.

wrangles/connectors/xlsb.py


mborodii-prog · 2025-12-16T15:02:28Z

@ebhills @thomasstvr Pls review updated binary excel connector

ebhills · 2025-12-16T16:32:56Z

@mborodii-prog - let's discuss why this cannot just be handled by the excel connector. Also, it looks like you are adding an old library (xlsb).
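
For context on this question: `pandas.read_excel` accepts `engine="pyxlsb"`, so one way an existing Excel reader could cover .xlsb files is to pick the engine from the file extension. The helpers below are a hypothetical sketch and not code from this PR; `pick_engine` and `read_any_excel` are illustrative names.

```python
from pathlib import Path


def pick_engine(path):
    """Return the pandas engine name for .xlsb files, else None (pandas default)."""
    return "pyxlsb" if Path(path).suffix.lower() == ".xlsb" else None


def read_any_excel(path, **kwargs):
    """Read .xlsx or .xlsb through the same pandas entry point.

    pandas is imported lazily; reading .xlsb also requires pyxlsb to be installed.
    """
    import pandas as pd

    return pd.read_excel(path, engine=pick_engine(path), **kwargs)


print(pick_engine("data.xlsb"))  # pyxlsb
print(pick_engine("data.xlsx"))  # None
```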

486-binary-excel-connector

23379c3

mborodii-prog requested review from ebhills and thomasstvr December 2, 2025 17:10

mborodii-prog linked an issue Dec 2, 2025 that may be closed by this pull request

Binary Excel Connector #486

Open

486-binary-excel-connector

39192bc

lmolotii requested changes Dec 4, 2025


test fix

1c1735d

mborodii-prog marked this pull request as draft December 9, 2025 09:06

486-binary-excel-connector

728d9de

mborodii-prog added the connector label Dec 10, 2025

mborodii-prog added 6 commits December 10, 2025 22:13

486-binary-excel-connector

cbc2e09

Merge remote-tracking branch 'origin/main' into 486-binary-excel-conn…

5898bcb

…ector

Add implementation for the new XLSB connector, add corresponding tests

5a6e479

486-binary-excel-connector

5e7be50

486-binary-excel-connector

67048e6

486-binary-excel-connector

0a3c9e9

mborodii-prog marked this pull request as ready for review December 16, 2025 15:02
