@mborodii-prog
Contributor

@mborodii-prog mborodii-prog commented Dec 2, 2025

XLSB Connector specification

Overview

Implements a dedicated connector for reading Binary Excel (.xlsb) files. The .xlsb format typically produces smaller files that load faster than the standard .xlsx format.

Usage Examples

Reading XLSB Files

import wrangles

# Basic XLSB read
df = wrangles.connectors.xlsb.read('data.xlsb')

# With column selection
df = wrangles.connectors.xlsb.read('data.xlsb', columns=['col1', 'col2'])

# With specific sheet
df = wrangles.connectors.xlsb.read('data.xlsb', sheet_name='Sheet1')

# With skiprows
df = wrangles.connectors.xlsb.read('data.xlsb', skiprows=5)
df = wrangles.connectors.xlsb.read('data.xlsb', skiprows=[0, 2, 5])

# With nrows
df = wrangles.connectors.xlsb.read('data.xlsb', nrows=5)

# With header
df = wrangles.connectors.xlsb.read('data.xlsb', header=0)

# Large File Processing
df = wrangles.connectors.xlsb.read('large_file.xlsb', chunksize=10000)

# Data Transformation
df = wrangles.connectors.xlsb.read('data.xlsb',   
    converters={'Price': 'lambda x: float(x.replace("$", ""))'}  
)

Using in Recipes

# Read XLSB
read:  
  - xlsb:  
      name: data.xlsb  

# Read XLSB with specific sheet
read:  
  - xlsb:  
      name: data.xlsb  
      sheet_name: Basic

# Read XLSB with columns list
read:  
  - xlsb:  
      name: tests/samples/data.xlsb  
      columns:  
        - Col1
        - Col2
        - Col3?

Data Pipeline Integration

read:
  - xlsb:
      name: inventory_data.xlsb
      chunksize: 10000
      converters:
        Product_Code: "lambda x: x[:3] + '-' + x[3:]"
        Price: "lambda x: float(x.replace('$', ''))"

wrangles:
  - convert.case:
      input: Product_Name
      case: upper
  - filter.where:
      where: Price > 0

write:
  - file:
      name: cleaned_inventory.csv
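
To make the intent of the pipeline above easier to check, here is a rough pandas-only sketch of the same three steps (converters, upper-casing, filtering). It does not call the connector; the sample rows and the `clean_inventory` helper are hypothetical and exist only for illustration.

```python
import pandas as pd

# Hypothetical rows standing in for the contents of inventory_data.xlsb
raw = pd.DataFrame({
    "Product_Code": ["ABC123", "XYZ789", "QRS456"],
    "Product_Name": ["widget", "gadget", "sample"],
    "Price": ["$10.50", "$3.00", "$0"],
})

# The converters from the recipe, written as callables instead of strings
converters = {
    "Product_Code": lambda x: x[:3] + "-" + x[3:],
    "Price": lambda x: float(x.replace("$", "")),
}


def clean_inventory(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the converters, upper-case the names, and keep rows with Price > 0."""
    out = df.copy()
    for column, func in converters.items():
        out[column] = out[column].map(func)
    out["Product_Name"] = out["Product_Name"].str.upper()
    return out[out["Price"] > 0].reset_index(drop=True)


cleaned = clean_inventory(raw)
print(cleaned)
```

A fixture like this could also serve as the expected result when testing the recipe end to end.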

Explicit Connector Parameters

All parameters are explicitly documented and validated:

Read Parameters:

  • name - Name of the file to import (required)
  • columns - Subset of the columns to be read
  • nrows - Number of rows to read
  • sheet_name - Name or index of sheet to read. Default 0 (first sheet)
  • header - Row(s) to use as column names. Default 0
  • names - List of column names to use
  • index_col - Column(s) to use as row labels
  • dtype - Data type for columns
  • skiprows - Rows to skip at beginning
  • na_values - Additional strings to recognize as NA/NaN
  • keep_default_na - Whether to keep default NA values
  • na_filter - Detect missing value markers
  • parse_dates - Parse date columns
  • date_format - Format to use for parsing dates
  • thousands - Thousands separator
  • decimal - Decimal separator
  • comment - Comment character
  • skipfooter - Rows to skip at end
  • chunksize - Number of rows to read at a time for large files
  • max_memory_mb - Maximum memory usage in MB before switching to chunked mode

Notes

  • Read-only connector (writing .xlsb files is not supported by the available Python libraries)
  • Requires the pyxlsb>=1.0.10 dependency
  • Automatically enables chunked mode for large files, based on the max_memory_mb parameter
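
The description does not say how the switch to chunked mode is decided, so the following is only a sketch of one possible approach. The names `resolve_chunksize` and `iter_chunks`, the `expansion_factor` heuristic, and the default of 10,000 rows are all assumptions made for illustration, not the connector's actual logic.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Optional


def resolve_chunksize(
    file_size_bytes: int,
    max_memory_mb: float,
    chunksize: Optional[int] = None,
    default_chunksize: int = 10_000,
    expansion_factor: float = 10.0,
) -> Optional[int]:
    """Return the chunk size to use, or None to read the file in one pass.

    expansion_factor is a guessed ratio between the size on disk and the
    size in memory; a real implementation would need to measure this.
    """
    if chunksize is not None:
        # An explicit chunksize always wins over the automatic decision
        return chunksize
    estimated_mb = file_size_bytes * expansion_factor / (1024 * 1024)
    return default_chunksize if estimated_mb > max_memory_mb else None


def iter_chunks(rows: Iterable[list], chunksize: int) -> Iterator[List[list]]:
    """Group an iterable of rows into lists of at most chunksize rows."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, chunksize))
        if not chunk:
            return
        yield chunk


if __name__ == "__main__":
    print(resolve_chunksize(100 * 1024 * 1024, max_memory_mb=512))
    print([len(c) for c in iter_chunks([[i] for i in range(5)], 2)])
```

Keeping the decision in a small pure function like this would make the threshold behaviour straightforward to unit test without needing a large sample file.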

Potential Enhancements for XLSB Connector

  • Multi-sheet processing
  • Preserving formulas

@mborodii-prog mborodii-prog linked an issue Dec 2, 2025 that may be closed by this pull request
@lmolotii

lmolotii commented Dec 4, 2025

Requested styling improvements

  1. Please format the description of the connector. For an example, see: #827 Add implementation for the new JSON connector #835
  2. When you create a PR, please apply the appropriate labels.

Unit tests

Please implement unit tests for the following scenarios:

  • Error handling tests: file not found, corrupted or invalid .xlsb file, invalid sheet name or index, non-existent columns in the columns parameter, empty file (0 rows, 0 columns), file with no headers (header=None case).
  • Edge case tests: reading multiple sheets (sheet_name=None or sheet_name=[0,1]), files with special characters in column names, files with duplicate column names, files with merged cells, files with only headers (0 data rows), files with formulas and calculated values.
  • Data type tests: mixed data types in columns, date/datetime columns with various formats, numeric precision (floats, decimals), boolean values (True/False, 1/0, Yes/No), empty cells vs. NULL vs. the "NULL" string, large numbers (beyond the int32 range), Unicode and special characters.
  • Parameter combination tests: columns + usecols interaction (are they mutually exclusive?), skiprows + header interaction, nrows + skipfooter interaction, names parameter with the wrong number of columns, converters with invalid column names, index_col with a multi-index.
  • Wildcard/regex column selection: columns with wildcard patterns (e.g., "Col*"), columns with regex patterns (e.g., "regex:Col[0-9]+"), columns with negative patterns (e.g., "-Ignore*"), columns with an optional pattern (e.g., "OptionalCol?").
  • File object vs. path tests: reading from file_object (BytesIO), reading from different path types (relative, absolute), reading from file-like objects.
  • Integration tests: using the xlsb connector within a recipe with wrangles, comparing results with a regular xlsx file (same data).
  • pandas-specific tests: true_values and false_values with custom values, converters with lambda functions, dtype as a dict for per-column types, parse_dates as a list of columns, comment parameter filtering.
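
As a starting point for the wildcard/regex column selection tests, the sketch below shows one way the four pattern styles listed above could be interpreted, so that expected results can be written down explicitly. The `select_columns` function and its exact semantics (in particular, how exclusions and optional columns combine) are assumptions for illustration and may not match how the library resolves patterns.

```python
import re
from fnmatch import fnmatchcase
from typing import List, Sequence


def select_columns(available: Sequence[str], patterns: Sequence[str]) -> List[str]:
    """Resolve column patterns against the available column names.

    Assumed semantics (hypothetical):
      "Col*"        shell-style wildcard
      "regex:..."   regular expression that must match the whole name
      "-Ignore*"    exclude matching columns
      "Name?"       optional column, no error if it is missing
    """
    includes = [p for p in patterns if not p.startswith("-")]
    excludes = [p[1:] for p in patterns if p.startswith("-")]

    def matches(pattern: str) -> List[str]:
        if pattern.startswith("regex:"):
            rx = re.compile(pattern[len("regex:"):].strip())
            return [c for c in available if rx.fullmatch(c)]
        return [c for c in available if fnmatchcase(c, pattern)]

    # With no include patterns, start from every column and only exclude
    selected: List[str] = [] if includes else list(available)
    for pattern in includes:
        optional = pattern.endswith("?") and not pattern.startswith("regex:")
        name = pattern[:-1] if optional else pattern
        found = matches(name)
        if not found and not optional:
            raise ValueError(f"Column not found: {pattern}")
        selected.extend(c for c in found if c not in selected)

    dropped = {c for pattern in excludes for c in matches(pattern)}
    return [c for c in selected if c not in dropped]


if __name__ == "__main__":
    columns = ["Col1", "Col2", "Other", "IgnoreMe"]
    print(select_columns(columns, ["Col*", "Missing?"]))
```

Each pattern style then maps to one small test case, including a case where a required column is missing and an error is expected.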


@lmolotii lmolotii left a comment

Please review comments in the review section to address the critical issues related to the connector.

@mborodii-prog mborodii-prog marked this pull request as draft December 9, 2025 09:06
@mborodii-prog
Contributor Author

@ebhills @thomasstvr Please review the updated Binary Excel connector.

@mborodii-prog mborodii-prog marked this pull request as ready for review December 16, 2025 15:02
@ebhills
Collaborator

ebhills commented Dec 16, 2025

@mborodii-prog - let's discuss why this cannot just be handled by the excel connector. Also, it looks like you are adding an old library (xlsb).
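
For context on that question: pandas `read_excel` accepts `engine="pyxlsb"`, so one option would be to pick the engine from the file extension inside a single Excel reader. The sketch below illustrates that idea only. The `pick_excel_engine` helper and the mapping are hypothetical, and nothing in this thread confirms how the existing excel connector chooses its engine.

```python
from pathlib import Path

# Engines that pandas.read_excel accepts for each extension (illustrative mapping)
ENGINE_BY_EXTENSION = {
    ".xlsx": "openpyxl",
    ".xlsm": "openpyxl",
    ".xls": "xlrd",
    ".xlsb": "pyxlsb",
}


def pick_excel_engine(path: str) -> str:
    """Return the pandas engine name for an Excel file, based on its extension."""
    suffix = Path(path).suffix.lower()
    try:
        return ENGINE_BY_EXTENSION[suffix]
    except KeyError:
        raise ValueError(f"Unsupported Excel extension: {suffix!r}") from None


if __name__ == "__main__":
    # A single reader could then call, for example:
    #   pd.read_excel(path, engine=pick_excel_engine(path))
    print(pick_excel_engine("data.xlsb"))
```

This would still need the pyxlsb dependency for .xlsb files, so it addresses the duplicated connector rather than the concern about the library itself.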

Development

Successfully merging this pull request may close these issues.

Binary Excel Connector

4 participants