Skip to content

Excel Date Cells Are Inconsistently Inferred As Strings or Serial Numbers in ST_READ #162

@manickmanick

Description

@manickmanick

When reading an .xlsx file using ST_READ() from the duckdb npm package , the date column is inconsistently interpreted.

  • Some dates are returned as strings (e.g., '22/05/2000')
  • Others are returned as Excel serial numbers (e.g., '37349' for '04/03/2002')

Environment :
npm package :
duckdb - 1.3.1
duckdb-async - 1.3.1

Platform - macos
node version - v18.20.8
Excel Editor: Microsoft Excel on macOS

My node js code below :

duckDbConnection.executeQuery( CREATE TABLE ${excelTempTableName} AS SELECT * FROM ST_READ('${filePath}', layer='${sheet}',
open_options=['HEADERS=FORCE','CELL_TEXT=YES'], allowed_drivers=["XLSX"]), )

steps to reproduce :

  1. create an xlsx file
Image
  1. Upload the excel and try to read the excel using duckdb npm package like the code i shared above.

  2. Observed output :

Image

so when i write below query to get the type of data for that column, i receive a bigint type.

const res = await duckDbConnection.executeQuery(SELECT * FROM ${excelTempTableName}); console.log('table ============> ', res);

  1. Expected output :

All date1 values should be consistently interpreted as proper date type.

I tried below things to fix :

  1. I formatted the date1 column as date in excel file but when i upload the file then that date1 column is interpreted as varchar type.
    [
    {
    column_name: 'id',
    column_type: 'INTEGER',
    min: '1',
    max: '7',
    approx_unique: 5n,
    avg: '3.6',
    std: '2.4083189157584592',
    q25: '2',
    q50: '3',
    q75: '6',
    count: 5n,
    null_percentage: 0
    },
    {
    column_name: 'date1',
    column_type: 'VARCHAR',
    min: '22/05/2000',
    max: '38689',
    approx_unique: 5n,
    avg: null,
    std: null,
    q25: null,
    q50: null,
    q75: null,
    count: 5n,
    null_percentage: 0
    }
    ]

Request :

  1. Consistently infer date columns (preferably to DATE or ISO string format)
  2. Provide an option like PARSE_DATES=TRUE to always convert serials to readable date strings
  3. Respect CELL_TEXT=YES to return visible Excel values, not underlying serials

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions