Support Parquet unsigned integer types #149405
Hi @swallez, I've created a changelog YAML for you.
Pinging @elastic/es-analytical-engine (Team:Analytics)
Summary
Parquet supports unsigned integer annotations (`UINT_8`, `UINT_16`, `UINT_32`, `UINT_64`) on top of its 32-bit and 64-bit physical integer types. Before this change, these annotations were ignored and the values were read as signed integers, causing data corruption for values that exceed the signed range (e.g., a `UINT_32` value of `3,000,000,000` would be read as `-1,294,967,296`).

Type mapping (`ParquetFormatReader`):

- `INT32` with unsigned annotation and 32-bit width → `LONG`, to hold the full `[0, 2^32)` unsigned range
- `INT32` with unsigned annotation and smaller width (8 or 16 bits) → `INTEGER`, since values fit within the signed int range
- `INT64` with unsigned annotation → `UNSIGNED_LONG`

Value reading (`PageColumnReader`, `ParquetFormatReader`):

- `UNSIGNED_LONG` is now handled alongside `LONG` in the INT32-backed dispatch paths
- Values are widened with `Integer.toUnsignedLong()` so the bit pattern is preserved rather than sign-extended

Tests (`ParquetFormatReaderTests`):

- `testLargeUint32`: `0xFFFFFFFF` in a `UINT_32` column maps to type `LONG` and reads as `4294967295`
- `testLargeUint8`: value `200` in a `UINT_8` column maps to `INTEGER` and reads correctly
- `testLargeUnsignedLong`: `0xFFFFFFFFFFFFFFFFL` in a `UINT_64` column maps to `UNSIGNED_LONG` and round-trips correctly

The dispatch logic between `PageColumnReader#readBatch` and `ParquetFormatReader#readColumnBlock` remains duplicated; a `KEEP IN SYNC` warning comment was added to both sites.

Fixes https://github.com/elastic/esql-planning/issues/316
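
For readers who want to see the sign-extension problem and the type mapping in isolation, here is a minimal standalone sketch. It is not the code from this PR: the class `UnsignedWideningSketch`, the enum `TargetType`, and the helpers `mapUnsigned`, `widenUint32`, and `uint64ToString` are hypothetical names made up for illustration. Only `Integer.toUnsignedLong` and `Long.toUnsignedString` are real JDK calls.

```java
// Illustrative sketch only; all names except the JDK calls are hypothetical.
final class UnsignedWideningSketch {

    // Stand-ins for the target types named in the summary.
    enum TargetType { INTEGER, LONG, UNSIGNED_LONG }

    // Mapping as described in the summary:
    // physicalBits is the Parquet physical width (32 or 64),
    // annotatedBits is the width of the unsigned annotation (8, 16, 32 or 64).
    static TargetType mapUnsigned(int physicalBits, int annotatedBits) {
        if (physicalBits == 64) {
            return TargetType.UNSIGNED_LONG;
        }
        // UINT_32 needs a long to cover [0, 2^32); UINT_8 and UINT_16 fit in a signed int.
        return annotatedBits == 32 ? TargetType.LONG : TargetType.INTEGER;
    }

    // Zero-extends the 32-bit pattern instead of sign-extending it.
    static long widenUint32(int raw) {
        return Integer.toUnsignedLong(raw);
    }

    // Renders a 64-bit pattern as its unsigned decimal value.
    static String uint64ToString(long raw) {
        return Long.toUnsignedString(raw);
    }

    public static void main(String[] args) {
        // Bit pattern that a UINT_32 value of 3,000,000,000 occupies in an INT32 slot.
        int raw = (int) 3_000_000_000L;

        System.out.println((long) raw);            // sign-extended: -1294967296
        System.out.println(widenUint32(raw));      // zero-extended: 3000000000
        System.out.println(widenUint32(0xFFFFFFFF)); // 4294967295
        System.out.println(uint64ToString(0xFFFFFFFFFFFFFFFFL)); // 18446744073709551615

        System.out.println(mapUnsigned(32, 32)); // LONG
        System.out.println(mapUnsigned(32, 8));  // INTEGER
        System.out.println(mapUnsigned(64, 64)); // UNSIGNED_LONG
    }
}
```

The plain `(long)` cast reproduces the pre-change behavior from the summary, while `Integer.toUnsignedLong` keeps the same 32 bits and fills the upper half with zeros.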