Skip to content

Conversation

abhishekrb19
Copy link
Contributor

@abhishekrb19 abhishekrb19 commented Sep 16, 2024

Text-based input formats like csv and tsv currently parse inputs only as strings, following the RFC4180Parser spec).
To workaround this, the web-console and other tools need to further inspect the sample data returned to sample data returned by the Druid sampler API to parse them as numbers. See here for the relevant web-console code.

Changes:

  • Introduce a new optional config tryParseNumbers for the csv and tsv input formats.
  • If enabled, the input format reader will parse numeric values from the input string for these text-based input formats.
  • By default, this configuration is false for backwards compatibility
  • Note that when enabled, this can have a small performance hit as the input needs to be parsed to determined the numeric type.

Key classes to review:

  • ParserUtils
  • CsvInputFormat
  • CsvParser
  • DelimitedInputFormat
  • DelimitedValueReader

Release note:

Introduce a new optional config, tryParseNumbers, for the csv and tsv input formats.If enabled, any numbers present in the input will be parsed in the following manner -- long data type for integer types and double for floating-point numbers. By default, this configuration is set to false, so numeric strings will be treated as strings.

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • been tested in a test Druid cluster.

…arsers.

This helps samplers to detect numeric types for text-based formats like
csv and tsv.

These text-based formats by default parse numbers as strings.
This change add a config flag to optionally parse numbers as numbers.
Long for integers and Double for floating-point numbers.
It falls back to string if it cannot parse.

The web-console has some code in the load data flow to parse the sample of data
returned by the Druid sampler to further inspect types so it can convert them
to specific numeric types, if applicable.

After this change, the web-console sampler/other applications can just rely
on Druid to do it.
@Nullable
private static Object tryParseStringAsNumber(@Nullable final String input)
{
if (!NumberUtils.isNumber(input)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i wonder if this is worth looping over the string an extra time before we do try parse attempts, or if we should just start with trying to parse it as a long. I guess having this function call saves the double tryParse which uses a regex pattern.

Copy link
Contributor Author

@abhishekrb19 abhishekrb19 Sep 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I considered something like that. However, it adds an additional overhead of regex that you note for string inputs, so I kept the current approach, which is optimized for non-numeric strings

Conflict in
sql/src/test/java/org/apache/druid/sql/calcite/IngestTableFunctionTest.java
@abhishekrb19 abhishekrb19 merged commit 635e418 into apache:master Sep 19, 2024
90 checks passed
@abhishekrb19 abhishekrb19 deleted the text_format_parse_numbers branch September 19, 2024 20:21
@adarshsanjeev adarshsanjeev added this to the 32.0.0 milestone Jan 16, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants