Support to parse numbers in text-based input formats #17082
…arsers. This helps samplers detect numeric types for text-based formats like csv and tsv. By default, these text-based formats parse numbers as strings. This change adds a config flag to optionally parse numbers as numbers: Long for integers and Double for floating-point numbers. It falls back to string if the value cannot be parsed. The web-console has some code in the load data flow that parses the sample of data returned by the Druid sampler to further inspect types, so it can convert them to specific numeric types, if applicable. After this change, the web-console sampler and other applications can rely on Druid to do it.
processing/src/main/java/org/apache/druid/java/util/common/parsers/ParserUtils.java
@Nullable
private static Object tryParseStringAsNumber(@Nullable final String input)
{
  if (!NumberUtils.isNumber(input)) {
I wonder if this is worth looping over the string an extra time before we do the parse attempts, or if we should just start by trying to parse it as a long. I guess having this function call saves the double tryParse, which uses a regex pattern.
Yeah, I considered something like that. However, it adds the regex overhead that you note for string inputs, so I kept the current approach, which is optimized for non-numeric strings.
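To make the trade-off in this thread concrete, here is a minimal JDK-only sketch of the "cheap pre-check, then Long, then Double" order. It is not the code in ParserUtils: the class and method names are hypothetical, the hand-rolled character scan only stands in for NumberUtils.isNumber (whose acceptance rules differ), and plain Long.parseLong / Double.parseDouble replace whatever tryParse helpers the PR actually uses.

```java
final class NumberParsingSketch
{
  // Hypothetical stand-in for NumberUtils.isNumber: one pass over the string,
  // true only if every character could appear in a decimal integer or
  // floating-point literal.
  static boolean looksNumeric(String input)
  {
    if (input == null || input.isEmpty()) {
      return false;
    }
    for (int i = 0; i < input.length(); i++) {
      char c = input.charAt(i);
      boolean allowed = (c >= '0' && c <= '9')
                        || c == '+' || c == '-' || c == '.' || c == 'e' || c == 'E';
      if (!allowed) {
        return false;
      }
    }
    return true;
  }

  // Returns a Long, a Double, or the original input if neither parse succeeds.
  static Object tryParseSketch(String input)
  {
    if (!looksNumeric(input)) {
      // Non-numeric strings exit here without any parse attempt.
      return input;
    }
    try {
      return Long.parseLong(input);
    }
    catch (NumberFormatException notALong) {
      // Not an integer that fits in a long; try floating point next.
    }
    try {
      return Double.parseDouble(input);
    }
    catch (NumberFormatException notADouble) {
      // Passed the character scan but is not a valid number, e.g. "1-2".
      return input;
    }
  }

  public static void main(String[] args)
  {
    for (String s : new String[]{"42", "-7", "3.14", "1e3", "abc", "1-2"}) {
      Object parsed = tryParseSketch(s);
      System.out.println(s + " -> " + parsed + " (" + parsed.getClass().getSimpleName() + ")");
    }
  }
}
```

In this sketch a plain string such as "abc" costs one short scan and no parse attempts, which is the case the author says the current approach is optimized for. Numeric inputs pay for the scan plus the parse. Integers too large for a long fall through to Double here; whether the PR does the same is not stated in this thread.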
Conflict in sql/src/test/java/org/apache/druid/sql/calcite/IngestTableFunctionTest.java
Text-based input formats like csv and tsv currently parse inputs only as strings (following the RFC4180Parser spec). To work around this, the web-console and other tools need to further inspect the sample data returned by the Druid sampler API to parse them as numbers. See here for the relevant web-console code.
Changes:
tryParseNumbers for the csv and tsv input formats.
Key classes to review:
ParserUtils
CsvInputFormat
CsvParser
DelimitedInputFormat
DelimitedValueReader
Release note:
Introduce a new optional config, tryParseNumbers, for the csv and tsv input formats. If enabled, any numbers present in the input will be parsed as follows: long data type for integer types and double for floating-point numbers. By default, this configuration is set to false, so numeric strings will be treated as strings.
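As a rough illustration of the release note, the sketch below shows how one delimited row might look with the flag off and on. Only the flag name tryParseNumbers comes from this PR. The class RowSketch, the method parseRow, and the naive comma split are hypothetical and are not Druid's CsvParser or DelimitedValueReader.

```java
import java.util.ArrayList;
import java.util.List;

final class RowSketch
{
  // Hypothetical helper: Long first, then Double, else the original string.
  // Note: Double.parseDouble also accepts forms such as "NaN" or "1d", which a
  // real implementation would likely want to filter out beforehand.
  static Object toNumberOrString(String field)
  {
    try {
      return Long.parseLong(field);
    }
    catch (NumberFormatException notALong) {
      // Fall through to the floating-point attempt.
    }
    try {
      return Double.parseDouble(field);
    }
    catch (NumberFormatException notADouble) {
      return field;
    }
  }

  // Splits a line on commas (no quoting support) and, if the flag is set,
  // converts each field to a Long or Double where possible.
  static List<Object> parseRow(String line, boolean tryParseNumbers)
  {
    List<Object> row = new ArrayList<>();
    for (String field : line.split(",", -1)) {
      row.add(tryParseNumbers ? toNumberOrString(field) : field);
    }
    return row;
  }

  public static void main(String[] args)
  {
    String line = "1,2.5,abc";
    System.out.println(parseRow(line, false)); // every field stays a String
    System.out.println(parseRow(line, true));  // Long, Double, String
  }
}
```

With the flag off, all three fields remain strings, matching the default described above. With it on, the first field becomes a Long, the second a Double, and the third falls back to the original string.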