Fix parsing of non-finite values #3942
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -1082,12 +1082,20 @@ struct __StringToTHelper<T, true> { | |
// Fast (common) path: For numeric inputs in RFC 7159 format: | ||
const bool fast_parse_succeeded = fast_double_parser::parse_number(str.c_str(), &tmp); | ||
|
||
// Rare path: Not in RFC 7159 format. Possible "inf", "nan", etc. Fallback to standard library: | ||
// Rare path: Not in RFC 7159 format. Possible "inf", "nan", etc. | ||
if (!fast_parse_succeeded) { | ||
std::stringstream ss; | ||
Common::C_stringstream(ss); | ||
ss << str; | ||
ss >> tmp; | ||
std::string strlower(str); | ||
|
||
std::transform(strlower.begin(), strlower.end(), strlower.begin(), [](int c) -> char { return static_cast<char>(::tolower(c)); }); | ||
Great clean code @mjmckp ;) Instead of allocating a string, and since you already have a lambda, what about defining a case-insensitive comparison lambda and use [...] Although this is the rare branch there might be longer strings than inf or nan which might be parsed here and might slow down our parsing without need.

Again, this hardly seems worth it, this branch is rarely invoked, meanwhile a colossal amount of strings are being allocated in splitting and parsing the input file, so these few extra allocations are a drop in the ocean. I think our time is better spent adding robust round-trip tests to ensure major bugs like this don't occur again...

Agreed. Will you add such tests?

I could assist in adding the tests, any idea where these should go and how best to implement them?

Sure, would you mind pointing me towards a similar kind of test that I can use as a starting point please?

Unfortunately, we don't have any tests yet. This is something that we should concentrate on in the near future. For now, I think you can take a look at tests from @AlbertoEAF in #3997.

@mjmckp I've merged this PR with the aim to not delay the upcoming release. Please feel free to add tests in a new PR. We'll be very grateful! And thanks a lot for the bug fix!

@StrikerRUS Thanks a lot, I'll get up to speed on how the new tests work and add some tests for this in a new PR soon.
||
if (strlower == std::string("inf")) | ||
tmp = std::numeric_limits<double>::infinity(); | ||
else if (strlower == std::string("-inf")) | ||
tmp = -std::numeric_limits<double>::infinity(); | ||
else if (strlower == std::string("nan")) | ||
Important: Missing -nan handling. Probably best to halve the string comparisons by parsing first the "-" sign.

Ok thanks, I'll add -nan. I don't think it's worth obfuscating this rarely executed branch with optimisations until profiling shows it is a bottleneck.
||
tmp = std::numeric_limits<double>::quiet_NaN(); | ||
else if (strlower == std::string("-nan")) | ||
tmp = -std::numeric_limits<double>::quiet_NaN(); | ||
else | ||
Log::Fatal("Failed to parse double: %s", str.c_str()); | ||
} | ||
|
||
return static_cast<T>(tmp); | ||
|
Keep this code on the `else` branch instead of raising a fatal error.

But that code silently fails, which is completely unacceptable.
In reality we are only parsing strings generated by LightGBM itself when the model was written out to file, so we should ensure there are robust round-trip tests which include models that contain inf and nan values. There are no other possible non-finite values defined for IEEE 754 floating point numbers, and if at some point in the future the standard changed, this would be quickly picked up by the round-trip tests and easily addressed.
I am shocked that such tests don't already exist.

I meant doing number parsing after the nan and inf checks. Your logic makes sense regarding IEEE-754, although I wouldn't recommend dropping that parsing at the end without adding said tests first, not 100.0% sure fast_double_parser parses all numbers.
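A minimal sketch of the kind of round-trip check discussed in this thread is shown below. It is not LightGBM's test code or its serialization routine: `WriteDouble`, `ReadDouble`, and `RoundTrips` are hypothetical stand-ins, the writer is assumed to be a plain `std::ostringstream` in the classic locale, and the reader checks the non-finite spellings first and only then falls back to general number parsing (here `std::strtod` instead of `fast_double_parser`), returning `false` rather than failing silently.

```cpp
#include <algorithm>
#include <cctype>
#include <cmath>
#include <cstdlib>
#include <iomanip>
#include <limits>
#include <locale>
#include <sstream>
#include <string>

// Hypothetical writer: a stand-in for however the model file is serialized.
// max_digits10 significant digits are enough for finite doubles to round-trip.
inline std::string WriteDouble(double value) {
  std::ostringstream ss;
  ss.imbue(std::locale::classic());
  ss << std::setprecision(std::numeric_limits<double>::max_digits10) << value;
  return ss.str();
}

// Hypothetical reader: non-finite spellings first, general parsing afterwards.
// Returns false on failure so the caller can raise an error.
inline bool ReadDouble(const std::string& str, double* out) {
  std::string lower(str);
  std::transform(lower.begin(), lower.end(), lower.begin(),
                 [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
  if (lower == "inf") {
    *out = std::numeric_limits<double>::infinity();
    return true;
  }
  if (lower == "-inf") {
    *out = -std::numeric_limits<double>::infinity();
    return true;
  }
  if (lower == "nan" || lower == "-nan") {
    *out = std::numeric_limits<double>::quiet_NaN();
    return true;
  }
  if (str.empty()) return false;
  char* end = nullptr;
  const double parsed = std::strtod(str.c_str(), &end);
  if (end != str.c_str() + str.size()) return false;  // unparsed trailing input
  *out = parsed;
  return true;
}

// Round-trip property: writing a value and reading it back yields the same
// value. NaN never compares equal to itself, so it is checked with std::isnan.
inline bool RoundTrips(double value) {
  double parsed = 0.0;
  if (!ReadDouble(WriteDouble(value), &parsed)) return false;
  if (std::isnan(value)) return std::isnan(parsed);
  return parsed == value;
}
```

A real test would exercise the actual model writer and reader, ideally on a whole model containing inf and nan values, and not only on individual numbers as this sketch does.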