Fix parsing of non-finite values #3942
…datasets. Prior to this change, the line "score_t threshold = tmp_gradients[top_k - 1];" would generate an exception, since tmp_gradients would be empty when the cnt input value to the function is zero.
…data set given as an array of pointers to rows (as opposed to existing method LGBM_BoosterPredictForMat which requires data given as contiguous array)
…urns zero when input string is "inf", "-inf", or "nan" when compiled with VS2017, so instead just explicitly check for these values and fail if there is no match. No attempt to optimise string allocations in this implementation since it is rarely invoked.
@AlbertoEAF Could you please help to review?
Great catch!
- C++98 does not define how inf/nan values are represented in text, but C99 does, so, unlike stod, stringstreams apparently might not parse inf and nan values on all platforms (especially MSVC): https://www.boost.org/doc/libs/1_70_0/libs/math/doc/html/math_toolkit/fp_facets/facets_intro.html
- The old implementation used stod (which is not locale-safe for our purposes): https://en.cppreference.com/w/cpp/string/basic_string/stof. See the old version of this very function here: 792c930#diff-ddbdb79f73cc159561b86b759b0542fec75ca9f4869a7eac5790e8f9ebaf7ef4R382
- We shouldn't stop parsing numbers when fast_double_parser::parse_number fails. There might be numerical values besides inf or nan which are not parsed either. Hence we should keep the stringstream parsing for numerical values on the slow path instead of giving a fatal error.
std::stringstream ss;
Common::C_stringstream(ss);
ss << str;
ss >> tmp;
Keep this code on the else branch instead of raising a fatal error.
But that code silently fails, which is completely unacceptable.
In reality we are only parsing strings generated by LightGBM itself when the model was written out to file, so we should ensure there are robust round-trip tests which include models that contain inf and nan values. There are no other possible non-finite values defined for IEEE 754 floating point numbers, and if at some point in the future the standard changed, this would be quickly picked up by the round-trip tests and easily addressed.
I am shocked that such tests don't already exist.
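A round-trip check of the kind described here could be sketched roughly as follows. This is only an illustration, not LightGBM's model writer or reader: FormatDouble, ParseDouble and RoundTrips are hypothetical names, and the sketch assumes the writer emits the lower-case tokens nan, inf and -inf for non-finite values.

```cpp
#include <cassert>
#include <cmath>
#include <iomanip>
#include <limits>
#include <locale>
#include <sstream>
#include <string>

// Hypothetical writer: fixed tokens for non-finite values, and
// max_digits10 significant digits for everything else.
inline std::string FormatDouble(double v) {
  if (std::isnan(v)) return "nan";
  if (std::isinf(v)) return v > 0 ? "inf" : "-inf";
  std::ostringstream os;
  os.imbue(std::locale::classic());  // locale-independent output
  os << std::setprecision(std::numeric_limits<double>::max_digits10) << v;
  return os.str();
}

// Hypothetical reader: recognises the same tokens, then falls back to a stream.
inline bool ParseDouble(const std::string& s, double* out) {
  assert(out != nullptr);
  if (s == "nan") { *out = std::numeric_limits<double>::quiet_NaN(); return true; }
  if (s == "inf") { *out = std::numeric_limits<double>::infinity(); return true; }
  if (s == "-inf") { *out = -std::numeric_limits<double>::infinity(); return true; }
  std::istringstream is(s);
  is.imbue(std::locale::classic());
  is >> *out;
  // Report failure instead of silently keeping a zero.
  return !is.fail() && is.eof();
}

// True when a value survives being written to text and read back.
inline bool RoundTrips(double v) {
  double back = 0.0;
  if (!ParseDouble(FormatDouble(v), &back)) return false;
  // NaN never compares equal to itself, so it needs its own check.
  return std::isnan(v) ? std::isnan(back) : back == v;
}
```

A real test would save and reload a whole model containing such values; the sketch only covers the per-value part of that check.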
I meant doing number parsing after the nan and inf checks. Your logic makes sense regarding IEEE-754, although I wouldn't recommend dropping that parsing at the end without adding said tests first; I'm not 100% sure fast_double_parser parses all numbers.
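The ordering proposed in this comment (non-finite checks first, stream parsing last) might look roughly like the sketch below. SlowPathAtof is a hypothetical name, the sketch assumes it would only run after fast_double_parser::parse_number has already failed, and it assumes Common::C_stringstream pins the stream to the classic "C" locale. Checking the stream state addresses the silent-failure concern raised above.

```cpp
#include <cassert>
#include <cmath>
#include <limits>
#include <locale>
#include <sstream>
#include <string>

// Hypothetical slow path, meant to run only after the fast parser failed.
inline bool SlowPathAtof(const std::string& str, double* out) {
  assert(out != nullptr);
  // 1) Non-finite tokens first (lower-case only, to keep the sketch short).
  if (str == "inf") { *out = std::numeric_limits<double>::infinity(); return true; }
  if (str == "-inf") { *out = -std::numeric_limits<double>::infinity(); return true; }
  if (str == "nan" || str == "-nan") {
    *out = std::numeric_limits<double>::quiet_NaN();
    return true;
  }
  // 2) Everything else goes through a stream in the classic "C" locale.
  std::istringstream ss(str);
  ss.imbue(std::locale::classic());
  double tmp = 0.0;
  ss >> tmp;
  // Reject a failed extraction as well as trailing characters, so the
  // caller can raise an error rather than continue with a bogus zero.
  if (ss.fail() || !ss.eof()) return false;
  *out = tmp;
  return true;
}
```

With this shape the caller decides what to do on a false return, for example raising a fatal error.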
  tmp = std::numeric_limits<double>::infinity();
else if (strlower == std::string("-inf"))
  tmp = -std::numeric_limits<double>::infinity();
else if (strlower == std::string("nan"))
Important: Missing -nan handling.
Probably best to halve the string comparisons by parsing the "-" sign first.
Ok thanks, I'll add -nan. I don't think it's worth obfuscating this rarely executed branch with optimisations until profiling shows it is a bottleneck.
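For reference, the sign-first variant suggested by the reviewer could be sketched as below. ParseSignedNonFinite is a hypothetical name and this is not the code that was merged; it only shows how stripping a leading "-" leaves two comparisons ("inf" and "nan") instead of four.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cmath>
#include <limits>
#include <string>

// Hypothetical helper: handles inf, -inf, nan and -nan in any letter case.
inline bool ParseSignedNonFinite(const std::string& str, double* out) {
  assert(out != nullptr);
  // Consume an optional leading minus sign once.
  const bool negative = !str.empty() && str[0] == '-';
  std::string body = str.substr(negative ? 1 : 0);
  std::transform(body.begin(), body.end(), body.begin(), [](unsigned char c) {
    return static_cast<char>(std::tolower(c));
  });
  double magnitude = 0.0;
  if (body == "inf") {
    magnitude = std::numeric_limits<double>::infinity();
  } else if (body == "nan") {
    magnitude = std::numeric_limits<double>::quiet_NaN();
  } else {
    return false;  // not a non-finite token; the caller decides how to fail
  }
  *out = negative ? -magnitude : magnitude;
  return true;
}
```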
ss << str;
ss >> tmp;
std::string strlower(str);
std::transform(strlower.begin(), strlower.end(), strlower.begin(), [](int c) -> char { return static_cast<char>(::tolower(c)); });
Great clean code @mjmckp ;)
Instead of allocating a string, and since you already have a lambda, what about defining a case-insensitive comparison lambda and using std::equal to check the "inf" and "nan" values below?
Although this is the rare branch there might be longer strings than inf or nan which might be parsed here and might slow down our parsing without need.
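The allocation-free comparison suggested here might be sketched as follows. EqualsIgnoreCase is a hypothetical name, and the sketch assumes the token being matched is already lower-case; the length check gives a cheap early exit for the longer strings mentioned above.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstring>

// Hypothetical helper: true when `str` equals `lower_token` ignoring case.
// `lower_token` must already be lower-case. No std::string is allocated.
inline bool EqualsIgnoreCase(const char* str, const char* lower_token) {
  assert(str != nullptr && lower_token != nullptr);
  const std::size_t n = std::strlen(str);
  if (n != std::strlen(lower_token)) return false;  // lengths differ: no match
  return std::equal(str, str + n, lower_token, [](char a, char b) {
    return std::tolower(static_cast<unsigned char>(a)) ==
           static_cast<unsigned char>(b);
  });
}
```

A caller would then test EqualsIgnoreCase(str, "inf"), EqualsIgnoreCase(str, "-inf") and so on, in place of building a lower-cased copy.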
Again, this hardly seems worth it, this branch is rarely invoked, meanwhile a colossal amount of strings are being allocated in splitting and parsing the input file, so these few extra allocations are a drop in the ocean.
I think our time is better spent adding robust round-trip tests to ensure major bugs like this don't occur again...
> I think our time is better spent adding robust round-trip tests to ensure major bugs like this don't occur again...
Agreed. Will you add such tests?
I actually run such tests, but on an external lgbm provider, and my model didn't contain nan or inf; I would prefer to see them in lgbm's CI so any breakage is detected immediately.
I could assist in adding the tests, any idea where these should go and how best to implement them?
Sure, would you mind pointing me towards a similar kind of test that I can use as a starting point please?
Unfortunately, we don't have any tests yet. This is something that we should concentrate on in the near future. For now, I think you can take a look at tests from @AlbertoEAF in #3997.
https://github.com/microsoft/LightGBM/pull/3997/files#diff-c363eba6eda99d9e560f8341a1fc8fe02e885d2256db2482e1c543430a25666d
@mjmckp I've merged this PR with the aim to not delay the upcoming release. Please feel free to add tests in a new PR. We'll be very grateful! And thanks a lot for the bug fix!
@StrikerRUS Thanks a lot, I'll get up to speed on how the new tests work and add some tests for this in a new PR soon.
Remove trailing whitespace to pass linting tests Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
@shiyu1994 @btrotta @guolinke Could you please review this to include in the upcoming release?
LGTM
This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.
Proposed solution for #3941. The current implementation silently returns zero when the input string is "inf", "-inf", or "nan" when compiled with VS2017, so instead explicitly check for these values and fail if there is no match. No attempt is made to optimise string allocations in this implementation, since this branch is rarely invoked.
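The explicit check described in this summary can be sketched as below. This is a minimal illustration modelled on the diff fragments quoted in the review, not the merged code: ParseNonFinite is a hypothetical name, and returning false stands in for whatever fatal-error mechanism the caller uses.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cmath>
#include <limits>
#include <string>

// Hypothetical helper: maps "inf", "-inf", "nan" and "-nan" (any case) to a
// double. Returns false for anything else, so the caller can fail loudly
// instead of silently continuing with zero.
inline bool ParseNonFinite(const std::string& str, double* out) {
  assert(out != nullptr);
  // Lower-case copy; one small allocation on a rarely taken branch.
  std::string lower(str);
  std::transform(lower.begin(), lower.end(), lower.begin(), [](unsigned char c) {
    return static_cast<char>(std::tolower(c));
  });
  if (lower == "inf") {
    *out = std::numeric_limits<double>::infinity();
  } else if (lower == "-inf") {
    *out = -std::numeric_limits<double>::infinity();
  } else if (lower == "nan") {
    *out = std::numeric_limits<double>::quiet_NaN();
  } else if (lower == "-nan") {
    *out = -std::numeric_limits<double>::quiet_NaN();
  } else {
    return false;
  }
  return true;
}
```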