-
Notifications
You must be signed in to change notification settings - Fork 344
Replace regex-based date/time parsing with manual parser #520
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Thanks! The results look really cool! I took a first glance at the code (nice!), but will need a bit longer to go through it in more detail. |
@rbock |
6567a9e
to
5667c6c
Compare
I ran |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I read the code in more detail. LGTM with small comments.
Can you add unit tests for the new functions?
5667c6c
to
0895eef
Compare
test_date_time_parser_2.cpp.txt I am attaching the new version of the test program which shows that the new code is still marginally faster than the old manual parser by 1-2% in debug mode and 3x-8x times faster in release mode. Tomorrow I will add the tests for the new functions and then I guess the PR will be ready for the final review. |
c51dd01
to
848151e
Compare
@rbock I also added tests for the date/time parser functions
I see that the date/time parser tests fail on the server with Visual C++. I will investigate it later today, will provide a fix and then will update the PR. |
There are times when a database might consider a date to be valid when it is actually not a valid calendar date. |
@Erroneous1
For the first case (values which cannot be mapped to a valid time point or duration): There are essentially two possible approaches to handling invalid date/time values:
Regarding the second type of invalid values, i.e. database engine-specific valid values, I think that it makes sense to adjust the parser code and handle all or at least most of them. Ideally we should be able to handle all database-specific timestamps as long as they fit into a |
The problem is actually not a bug in Visual C++, but an integer overflow in an variable of type
|
848151e
to
1bfd858
Compare
@rbock |
@Erroneous1 has a good point, though. Databases (and the respective applications) might consider dates valid that the current checks would not. I once worked in a project that considered the 31st of February valid (made some math easier). Thanks for the reminder! The original implementation accepted everything as long as it contained digits in the right places. No checks beyond that. If it did not find digits in the right places, it yielded a default constructed date/date_time. IIRC, that approach did not cause any problems in almost 10 years. We should therefore remove the checks for validity of dates and times, IMO. |
1bfd858
to
af76525
Compare
@rbock |
Thanks for the updates! I think we are good to go at this point. If you agree, I would merge probably tomorrow. |
@rbock Regarding the merge, I am OK with it. If I remember correctly, the new manual parser behaves almost identically to the old manual parser (the pre-regex one), the difference being that when parsing dates/times the old parser required the parsed string to have a valid prefix representing a date/time and then used that prefix for the resulting time_point or duration. That is any trailing garbage was ignored. On the other hand the new parser requires an exact match, i.e. any trailing garbage will cause the parse to fail and as a result will be replaced by the zero date/time. I think that the new behavior is correct, because any unrecognized 'garbage' can change significantly the interpretation of the resulting value. For example PostgreSQL uses the In my opinion sooner or later we should add support for parsing BC dates and other year ranges outside of 0000-9999, but for now anything falling outside of this year range should just be treated as |
Done! Thanks for your diligence! Agreed, it makes sense to add support for an extended date range. |
This PR replaces the regex date/time parsing code with a manually written parser which provides a speed-up of the parsing code. The speed-up is about 20 times when built without any optimizations and about 120 times when built with all optimizations.
While investigating this bug #519 I noticed that the bug shows up about twice more often with the regex-based date/time parser than with the old date/time parsing code. It was a hint that the regex-based code is noticeably slower than the old manual parser.
So I decided to write a test program that compares three different parser implementations:
The test that I wrote runs a loop 500,000 (half a million) times. Inside each loop it parses a date+time string (full timestamp) and a date string (YYYY-MM-DD). For each parser implementation the time of the loop is measured and then written to the standard output.
I am providing the test results for three compilers, g++, clang++, visual c++. For each compiler the debug (non-optimized) and release (full optimizations) builds are measured separately. All result durations are in seconds.
The most significant findings from these measurements are:
std::atoi
in the old manual parser. Most likely the custom code inlines much better thanstd::atoi
.However it is worth noting that even though the regex parser is much slower than the manual parser this does not lead to the same overall slowdown of the code that used sqlpp to read date/time values from the database. For example the mysql date/time test executable was running at about 10% slower when using the regex parser versus the old manual parser. Obviously the reason for this small overall delay is that not all time is spent parsing date/time strings like in the test program.
Still I think that this difference in performance makes it worthwhile to replace the regex parser with the new manual parser.
I am also attaching the test program that I wrote in order to test the speed of the different date/time parsing implementations. The program needs Howard Hinnant's date/time library in order to compile.
test_date_time_parser.cpp.txt