Replace regex-based date/time parsing with manual parser #520


Merged: 2 commits merged into rbock:main on Sep 7, 2023

Conversation

MeanSquaredError
Contributor

@MeanSquaredError MeanSquaredError commented Aug 27, 2023

This PR replaces the regex-based date/time parsing code with a manually written parser, which speeds up parsing considerably: about 20 times when built without optimizations and about 120 times when built with full optimizations.

While investigating bug #519 I noticed that it shows up about twice as often with the regex-based date/time parser as with the old date/time parsing code. This was a hint that the regex-based code is noticeably slower than the old manual parser.

So I decided to write a test program that compares three different parser implementations:

  • The old manual parser, which was in use before the regex-based parser was introduced. It works well, but it does not support parsing the timezone offsets used by PostgreSQL, and each connector had its own copy of the parsing code.
  • The current regex-based parser. It also works well and additionally supports parsing timezone offsets. However, it is significantly slower than the old manual parser.
  • The new manual parser, which I wrote for the test. It supports parsing timezone offsets and surprisingly turned out to be not only about 120 times faster than the regex parser but also 3-10 times faster than the old manual parser.

The test that I wrote runs a loop 500,000 (half a million) times. In each iteration it parses a date+time string (a full timestamp) and a date string (YYYY-MM-DD). For each parser implementation the loop is timed and the duration written to standard output.

I am providing the test results for three compilers: g++, clang++, and Visual C++. For each compiler the debug (non-optimized) and release (fully optimized) builds are measured separately. All durations are in seconds.

g++ (GCC) 13.1.1 20230614, non-optimized build (-O0)
Old manual parser duration: 0.356939
Regex parser duration: 7.57215
New manual parser duration: 0.341122

g++ (GCC) 13.1.1 20230614, optimized build (-O3)
Old manual parser duration: 0.0873297
Regex parser duration: 0.923286
New manual parser duration: 0.00733141

clang++ 16.0.6, non-optimized build (-O0)
Old manual parser duration: 0.375026
Regex parser duration: 6.09965
New manual parser duration: 0.352535

clang++ 16.0.6, optimized build (-O3)
Old manual parser duration: 0.0886557
Regex parser duration: 0.944003
New manual parser duration: 0.0189532

Visual C++ 2019, debug build
Old manual parser duration: 2.20671
Regex parser duration: 301.234
New manual parser duration: 1.80888

Visual C++ 2019, release build
Old manual parser duration: 0.217817
Regex parser duration: 7.97533
New manual parser duration: 0.074008

The most significant findings from these measurements are:

  • Despite my initial expectations, the regex parser is slow: about 20 times slower than the manual parsers in debug builds and 120+ times slower in release builds.
  • Surprisingly, the new manual parser is noticeably faster than the old manual parser in release builds: about 10 times faster with g++, 4 times with clang++, and about 3 times with Visual C++. I think the difference is mostly due to the custom string-to-integer conversion code in the new parser versus the use of std::atoi in the old one. Most likely the custom code inlines much better than std::atoi.
  • Visual C++'s regex library seems to be really slow (over 300 seconds for the regex implementation vs. about 2 seconds for the manual parsers in the debug build).

It is worth noting, however, that even though the regex parser is much slower than the manual parsers, this does not translate into the same overall slowdown for code that uses sqlpp to read date/time values from the database. For example, the mysql date/time test executable ran only about 10% slower with the regex parser than with the old manual parser. The reason for this small overall difference is obviously that, unlike the test program, real code does not spend all of its time parsing date/time strings.

Still, I think this difference in performance makes it worthwhile to replace the regex parser with the new manual parser.

I am also attaching the test program that I wrote to compare the speed of the different date/time parsing implementations. It needs Howard Hinnant's date/time library to compile.
test_date_time_parser.cpp.txt

@rbock
Owner

rbock commented Aug 28, 2023

Thanks! The results look really cool!

I took a first glance at the code (nice!), but will need a bit longer to go through it in more detail.

@MeanSquaredError
Contributor Author

@rbock
Sure, any suggestions on how to make the code cleaner or improve it in any other way are welcome!

@MeanSquaredError
Contributor Author

I ran include/sqlpp11/detail/parse_date_time.h through clang-format and pushed the changes.

Owner

@rbock rbock left a comment


I read the code in more detail. LGTM with small comments.

Can you add unit tests for the new functions?

@MeanSquaredError
Contributor Author

MeanSquaredError commented Sep 2, 2023

test_date_time_parser_2.cpp.txt
@rbock
I fixed the issues that you mentioned, but I also added a call to a custom function that checks the validity of the year/month/day. I explained above why I think it makes sense to replace the call to ymd.ok() with a call to this function, so please review the new code too.

I am attaching the new version of the test program, which shows that the new code is still faster than the old manual parser: marginally (1-2%) in debug mode and 3-8 times in release mode.

Tomorrow I will add the tests for the new functions and then I guess the PR will be ready for the final review.

@MeanSquaredError MeanSquaredError force-pushed the speed_up_date_time branch 2 times, most recently from c51dd01 to 848151e Compare September 3, 2023 14:26
@MeanSquaredError
Contributor Author

MeanSquaredError commented Sep 3, 2023

@rbock
I updated the parser code by adding sanity checks for the parsed time of day: 0 <= hour <= 23, 0 <= minute <= 59, 0 <= second <= 59. I also added checks for the timezone offsets: 0 <= minute <= 59, 0 <= second <= 59. I did not add a check for the timezone hours because ISO 8601 does not clearly specify the range of the tz hour offset.

I also added tests for the date/time parser functions and all tests pass successfully.

So I think that the PR is ready for a review.

I see that the date/time parser tests fail on the build server with Visual C++. I will investigate later today, provide a fix, and update the PR.

@Erroneous1
Contributor

There are times when a database might consider a date valid even though it is not a valid calendar date; 0000-00-00 is probably the most common example. Some databases might have different minimum and maximum values as well. It would make sense to me for sqlpp11's date parsing to check only the most basic requirements and to let either the user check for .ok() or the connector throw an error, since it will be checking as well. I believe Howard made date validity a manual check because it is a little expensive: you have to check the day of month, which differs for each month, and February changes based on the year you are checking.

@MeanSquaredError
Contributor Author

@Erroneous1
There are essentially two types of invalid values.

  • Values which cannot be sensibly or uniquely mapped to a valid time point or duration.
  • Values which can be mapped sensibly and uniquely to a valid time point or duration, but which our code does not currently handle correctly, e.g. because they use a non-ISO 8601-compliant format specific to some database engine, such as PostgreSQL's negative dates with the BC suffix.

For the first case (values which cannot be mapped to a valid time point or duration):
Date/time parsing is done in the context of batch operations, i.e. reading and converting multiple rows and/or multiple date/time fields. Our conversion functions always return types that are essentially std::chrono::time_point or std::chrono::duration; we don't support returning strings, and strings don't really make sense in the context of batch operations.

There are essentially two possible approaches to handling invalid date/time values:

  • Throw/return an error for the whole batch operation. I think this is not a very good approach.
  • Map the invalid value to some special value of time_point or duration. For the mapping we could call a user-defined callback or use user-supplied flags that explain how to handle these special values. I am not sure it makes sense to complicate things with callbacks or special flags; maybe it is better just to use custom queries to receive the special values as text and parse them outside of sqlpp. However, if someone provides a PR which allows user-defined handling of these special values in sqlpp in a clean and simple way, I am sure the sqlpp author will review and possibly include it.

Regarding the second type of invalid values, i.e. database engine-specific formats of valid values, I think it makes sense to adjust the parser code to handle all, or at least most, of them. Ideally we should be able to handle all database-specific timestamps as long as they fit into a sqlpp::chrono::day_point or sqlpp::chrono::microsecond_point. There is also the problem of compiler-specific bugs with large positive or negative time points, but that is a different issue to tackle.

@MeanSquaredError
Contributor Author

MeanSquaredError commented Sep 4, 2023

@rbock

The problem is actually not a bug in Visual C++, but an integer overflow in a variable of type std::chrono::minutes. In Visual C++ the representation type of std::chrono::minutes is a 32-bit int even in 64-bit mode (libstdc++ uses a 64-bit representation), so it is really easy for std::chrono::minutes to overflow.

At first the test failure looked like a Visual C++-specific bug in operator+ when adding time points and durations. The following small program works correctly with g++ and fails with Visual C++.

#include <chrono>
#include <iostream>

using namespace std::chrono;
using microsecond_point = std::chrono::time_point<std::chrono::system_clock, std::chrono::microseconds>;

int main()
{
	sys_days days {year{9999}/1/1};
	microsecond_point tp_1 {days};
	// days + minutes{0} yields std::chrono::minutes; with a 32-bit
	// representation the minute count to year 9999 overflows.
	microsecond_point tp_2 {days + minutes{0}};

	std::cout << "tp_1 count: " << tp_1.time_since_epoch().count() << "\n";
	std::cout << "tp_1 text: " << tp_1 << "\n";
	std::cout << "tp_2 count: " << tp_2.time_since_epoch().count() << "\n";
	std::cout << "tp_2 text: " << tp_2 << "\n";
	return 0;
}

In g++ tp_1 == tp_2, but in Visual C++ tp_1 != tp_2. The behavior shows up in VC++ versions from 2015 to 2022 (including the latest one with updates from 2023). As noted above, though, the cause is the 32-bit representation of std::chrono::minutes rather than a compiler bug, so there is nothing to file a bug report about.

The date/time parsing routines themselves don't seem to be affected, i.e. the problem is in the test code, so as a workaround I will hardcode the timestamp as a direct integer value in the test code instead of computing it at runtime.

@MeanSquaredError
Contributor Author

@rbock
I fixed the tests so now they pass correctly in Visual C++ too.
So I think that now the PR is ready for a review.

@rbock
Owner

rbock commented Sep 5, 2023

@Erroneous1 has a good point, though. Databases (and the respective applications) might consider dates valid that the current checks would not. I once worked in a project that considered the 31st of February valid (made some math easier). Thanks for the reminder!

The original implementation accepted everything as long as it contained digits in the right places. No checks beyond that. If it did not find digits in the right places, it yielded a default constructed date/date_time. IIRC, that approach did not cause any problems in almost 10 years.

We should therefore remove the checks for validity of dates and times, IMO.

@MeanSquaredError
Contributor Author

@rbock
OK, I removed the checks and updated the tests. All tests passed successfully.

@rbock
Owner

rbock commented Sep 6, 2023

Thanks for the updates! I think we are good to go at this point.

If you agree, I would merge probably tomorrow.

@MeanSquaredError
Contributor Author

MeanSquaredError commented Sep 6, 2023

@rbock
Later today I will prepare a simple benchmark program measuring the speeds of ymd.ok() and is_valid_ymd() and will post it along with a question here https://github.com/HowardHinnant/date/issues

Regarding the merge, I am OK with it.

If I remember correctly, the new manual parser behaves almost identically to the old manual parser (the pre-regex one). The difference is that when parsing dates/times the old parser only required the parsed string to start with a valid prefix representing a date/time and used that prefix for the resulting time_point or duration; any trailing garbage was ignored.

The new parser, on the other hand, requires an exact match: any trailing garbage causes the parse to fail, and the value is replaced by the zero date/time.

I think the new behavior is correct, because unrecognized 'garbage' can significantly change the interpretation of the resulting value. For example, PostgreSQL uses the BC suffix for dates before AD 0001-01-01, so the old parser would incorrectly accept 2000-01-01 BC as AD 2000-01-01, while the new parser fails to parse it and replaces it with day_point{}.

In my opinion, sooner or later we should add support for parsing BC dates and other years outside of 0000-9999, but for now anything falling outside this range should just be treated as day_point{}. Any other interpretation may produce completely misleading results, as with the BC dates.

@rbock rbock merged commit 25bca54 into rbock:main Sep 7, 2023
@rbock
Owner

rbock commented Sep 7, 2023

Done! Thanks for your diligence!

Agreed, it makes sense to add support for an extended date range.

@MeanSquaredError MeanSquaredError deleted the speed_up_date_time branch September 7, 2023 06:11