Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parser improvements #202

Merged
merged 5 commits into from
Aug 15, 2013
Merged

Parser improvements #202

merged 5 commits into from
Aug 15, 2013

Conversation

davispuh
Copy link
Collaborator

@davispuh davispuh commented Aug 5, 2013

Current implementation have reached it's limits and it's very hard to add some specific new date/time format parsing, thus there's need for some refactoring. While general design/idea with Tokens and Tags are very good there's missing some steps in implementation and due that over years there have been made several hacks (especially in pre_normalize). It's not good to use so many regexp pre-processing if it all could be done with proper parser.
Also currently codebase is quite complicated in some areas and thus it's not easy to contribute. It requires lot of time to properly investigate how it does work.

So my idea is to implement proper parsing and refactor some code parts. It would make it much easier to add new formats for parsing. Some formats which are parsed with current implementation - they does work because of pure luck I would say and overall this library seems quite hackish...

For example I wondered why this date 2013-08-05T22:55:03+03:00 even worked, then I noticed that 05T22:55:03+03:00 is parsed as TimeZone and whole :text is passed to generic Time.parse. I think that's not good at all...

Ok, now more about my idea. It's quite simple, first we divide whole input string by smallest possible tokens (like numbers, separators (dash, colon, etc.)) then we group them together so they make Date, Time and TimeZone (I haven't decided yet what would be best way for this grouping) and only finally we handle those groups based on which combination is most likely correct.
Also I think there are way too many handlers and probably most are unneeded and kinda duplicate code as in end it doesn't matter if it's ScalarMonth or RepeaterMonthName because that's same thing. Basically handlers should just create date/time from already parsed numbers.

Now about this PR. These are first changes in refactoring about which I was talking. I'll send some few more PRs, but it will take some time and changes probably will be quite big, even this PR isn't small, but here I barely changed anything.
About changes in this PR.

  • Improved Handler.match so pattern array can contain arrays of tags. It means it will match any of them (simple or), eg. Chronic::Handler.new([[:separator_on, :separator_at], :scalar], :handler) will match on 2 or at 2
  • Replaced SeparatorSlashOrDash to separate SeparatorSlash and SeparatorDash
  • Added more types for separators: dot, colon, space, quote, T, W
  • Refactored Scalar and Ordinal classes so they're cleaner and no need to match \d all the time, can do it only once
  • Created Date and Time classes for Date/Time related functionality
  • Added Sign class for parsing - and +
  • Pass options as 2nd arg to any Repeater.new otherwise it wasn't possible for them to receive any options
  • Added :hours24 option to be able to force 24-hour clock (resolves Force 24h mode #201). When :hours24 is true then there's no ambiguous time and it will be always correct. If it's false then 03:00:00 will be treated as 12-hour clock, but when nil it's same behavior as currently was.
  • Added tests for Handler.match and :hours24

Some changes in this PR aren't really used yet such as so many Separators, but they'll be needed in future.

All tests does pass and everything is backwards compatible. But still should review carefully as I consider any date/time format for which there's no tests as unsupported (because I don't know about them). Also I think it's good time to release new version now as current state is somewhat okay.

@davispuh
Copy link
Collaborator Author

davispuh commented Aug 6, 2013

well, naming that's the hardest part :D
I don't know what would be best name. Here some more: :force24h, :force_hours24, :clock24h, :force_24h_clock, :time24h, :time_clock24h

I'll commit changes when option name is chosen.

@davispuh
Copy link
Collaborator Author

davispuh commented Aug 9, 2013

any ideas about that option name?

@leejarvis
Copy link
Collaborator

No, I'm happy to leave it for now, we can consider changing it any time before a full 1.0 is released (some day). I'll merge this later. Thanks for the work!

Change SeparatorSlashOrDash to separate Slash and Dash
Add more separator types (Dot, Colon, Space, Quote, T, W)
Create Date and Time classes
Add Sign class
@davispuh
Copy link
Collaborator Author

Nice :)
there wasn't any conflicts, but I still rebased. So it should merge fine.

about next PR, I'm currently thinking how to properly implement timezone parsing. This current implementation doesn't actually parse it at all and just pass it to generic Time.parse, but I've successfully implemented all parsing so Time.parse is not needed at all with only exception being timezones. With my googling I didn't find way to convert timezone abbreviations/names to offsets (in Ruby it's hardcoded /ext/date/date_parse.c#L341 and not exposed outside) so I'm thinking to create separate gem which does that and add dependency on it. Also then it would be possible to offer support for timezone names in different languages.

leejarvis added a commit that referenced this pull request Aug 15, 2013
@leejarvis leejarvis merged commit 4e3081f into mojombo:master Aug 15, 2013
@leejarvis
Copy link
Collaborator

This is really great. Thanks a lot for the work! I pinged @mojombo on Twitter about giving you commit bit (or moving the repo to my account, which might make more sense in the long run), hopefully we can get that actioned at some point.

@davispuh
Copy link
Collaborator Author

I don't really need commit access. Just when I started reviewing this I noticed how poorly (IMHO) some things are implemented that I wanted to fix it once and for all :D So basically idea is that you would throw in any commonly used date/time format and it would parse it successfully.

@davispuh
Copy link
Collaborator Author

It appears that TZ database includes even abbreviations. And they can be accessed from TZInfo gem, only thing is that different abbreviations were used at different times so one can represent multiple offsets depending at which time and in which country you're in. I quickly generated all abbreviations to possible offsets gists/TimezoneAbbr_To_Offsets.yaml

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Force 24h mode
2 participants