parse-zoneinfo: replace rule parser with simple state machine #172
Impressive that you could write this in so little time!
I wonder if it wouldn't take less code to initialize `Rule` with default values and update its fields, instead of moving all fields along to the next variant in `RuleState`.
Do you want to convert the zone, continuation and link line parsers in the same PR?
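To make the suggestion concrete, here is a minimal sketch of the "default values, then update fields" approach. The `Rule` below is a simplified, hypothetical stand-in (all fields kept as string slices), not the struct from this PR, and `parse_rule` is an illustrative name.

```rust
// Hypothetical, simplified stand-in for the crate's `Rule`. Field names and
// types are assumptions for illustration only.
#[derive(Debug, Default, PartialEq)]
struct Rule<'a> {
    name: &'a str,
    from_year: &'a str,
    to_year: &'a str,
    month: &'a str,
    day: &'a str,
    time: &'a str,
    time_to_add: &'a str,
    letters: Option<&'a str>,
}

fn parse_rule(input: &str) -> Result<Rule<'_>, String> {
    let mut fields = input.split_ascii_whitespace();
    if fields.next() != Some("Rule") {
        return Err("not a Rule line".to_string());
    }

    // Start from default values and fill in one field per column, instead of
    // carrying every parsed field along through state-machine variants.
    let mut rule = Rule::default();
    let mut seen = 0;
    for (index, part) in fields.enumerate() {
        match index {
            0 => rule.name = part,
            1 => rule.from_year = part,
            2 => rule.to_year = part,
            3 => {} // the unused "-" column
            4 => rule.month = part,
            5 => rule.day = part,
            6 => rule.time = part,
            7 => rule.time_to_add = part,
            8 => rule.letters = if part == "-" { None } else { Some(part) },
            _ => return Err(format!("unexpected extra field: {}", part)),
        }
        seen = index + 1;
    }
    if seen < 9 {
        return Err(format!("expected 9 fields after `Rule`, found {}", seen));
    }
    Ok(rule)
}
```

The trade-off is that a half-filled `Rule` exists while parsing, so the final field count check carries the correctness guarantee that the state machine otherwise encodes in its types.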
use parse_zoneinfo::line::{Line, LineParser};
use parse_zoneinfo::FILES;

#[ignore]
I added `#[ignore]` here because this test will fail every time we update the tz data. Not sure how big of a pain in the ass that will be to update? With the `cargo-insta` tooling it is pretty easy, so we might decide to just include this.
So the package test fails because I've made chrono-tz-build depend on the
I'll have a look tomorrow (also on the other PR).
I have not reached the end yet 😄.
It seems at some point (2017c?) `zic` became case-insensitive for things like `Rule`, `Zone`, `Link`, weekdays, month names, `last`. Something we should eventually support?
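If we do want to support that, a minimal sketch could compare keywords with `str::eq_ignore_ascii_case` rather than `==`. The `LineKind` enum and both function names are hypothetical and only illustrate the idea.

```rust
// Hypothetical enum for the kind of line; not a type from this PR.
#[derive(Debug, PartialEq)]
enum LineKind {
    Rule,
    Zone,
    Link,
}

/// Classifies the first field of a line, ignoring ASCII case.
fn line_kind(keyword: &str) -> Option<LineKind> {
    if keyword.eq_ignore_ascii_case("Rule") {
        Some(LineKind::Rule)
    } else if keyword.eq_ignore_ascii_case("Zone") {
        Some(LineKind::Zone)
    } else if keyword.eq_ignore_ascii_case("Link") {
        Some(LineKind::Link)
    } else {
        None
    }
}

/// Strips a leading `last` in any ASCII case and returns the remainder.
fn strip_last_ignore_case(input: &str) -> Option<&str> {
    // `get` returns `None` if the input is shorter than four bytes or if
    // byte 4 is not a character boundary, so this never panics.
    let prefix = input.get(..4)?;
    if prefix.eq_ignore_ascii_case("last") {
        input.get(4..)
    } else {
        None
    }
}
```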
@@ -17,7 +17,7 @@ case-insensitive = ["uncased", "phf/uncased"]
regex = ["dep:regex"]

[dependencies]
parse-zoneinfo = { version = "0.3" }
parse-zoneinfo = { version = "0.3", path = "../parse-zoneinfo" }
I was thinking to make these changes in my next PR 👍.
@@ -38,3 +38,15 @@ pub mod line;
pub mod structure;
pub mod table;
pub mod transitions;

pub const FILES: &[&str] = &[
I am not sure we want to hardcode this list in parse-zoneinfo.
For my personal experiments the past year I removed `backward`, included `backzone`, occasionally included `factory`, and filtered parts of `etcetera`.
Maybe move the change out of this PR so we can discuss it separately?
if input.chars().all(|c| c.is_ascii_digit()) {
    return Ok(DaySpec::Ordinal(input.parse().unwrap()));
}
// Check if it stars with ‘last’, and trim off the first four bytes if
// Check if it stars with ‘last’, and trim off the first four bytes if
// Check if it starts with ‘last’, and trim off the first four bytes if
    return Ok(DaySpec::Ordinal(input.parse().unwrap()));
}
// Check if it stars with ‘last’, and trim off the first four bytes if
// it does. (Luckily, the file is ASCII, so ‘last’ is four bytes)
We don't care about ASCII with `strip_prefix`, right? This seems to be an old comment.
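For reference, a small sketch of why the ASCII remark is no longer needed: `str::strip_prefix` returns the remainder after the matched prefix, so no byte offset is computed by hand. The function name is hypothetical.

```rust
/// Returns the weekday part of a `lastSun`-style day spec, if present.
fn weekday_after_last(input: &str) -> Option<&str> {
    // `strip_prefix` slices at the end of the matched prefix, so this is
    // correct regardless of how many bytes the prefix occupies.
    input.strip_prefix("last")
}

/// Same idea with a non-ASCII prefix, to show that no byte counting is needed.
fn after_quoted_last(input: &str) -> Option<&str> {
    input.strip_prefix("‘last’")
}
```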
    return Ok(DaySpec::Last(weekday));
}

let weekday = match input.get(..3) {
Cool, didn't know this method!
`zic.c` has the following comment for parsing a day column:
/*
** Day work.
** Accept things such as:
** 1
** lastSunday
** last-Sunday (undocumented; warn about this)
** Sun<=20
** Sun>=7
*/
I think we should support parsing full weekday names like `zic` does, as we did with the regex, but maybe skip the `last-{weekday}` case.
Can you add a test for `DaySpec::from_str`?
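A rough sketch of what that could look like, with tests in mind. The types here are simplified stand-ins written for this comment: `DaySpec::Ordinal` and `DaySpec::Last` appear in the diff, while the other two variant names, the `String` error type and the "at least three letters, prefix of the full name" rule are assumptions.

```rust
use std::str::FromStr;

#[derive(Debug, Clone, Copy, PartialEq)]
enum Weekday {
    Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday,
}

impl FromStr for Weekday {
    type Err = String;

    fn from_str(input: &str) -> Result<Self, String> {
        const NAMES: [(&str, Weekday); 7] = [
            ("Monday", Weekday::Monday),
            ("Tuesday", Weekday::Tuesday),
            ("Wednesday", Weekday::Wednesday),
            ("Thursday", Weekday::Thursday),
            ("Friday", Weekday::Friday),
            ("Saturday", Weekday::Saturday),
            ("Sunday", Weekday::Sunday),
        ];
        // Assumption: accept any prefix of the full name that is at least
        // three letters long, so both `Sun` and `Sunday` parse.
        if input.len() >= 3 {
            for &(name, weekday) in NAMES.iter() {
                if name.starts_with(input) {
                    return Ok(weekday);
                }
            }
        }
        Err(format!("invalid weekday: {}", input))
    }
}

#[derive(Debug, PartialEq)]
enum DaySpec {
    Ordinal(i8),
    Last(Weekday),
    LastOnOrBefore(Weekday, i8),
    FirstOnOrAfter(Weekday, i8),
}

impl FromStr for DaySpec {
    type Err = String;

    fn from_str(input: &str) -> Result<Self, String> {
        // A plain day of the month, such as `1`.
        if !input.is_empty() && input.bytes().all(|b| b.is_ascii_digit()) {
            return input
                .parse::<i8>()
                .map(DaySpec::Ordinal)
                .map_err(|e| e.to_string());
        }
        // `lastSunday`; `last-Sunday` is rejected because `-Sunday` is not
        // a weekday name.
        if let Some(name) = input.strip_prefix("last") {
            return name.parse::<Weekday>().map(DaySpec::Last);
        }
        // `Sun<=20`
        if let Some((name, day)) = input.split_once("<=") {
            let day = day.parse::<i8>().map_err(|e| e.to_string())?;
            return Ok(DaySpec::LastOnOrBefore(name.parse::<Weekday>()?, day));
        }
        // `Sun>=7`
        if let Some((name, day)) = input.split_once(">=") {
            let day = day.parse::<i8>().map_err(|e| e.to_string())?;
            return Ok(DaySpec::FirstOnOrAfter(name.parse::<Weekday>()?, day));
        }
        Err(format!("invalid day spec: {}", input))
    }
}
```

A test could then walk through the forms listed in the `zic.c` comment, for example asserting that `"lastSunday".parse::<DaySpec>()` gives `Ok(DaySpec::Last(Weekday::Sunday))` and that `"last-Sunday"` is an error.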
impl FromStr for TimeSpecAndType {
    type Err = Error;

    fn from_str(input: &str) -> Result<Self, Error> {
Can you please split this method over the `TimeSpec` and `TimeSpecAndType` types? I am not sure yet whether anything but wall times is allowed on zone lines, or whether the existing code took a shortcut there that we want to fix.
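Something along these lines is what I have in mind. This is only a sketch: the types are simplified stand-ins (a `TimeSpec` holding total seconds, a `String` error), not the ones in this crate. The suffix letters `w`, `s`, `u`, `g` and `z` are the ones `zic` documents for the AT column.

```rust
use std::str::FromStr;

// Simplified stand-in: the time of day as a signed number of seconds.
#[derive(Debug, PartialEq)]
struct TimeSpec(i64);

#[derive(Debug, PartialEq)]
enum TimeType {
    Wall,
    Standard,
    Utc,
}

#[derive(Debug, PartialEq)]
struct TimeSpecAndType(TimeSpec, TimeType);

impl FromStr for TimeSpec {
    type Err = String;

    /// Parses `-`, `h`, `h:mm` or `h:mm:ss`, with no type suffix.
    fn from_str(input: &str) -> Result<Self, String> {
        if input == "-" {
            return Ok(TimeSpec(0));
        }
        let (negative, digits) = match input.strip_prefix('-') {
            Some(rest) => (true, rest),
            None => (false, input),
        };
        let mut total = 0i64;
        for (index, field) in digits.split(':').enumerate() {
            let scale = match index {
                0 => 3600,
                1 => 60,
                2 => 1,
                _ => return Err(format!("too many fields in time: {}", input)),
            };
            let value = field
                .parse::<i64>()
                .map_err(|_| format!("invalid time: {}", input))?;
            total += value * scale;
        }
        Ok(TimeSpec(if negative { -total } else { total }))
    }
}

impl FromStr for TimeSpecAndType {
    type Err = String;

    /// Splits off an optional type suffix and delegates the rest to `TimeSpec`.
    fn from_str(input: &str) -> Result<Self, String> {
        // The suffix letters are ASCII, so slicing off one byte is safe
        // whenever one of them matches.
        let (spec, time_type) = match input.chars().last() {
            Some('w') => (&input[..input.len() - 1], TimeType::Wall),
            Some('s') => (&input[..input.len() - 1], TimeType::Standard),
            Some('u') | Some('g') | Some('z') => (&input[..input.len() - 1], TimeType::Utc),
            _ => (input, TimeType::Wall),
        };
        Ok(TimeSpecAndType(spec.parse::<TimeSpec>()?, time_type))
    }
}
```

With this split, the zone line parser can pick whichever of the two types turns out to be correct, and a suffixed time such as `2:00s` is rejected by `TimeSpec` alone.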
    from_year,
    to_year,
},
"-" | "\u{2010}",
Can you please add back the comment?
impl<'a> Rule<'a> {
    fn from_str(input: &'a str) -> Result<Self, Error> {
        let mut state = RuleState::Start;
        for part in input.split_ascii_whitespace() {
This no longer parses a rule with a comment?
`zic.c` has a `getfields` method (line 3722) that returns when it encounters a comment sign `#`.
It also supports quotation marks `"` surrounding each field, within which whitespace and `#` are allowed. Maybe we should make an iterator that works similarly instead of using `split_ascii_whitespace`?
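A minimal sketch of such a field splitter, written from the description above rather than ported from `zic.c`. The function name is hypothetical, and it collects into a `Vec<String>` for brevity where a real version would probably be a borrowing iterator.

```rust
/// Splits one line into fields. Fields are separated by ASCII whitespace, a
/// `#` outside quotation marks ends the line, and inside `"..."` both
/// whitespace and `#` are kept as part of the field.
fn split_fields(line: &str) -> Result<Vec<String>, String> {
    let mut fields = Vec::new();
    let mut chars = line.chars().peekable();

    loop {
        // Skip the whitespace between fields.
        while chars.peek().map_or(false, |c| c.is_ascii_whitespace()) {
            chars.next();
        }
        // Stop at the end of the line or at the start of a comment.
        match chars.peek().copied() {
            None | Some('#') => break,
            Some(_) => {}
        }

        let mut field = String::new();
        let mut in_quotes = false;
        while let Some(c) = chars.peek().copied() {
            if c == '"' {
                // Quotation marks toggle quoting and are not part of the field.
                in_quotes = !in_quotes;
                chars.next();
                continue;
            }
            if !in_quotes && (c.is_ascii_whitespace() || c == '#') {
                // The `#` is left in place so that the outer loop sees it.
                break;
            }
            field.push(c);
            chars.next();
        }
        if in_quotes {
            return Err("unterminated quotation mark".to_string());
        }
        fields.push(field);
    }
    Ok(fields)
}
```

Because the inner loop stops at an unquoted `#`, this also covers a comment that follows a field with no whitespace in between, for example `Link a b#c`.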
let mut state = ZoneInfoState::Start;
for part in iter {
    state = match (state, part) {
        (st, _) if part.starts_with('#') => {
In theory a comment is allowed to come straight after a field, without whitespace in between.
@@ -13,3 +13,6 @@ keywords = ["date", "time", "timezone", "zone", "calendar"]
version = "1.3.1"
default-features = false
features = ["std", "unicode-perl"]

[dev-dependencies]
insta = "1.38"
I understand why you added this test. Not sure about it though.
Would it be better to add this test as a separate crate in the workspace?
The raw diffstat of +4957/-371 doesn't look so attractive, but new tests (and accompanying data) account for about 4400 of the added lines, so all in all this doesn't add much more code than it deletes. The `benchmark` example suggests it is about 10x faster, and it drops a pretty big dependency.