Skip to content

Commit

Permalink
fmt/strtime: support non-ASCII in strftime format strings
Browse files Browse the repository at this point in the history
I'm not sure why I didn't do this initially. I think I thought there
wasn't a use case. But it's easy enough to support.

This would have probably been easier if we accepted a `&str` for the
format string, especially since we require valid UTF-8 (which is a
strict relaxation from requiring valid ASCII). In that case, we would
just iterate over `chars()`. But, that would require the caller to
create the `&str` (thus paying for UTF-8 validation) and would require
`strftime` to do UTF-8 decoding. This way, we don't require any of that
up-front UTF-8 validation, and in the vast majority of cases
(ASCII-only), we just handle things one-byte-at-a-time without caring
about UTF-8.

Fixes #155
  • Loading branch information
BurntSushi committed Jan 1, 2025
1 parent 9962d03 commit 1ecb1ee
Show file tree
Hide file tree
Showing 3 changed files with 63 additions and 9 deletions.
5 changes: 5 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -9,6 +9,11 @@ Enhancements:
* [#130](https://github.com/BurntSushi/jiff/issues/130):
Document value ranges for methods like `year`, `day`, `hour` and so on.

Bug fixes:

* [#155](https://github.com/BurntSushi/jiff/issues/155):
Relax `strftime` format strings from ASCII-only to all of UTF-8.


0.1.18 (2024-12-31)
===================
Expand Down
45 changes: 36 additions & 9 deletions src/fmt/strtime/format.rs
Original file line number Diff line number Diff line change
Expand Up @@ -9,7 +9,7 @@ use crate::{
Write, WriteExt,
},
tz::Offset,
util::escape,
util::{escape, utf8},
Error,
};

Expand All @@ -23,15 +23,13 @@ impl<'f, 't, 'w, W: Write> Formatter<'f, 't, 'w, W> {
pub(super) fn format(&mut self) -> Result<(), Error> {
while !self.fmt.is_empty() {
if self.f() != b'%' {
if !self.f().is_ascii() {
return Err(err!(
"found non-ASCII byte {byte:?} in format \
string (ASCII is currently required)",
byte = escape::Byte(self.f()),
));
if self.f().is_ascii() {
self.wtr.write_char(char::from(self.f()))?;
self.bump_fmt();
} else {
let ch = self.utf8_decode_and_bump()?;
self.wtr.write_char(ch)?;
}
self.wtr.write_char(char::from(self.f()))?;
self.bump_fmt();
continue;
}
if !self.bump_fmt() {
Expand Down Expand Up @@ -141,6 +139,35 @@ impl<'f, 't, 'w, W: Write> Formatter<'f, 't, 'w, W> {
!self.fmt.is_empty()
}

/// Decodes a Unicode scalar value from the beginning of `fmt` and advances
/// the parser accordingly.
///
/// If a Unicode scalar value could not be decoded, then an error is
/// returned.
///
/// It would be nice to just pass through bytes as-is instead of doing
/// actual UTF-8 decoding, but since the `Write` trait only represents
/// Unicode-accepting buffers, we need to actually do decoding here.
///
/// # Panics
///
/// When `self.fmt` is empty. i.e., Only call this when you know there is
/// some remaining bytes to parse.
#[inline(never)]
fn utf8_decode_and_bump(&mut self) -> Result<char, Error> {
match utf8::decode(self.fmt).expect("non-empty fmt") {
Ok(ch) => {
self.fmt = &self.fmt[ch.len_utf8()..];
return Ok(ch);
}
Err(invalid) => Err(err!(
"found invalid UTF-8 byte {byte:?} in format \
string (format strings must be valid UTF-8)",
byte = escape::Byte(invalid),
)),
}
}

/// Parses optional extensions before a specifier directive. That is, right
/// after the `%`. If any extensions are parsed, the parser is bumped
/// to the next byte. (If no next byte exists, then an error is returned.)
Expand Down
22 changes: 22 additions & 0 deletions src/fmt/strtime/mod.rs
Original file line number Diff line number Diff line change
Expand Up @@ -2168,4 +2168,26 @@ mod tests {
@"2024-07-30T00:56:25+04:00[+04:00]",
);
}

// Regression test for format strings with non-ASCII in them.
//
// We initially didn't support non-ASCII because I had thought it wouldn't
// be used. i.e., If someone wanted to do something with non-ASCII, then
// I thought they'd want to be using something more sophisticated that took
// locale into account. But apparently not.
//
// See: https://github.com/BurntSushi/jiff/issues/155
#[test]
fn ok_non_ascii() {
let fmt = "%Y年%m月%d日,%H时%M分%S秒";
let dt = crate::civil::date(2022, 2, 4).at(3, 58, 59, 0);
insta::assert_snapshot!(
dt.strftime(fmt),
@"2022年02月04日,03时58分59秒",
);
insta::assert_debug_snapshot!(
DateTime::strptime(fmt, "2022年02月04日,03时58分59秒").unwrap(),
@"2022-02-04T03:58:59",
);
}
}

0 comments on commit 1ecb1ee

Please sign in to comment.