fmt/strtime: support non-ASCII in strftime format strings

I'm not sure why I didn't do this initially. I think I thought there wasn't a use case. But it's easy enough to support. This would have probably been easier if we accepted a `&str` for the format string, especially since we require valid UTF-8 (which is a strict relaxation from requiring valid ASCII). In that case, we would just iterate over `chars()`. But, that would require the caller to create the `&str` (thus paying for UTF-8 validation) and would require `strftime` to do UTF-8 decoding. This way, we don't require any of that up-front UTF-8 validation, and in the vast majority of cases (ASCII-only), we just handle things one-byte-at-a-time without caring about UTF-8. Fixes #155
BurntSushi · Jan 1, 2025 · 1ecb1ee · 1ecb1ee
1 parent 9962d03
commit 1ecb1ee
Show file tree

Hide file tree

Showing 3 changed files with 63 additions and 9 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -9,6 +9,11 @@ Enhancements:
 * [#130](https://github.com/BurntSushi/jiff/issues/130):
 Document value ranges for methods like `year`, `day`, `hour` and so on.
 
+Bug fixes:
+
+* [#155](https://github.com/BurntSushi/jiff/issues/155):
+Relax `strftime` format strings from ASCII-only to all of UTF-8.
+
 
 0.1.18 (2024-12-31)
 ===================

diff --git a/src/fmt/strtime/format.rs b/src/fmt/strtime/format.rs
@@ -9,7 +9,7 @@ use crate::{
         Write, WriteExt,
     },
     tz::Offset,
-    util::escape,
+    util::{escape, utf8},
     Error,
 };
 
@@ -23,15 +23,13 @@ impl<'f, 't, 'w, W: Write> Formatter<'f, 't, 'w, W> {
     pub(super) fn format(&mut self) -> Result<(), Error> {
         while !self.fmt.is_empty() {
             if self.f() != b'%' {
-                if !self.f().is_ascii() {
-                    return Err(err!(
-                        "found non-ASCII byte {byte:?} in format \
-                         string (ASCII is currently required)",
-                        byte = escape::Byte(self.f()),
-                    ));
+                if self.f().is_ascii() {
+                    self.wtr.write_char(char::from(self.f()))?;
+                    self.bump_fmt();
+                } else {
+                    let ch = self.utf8_decode_and_bump()?;
+                    self.wtr.write_char(ch)?;
                 }
-                self.wtr.write_char(char::from(self.f()))?;
-                self.bump_fmt();
                 continue;
             }
             if !self.bump_fmt() {
@@ -141,6 +139,35 @@ impl<'f, 't, 'w, W: Write> Formatter<'f, 't, 'w, W> {
         !self.fmt.is_empty()
     }
 
+    /// Decodes a Unicode scalar value from the beginning of `fmt` and advances
+    /// the parser accordingly.
+    ///
+    /// If a Unicode scalar value could not be decoded, then an error is
+    /// returned.
+    ///
+    /// It would be nice to just pass through bytes as-is instead of doing
+    /// actual UTF-8 decoding, but since the `Write` trait only represents
+    /// Unicode-accepting buffers, we need to actually do decoding here.
+    ///
+    /// # Panics
+    ///
+    /// When `self.fmt` is empty. i.e., Only call this when you know there is
+    /// some remaining bytes to parse.
+    #[inline(never)]
+    fn utf8_decode_and_bump(&mut self) -> Result<char, Error> {
+        match utf8::decode(self.fmt).expect("non-empty fmt") {
+            Ok(ch) => {
+                self.fmt = &self.fmt[ch.len_utf8()..];
+                return Ok(ch);
+            }
+            Err(invalid) => Err(err!(
+                "found invalid UTF-8 byte {byte:?} in format \
+                 string (format strings must be valid UTF-8)",
+                byte = escape::Byte(invalid),
+            )),
+        }
+    }
+
     /// Parses optional extensions before a specifier directive. That is, right
     /// after the `%`. If any extensions are parsed, the parser is bumped
     /// to the next byte. (If no next byte exists, then an error is returned.)

diff --git a/src/fmt/strtime/mod.rs b/src/fmt/strtime/mod.rs
@@ -2168,4 +2168,26 @@ mod tests {
             @"2024-07-30T00:56:25+04:00[+04:00]",
         );
     }
+
+    // Regression test for format strings with non-ASCII in them.
+    //
+    // We initially didn't support non-ASCII because I had thought it wouldn't
+    // be used. i.e., If someone wanted to do something with non-ASCII, then
+    // I thought they'd want to be using something more sophisticated that took
+    // locale into account. But apparently not.
+    //
+    // See: https://github.com/BurntSushi/jiff/issues/155
+    #[test]
+    fn ok_non_ascii() {
+        let fmt = "%Y年%m月%d日，%H时%M分%S秒";
+        let dt = crate::civil::date(2022, 2, 4).at(3, 58, 59, 0);
+        insta::assert_snapshot!(
+            dt.strftime(fmt),
+            @"2022年02月04日，03时58分59秒",
+        );
+        insta::assert_debug_snapshot!(
+            DateTime::strptime(fmt, "2022年02月04日，03时58分59秒").unwrap(),
+            @"2022-02-04T03:58:59",
+        );
+    }
 }