std::str::split_inclusive gives unexpected results compared to std::str::split #111457

Open
@vdods

Description

Up front, thanks to everyone who has worked on Rust for creating a fantastic language :-)

I tried this code:

fn run_case_char(s: &str, sep: char) {
    println!("{:?} --- split_inclusive --> {:?}", s, s.split_inclusive(sep).collect::<Vec<_>>());
    println!("{:?} --- split           --> {:?}", s, s.split(sep).collect::<Vec<_>>());
}

fn main() {
    run_case_char("xsys", 's');
    run_case_char("xsy", 's');
    run_case_char("xs", 's');
    run_case_char("x", 's');
    run_case_char("", 's');
}

I expected to see this happen: the output of std::str::split_inclusive would be identical to that of std::str::split, except with the separator included in each item. In particular, I expected both iterators to yield the same number of items. The actual output was:

"xsys" --- split_inclusive --> ["xs", "ys"]
"xsys" --- split           --> ["x", "y", ""]
"xsy" --- split_inclusive --> ["xs", "y"]
"xsy" --- split           --> ["x", "y"]
"xs" --- split_inclusive --> ["xs"]
"xs" --- split           --> ["x", ""]
"x" --- split_inclusive --> ["x"]
"x" --- split           --> ["x"]
"" --- split_inclusive --> []
"" --- split           --> [""]

Instead, this happened: in the calls to std::str::split_inclusive, whenever the last substring was the empty string, it was omitted from the result. This was especially surprising when the input was the empty string, in which case the resulting iterator yields no elements at all.

I see an explanation of this behavior under the examples section of the documentation for std::str::split_inclusive: "If the last element of the string is matched, that element will be considered the terminator of the preceding substring. That substring will be the last item returned by the iterator." However, this seems to contradict the definitional description: "An iterator over substrings of this string slice, separated by characters matched by a pattern. Differs from the iterator produced by split in that split_inclusive leaves the matched part as the terminator of the substring."

A concrete case where I think the trailing empty string should not be dropped is segmenting a string into newline-terminated lines so that the segments are contiguous and their number agrees with the line count. The line count is 1 plus the number of newlines in the string; the last line may well be the empty string, but it's no less valid as a line.
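To make the mismatch concrete, here is a minimal sketch that compares the line count under the convention above with the number of items each iterator yields. The helper name `line_count` is made up for this illustration and is not part of std.

```rust
/// Hypothetical helper: line count under the convention described above,
/// i.e. one more than the number of newline characters.
fn line_count(text: &str) -> usize {
    text.matches('\n').count() + 1
}

fn main() {
    for text in ["a\nb", "a\nb\n", "\n", ""] {
        let expected = line_count(text);
        let plain = text.split('\n').count();
        let inclusive = text.split_inclusive('\n').count();
        // split agrees with the line count in every case; split_inclusive
        // comes up one short whenever the text is empty or ends with '\n'.
        println!("{text:?}: line count {expected}, split {plain}, split_inclusive {inclusive}");
    }
}
```

For "a\nb\n" this reports a line count of 3, with 3 items from split but only 2 from split_inclusive.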

Looking at the source, I see that the implementation of the method is

    pub fn split_inclusive<'a, P: Pattern<'a>>(&'a self, pat: P) -> SplitInclusive<'a, P> {
        SplitInclusive(SplitInternal {
            start: 0,
            end: self.len(),
            matcher: pat.into_searcher(self),
            allow_trailing_empty: false,
            finished: false,
        })
    }

and in particular, the presence of allow_trailing_empty implies that in principle either behavior could be specified easily, though obviously that's hidden behind the private type SplitInternal.
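For comparison, the public API already exposes this distinction in the non-inclusive case: std::str::split_terminator is documented as equivalent to split, except that the trailing substring is skipped if empty. The sketch below (the helper name `counts` is invented for illustration) suggests that split_inclusive lines up item for item with split_terminator rather than with split, at least for the cases tried in this report.

```rust
/// Hypothetical helper: number of items yielded by
/// (split, split_terminator, split_inclusive) for a char separator.
fn counts(s: &str, sep: char) -> (usize, usize, usize) {
    (
        s.split(sep).count(),
        s.split_terminator(sep).count(),
        s.split_inclusive(sep).count(),
    )
}

fn main() {
    for s in ["xsys", "xsy", "xs", "x", ""] {
        let (split, terminator, inclusive) = counts(s, 's');
        // split_inclusive yields as many items as split_terminator,
        // which is one fewer than split when the trailing piece is empty.
        assert_eq!(terminator, inclusive);
        println!("{s:?}: split {split}, split_terminator {terminator}, split_inclusive {inclusive}");
    }
}
```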

Anyway, I realize that it's probably not feasible to change the behavior of the existing method. I would be in favor of adding the ability to specify allow_trailing_empty somehow.
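In the meantime, the desired behavior can be approximated in user code. The following is only a sketch under stated assumptions: the function name `split_inclusive_keep_trailing` is hypothetical, it handles only a `char` separator rather than an arbitrary pattern, and it is not a proposal for the actual std signature.

```rust
/// Hypothetical helper, not part of std: like `split_inclusive`, but also
/// yields the trailing empty substring, so the item count matches `split`.
fn split_inclusive_keep_trailing<'a>(
    s: &'a str,
    sep: char,
) -> impl Iterator<Item = &'a str> + 'a {
    // A trailing empty piece exists exactly when the input is empty
    // or ends with the separator.
    let trailing = if s.is_empty() || s.ends_with(sep) {
        Some(&s[s.len()..])
    } else {
        None
    };
    s.split_inclusive(sep).chain(trailing)
}

fn main() {
    for s in ["xsys", "xsy", "xs", "x", ""] {
        let pieces: Vec<&str> = split_inclusive_keep_trailing(s, 's').collect();
        // Same number of items as `split`, and the pieces still
        // concatenate back to the original string.
        assert_eq!(pieces.len(), s.split('s').count());
        assert_eq!(pieces.concat(), s);
        println!("{s:?} --> {pieces:?}");
    }
}
```

Chaining an `Option` onto the iterator keeps the helper lazy and avoids allocating; the cost is that it only covers separators for which `ends_with` gives the same answer as the splitter.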

Meta

rustc --version --verbose:

rustc 1.67.1 (d5a82bbd2 2023-02-07)
binary: rustc
commit-hash: d5a82bbd26e1ad8b7401f6a718a9c57c96905483
commit-date: 2023-02-07
host: x86_64-unknown-linux-gnu
release: 1.67.1
LLVM version: 15.0.6

Metadata

    Labels

    C-enhancement: Category: An issue proposing an enhancement or a PR with one.
    T-libs-api: Relevant to the library API team, which will review and decide on the PR/issue.
