Implement UnicodeSegmentation for Iterator<Item = char> #28

cbarrick · 2017-06-18T08:21:14Z

It would be nice to segment character iterators, especially for interoperability with the unicode-normalization crate. This could provide a solution to #7 when/if io::Chars stabilizes. In particular, I'd like to write a tokenizer like this:

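// Desired usage; split_word_bounds() on a char iterator does not exist yet.
let input = my_input(); // any type implementing BufRead
let tokens = input.chars().nfkc().split_word_bounds();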

One issue I see is that most of the public structs provide an as_str method that returns "the underlying data (the part yet to be iterated) as a slice of the original string". This obviously won't work with streaming types.


ghost · 2017-07-14T03:20:52Z

I will work on this.

ghost · 2017-07-14T16:01:57Z

@CryZe can you explain your CharOrBoundary idea?

CryZe · 2017-07-14T16:19:10Z

The problem is that unicode-segmentation is built around borrowing str slices from some owned data, like a String. That doesn't work well with char iterators, as those aren't stored anywhere. So to support this at all, you would need to introduce a separate streaming API that provides Iterator<Item = CharOrBoundary> iterators, with CharOrBoundary defined as:

pub enum CharOrBoundary {
    Char(char),
    Boundary,
}

So you can then iterate over the characters and it'll tell you whether you hit a boundary or not. You could then have other helper functions that help you collect char segments between the boundaries into buffers.
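As a rough illustration of the "helper functions" mentioned above, here is a minimal sketch of one that collects the chars between boundaries into owned buffers. The names `Segments` and `collect_segments` are hypothetical and not part of unicode-segmentation; the sketch also assumes that some other component already produces the `CharOrBoundary` stream.

```rust
/// Item type proposed in the comment above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CharOrBoundary {
    Char(char),
    Boundary,
}

/// Hypothetical adapter: yields one owned String per run of chars
/// found between two boundaries.
pub struct Segments<I> {
    inner: I,
}

impl<I: Iterator<Item = CharOrBoundary>> Iterator for Segments<I> {
    type Item = String;

    fn next(&mut self) -> Option<String> {
        let mut buf = String::new();
        for item in self.inner.by_ref() {
            match item {
                CharOrBoundary::Char(c) => buf.push(c),
                // Ignore leading or repeated boundaries: no empty segments.
                CharOrBoundary::Boundary if buf.is_empty() => continue,
                CharOrBoundary::Boundary => return Some(buf),
            }
        }
        // The stream ended; emit whatever was buffered, if anything.
        if buf.is_empty() {
            None
        } else {
            Some(buf)
        }
    }
}

/// Hypothetical entry point wrapping any CharOrBoundary stream.
pub fn collect_segments<I>(items: I) -> Segments<I::IntoIter>
where
    I: IntoIterator<Item = CharOrBoundary>,
{
    Segments {
        inner: items.into_iter(),
    }
}
```

Each segment is copied into a fresh String, which is the price of not having an underlying str to borrow from. Reusing one caller-supplied buffer would avoid the allocations, but it would no longer fit the standard Iterator trait.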

HadrienG2 · 2019-01-10T07:54:06Z

+1 to this idea, would also be useful when working with non-UTF8 strings in legacy APIs.

HadrienG2 · 2019-01-11T07:00:02Z

To clarify, I think that providing the full UnicodeSegmentation on top of Iterator<Item=char> is hopeless because the existing UnicodeSegmentation API heavily assumes access to an underlying &str all over the place, and in the case of an Iterator<Item=char> there may not be one.

What we could provide, however, is something that turns an Iterator<Item=char> into an Iterator<Item=Iterator<Item=char>> of sorts that represents graphemes or words (may need to be a streaming iterator if we don't want to impose a Clone bound on the underlying Iterator, or if we want to avoid parsing the text twice).

Since GraphemeCursor::next_boundary() already works on top of a char iterator, it might be possible to rewrite it in terms of this API in order to avoid code duplication. For words, it's less clear how to proceed, as the implementation makes even more UTF-8 string assumptions, such as manipulating string indices under the hood.

HadrienG2 · 2019-01-12T07:16:49Z

I looked into it further and tried to adapt parts of the GraphemeCursor implementation to streaming use cases. From this experiment, it seems to me that it is impossible to provide both of the following API properties at the same time while keeping the implementation sane:

Ability to work with an incomplete view of the input string and add more of it as needed.
Ability to work with streams of char.

The reason is that in the current API, extra input is "patched together" with the existing input using UTF-8 indices as a unifying abstraction, and AFAIK there is no nice equivalent in an Iterator<Item=char> world. An incomplete replacement would be some ability to attach an extra iterator at the end of the existing one using Iterator::chain(), but that would not address the full generality of the current API, which can also work with overlapping chunks of UTF-8 (though whether one would ever want to use them is up for debate).
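For reference, the Iterator::chain() approach amounts to something like the sketch below. The function name `append_input` is made up for illustration. Note that it can only append: there is no way to express that the new chunk overlaps input that was already seen, which is the limitation described above.

```rust
/// Hypothetical helper: presents `first` followed by `more` as one char stream.
pub fn append_input<A, B>(first: A, more: B) -> impl Iterator<Item = char>
where
    A: Iterator<Item = char>,
    B: Iterator<Item = char>,
{
    // chain() exhausts `first` before pulling anything from `more`.
    first.chain(more)
}
```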

So unless I'm missing something obvious, it seems to me that the least bad option is to stick with the existing code, collect the iterator of chars into a (possibly truncated) UTF-8 string and use unicode_segmentation on that.
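A minimal sketch of that fallback, using only the standard library: pull a bounded number of chars into an owned String, then hand the String to the existing str-based API (for example `UnicodeSegmentation::graphemes` or `split_word_bounds`). The helper name `buffer_chars` and the `max_chars` limit are assumptions made for illustration.

```rust
/// Hypothetical helper: moves up to `max_chars` chars from the stream into
/// an owned String, leaving the rest of the stream for a later call.
pub fn buffer_chars<I>(chars: &mut I, max_chars: usize) -> String
where
    I: Iterator<Item = char>,
{
    // by_ref() borrows the stream, so take() does not consume it for good.
    chars.by_ref().take(max_chars).collect()
}
```

If the buffer is truncated, the cut can land in the middle of a grapheme or word, so the last segment reported for a truncated buffer should be treated as possibly incomplete.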

HadrienG2 mentioned this issue Jan 10, 2019

Improve &str <-> &CStr16 lifecycle rust-osdev/uefi-rs#73

Closed
