-
-
Notifications
You must be signed in to change notification settings - Fork 5.7k
Port StringView to Base
#60526
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Port StringView to Base
#60526
Conversation
|
Test and build failures appear unrelated. |
4cfff69 to
aef8a48
Compare
| return StringView(cu[i:j]) | ||
| end | ||
|
|
||
| function chomp(s::StringViewAndSub) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like we could have a unified method here for UTF8String?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not quite - the String method can e.g. use inbounds and assume_effects, but StringView needs to be more careful as the codeunits are generic
|
|
||
| * `IOContext` supports a new boolean `hexunsigned` option that allows for | ||
| printing unsigned integers in decimal instead of hexadecimal ([#60267]). | ||
| * The `StringView` type wraps an `AbstractVector{UInt8}` and interprets it as a UTF-8 encoded string. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| * The `StringView` type wraps an `AbstractVector{UInt8}` and interprets it as a UTF-8 encoded string. | |
| * The `StringView` type wraps an `AbstractVector{UInt8}` and interprets it as a UTF-8 encoded string, superseding the [StringViews.jl](https://github.com/JuliaStrings/StringViews.jl) package ([#60526]). |
|
Did you not copy over the optimized |
|
The recent generic string hashing function is quite fast - a StringView hashes almost as fast as a String. So I don't think there is a need to specialize |
|
Why not make it exactly as fast, when all it requires is a 1-line change to broaden this method: Line 636 in c3e9456
Contrariwise, if there is no performance advantage to the specialized |
This PR ports StringViews.jl to Base, as discussed in #60037. Closes #60037.
Tests have been written by a chat-bot and reviewed by me
Decisions for reviewers before reviewing
@inboundswith aStringViewwrapping an unknown array type? The code currently does, but I'm not against removing it. If we do remove it, we may need to re-jigger a bunch of string related code, since shared code paths of strings and string views would mean thatStringperformance could be degraded by removing inbounds annotations.Motivation
We generally want to avoid adding more types to Base when it can live in a package, so this PR requires some justification:
StringViews are the more fundamental string
StringandSubStringare an abstraction over an underlying byte array, and almost all its operations are defined in terms of loading bytes from that array. This abstraction of strings as "arrays in a trench coat" is made explicit by string views.Hence,
StringandSubStringcan be implemented in terms of string views, but not the other way around. And it is better if Base contains the foundations and packages provide implementations on top of those, instead of the other way around.One hint that the relationship is inverted is to look at the implementation of
StringViews.jl: It re-implements tonnes of internal Base string functions, instead of calling into generic, foundational methods in Base.We also have comments like this in Base:
# duck-type s so that external UTF-8 string packages like StringViews can hook in, where the sensible thing would be to define getindex for UTF8-encoded strings in terms of this method, instead of encouraging use of Base internals.Due to the strong intersection of strings as an abstraction and string views, this PR only has ~100 LOC in
base/strings/stringview.jl, whereassrc/inStringViews.jlhas ~800 LOC.Relatedly, I find it unfortunate that this relationship between strings as an abstraction, and arrays as the underlying implementation, is frequently inverted for
Stringin Base - code will often define operations onCodeUnits{UInt8, String}in terms of operations on the string, instead of the other way around. This makes the 'dispatch funnels' awkward because it does not match the fundamental truth that string operations are ultimately defined (even in Base) in terms of byte operations.StringViews could be used in Base itself
Base does a bunch of parsing, e.g. on TOML and Tar files. In my view, string views are very useful for zero-copy operations for efficient parsing. For example, to parse a
Float32from a byte buffer, we now jave to allocate an intermediateString. If we ever want to optimise Tar and TOML parsing, we would need something like string views at some point.Arguments against this PR:
@StefanKarpinski mentioned that he and @oscardssmith was discussing some string redesign that would lessen the need for string views. I don't know the details, but would like to know more. If string views are made obsolete in the future, it may be a bad idea to add this type.
This PR also does complicate some dispatch systems. Since we don't have an interface for expressing denseness, this PR worsens the situation where dispatch is controlled by "big unions" that attempt to encompass the same concept, such as e.g.
DenseString = Union{SubString{String}, String, StringView{<:Union{DenseVector{UInt8}, var"#s18"} where var"#s18"<:(SubArray{UInt8, 1, var"#s16", I, true} where {var"#s16"<:DenseVector{UInt8}, I<:Union{Tuple{Vararg{Real}}, Tuple{AbstractUnitRange, Vararg{Any}}}})}, SubString{<:StringView{<:Union{DenseVector{UInt8}, var"#s18"} where var"#s18"<:(SubArray{UInt8, 1, var"#s16", I, true} where {var"#s16"<:DenseVector{UInt8}, I<:Union{Tuple{Vararg{Real}}, Tuple{AbstractUnitRange, Vararg{Any}}}})}}}.By laying extra complexity on a not-so-consistent mess of dispatch and half-abstractions, this PR makes the situations somewhat worse. This problem is possibly temporary, and could be solved, but it does mean that the existence of string views in Base makes existing Base code more complex, and therefore increases the risk of bugs.
Finally, adding any new stuff to Base has obvious downsides: It makes it impossible to change without breaking Base Julia, makes Base CI take longer, contributes to 'feature overload' and so on.
Differences from StringViews.jl to this PR
The
StringViewtype in this PR is different from that in StringViews.jl in several respects, listed below. The purpose of the changes I've made is to make the implementation more 'conservative', such that the semantics of string views are more in line with existing Base code.I have also removed a bunch of auxiliary methods that are slightly off, semantically, mixing up strings, vectors and IOs.
Vector{UInt8}(s::StringView{Vector{UInt8}})now copies, instead of accessing the wrapped vector, because it is confusing that theVector{UInt8}(::StringView)constructor sometimes and sometimes not aliases the string. Thecodeunitsfunction can be used to unambiguously get a vector that aliases the string.StringView(::String)has been removed. String views are - by definition - built fromAbstractVector{UInt8}, not strings. If you for (some weird) reason wants a string view wrapping a string, doStringView(codeunits(::String))StringView(::StringView)is also removed. We construct string views from vectors, and nothing else.StringView(buf::IOBuffer)constructor has been removed. The purpose of this function was to wrap the buffer ofIOBuffer. However, this should be done properly with a function that gets the buffer ofIOBufferas anAbstractVector{UInt8}, then wrapping that in a string view.s[1:2]now copies, instead of returning a view. It has also been type-stabilized.SVRegexMatchhas been removed. Regex matching over a string view just returns aRegexMatch. This is objectively better, but was not done in StringViews.jl becauseRegexMatchused to not be generic over the target string.reverse(::StringView)now returns aStringView{Memory{UInt8}}. It would be nicer to return the same type as the input, but that is not possible generically.StringViewcan only be constructed with a one-based indexed vector. This restriction may be lifted in the future. This check has been added because some existing StringView code assumed that and I didn't want to re-invent StringViews.jl too much in this PR.StringView{T1}wrapping aT2whereT2 <: T1andT1 !== T2. This just complicates the code for no good reason.Future work
More work could be done to refactor the "dispatch funnels" for
AbstractString. In particular, routing more methods throughcodeunits, such that the functions operate onAbstractVectorcould simplify quite a bit. However, this is best left for a future PR.