Skip to content

Conversation

@jakobnissen
Copy link
Member

@jakobnissen jakobnissen commented Jan 1, 2026

This PR ports StringViews.jl to Base, as discussed in #60037. Closes #60037.

Tests have been written by a chat-bot and reviewed by me

Decisions for reviewers before reviewing

  • Can we use @inbounds with a StringView wrapping an unknown array type? The code currently does, but I'm not against removing it. If we do remove it, we may need to re-jigger a bunch of string related code, since shared code paths of strings and string views would mean that String performance could be degraded by removing inbounds annotations.

Motivation

We generally want to avoid adding more types to Base when it can live in a package, so this PR requires some justification:

StringViews are the more fundamental string

String and SubString are an abstraction over an underlying byte array, and almost all its operations are defined in terms of loading bytes from that array. This abstraction of strings as "arrays in a trench coat" is made explicit by string views.
Hence, String and SubString can be implemented in terms of string views, but not the other way around. And it is better if Base contains the foundations and packages provide implementations on top of those, instead of the other way around.

One hint that the relationship is inverted is to look at the implementation of StringViews.jl: It re-implements tonnes of internal Base string functions, instead of calling into generic, foundational methods in Base.
We also have comments like this in Base: # duck-type s so that external UTF-8 string packages like StringViews can hook in, where the sensible thing would be to define getindex for UTF8-encoded strings in terms of this method, instead of encouraging use of Base internals.

Due to the strong intersection of strings as an abstraction and string views, this PR only has ~100 LOC in base/strings/stringview.jl, whereas src/ in StringViews.jl has ~800 LOC.

Relatedly, I find it unfortunate that this relationship between strings as an abstraction, and arrays as the underlying implementation, is frequently inverted for String in Base - code will often define operations on CodeUnits{UInt8, String} in terms of operations on the string, instead of the other way around. This makes the 'dispatch funnels' awkward because it does not match the fundamental truth that string operations are ultimately defined (even in Base) in terms of byte operations.

StringViews could be used in Base itself

Base does a bunch of parsing, e.g. on TOML and Tar files. In my view, string views are very useful for zero-copy operations for efficient parsing. For example, to parse a Float32 from a byte buffer, we now jave to allocate an intermediate String. If we ever want to optimise Tar and TOML parsing, we would need something like string views at some point.

Arguments against this PR:

@StefanKarpinski mentioned that he and @oscardssmith was discussing some string redesign that would lessen the need for string views. I don't know the details, but would like to know more. If string views are made obsolete in the future, it may be a bad idea to add this type.

This PR also does complicate some dispatch systems. Since we don't have an interface for expressing denseness, this PR worsens the situation where dispatch is controlled by "big unions" that attempt to encompass the same concept, such as e.g. DenseString = Union{SubString{String}, String, StringView{<:Union{DenseVector{UInt8}, var"#s18"} where var"#s18"<:(SubArray{UInt8, 1, var"#s16", I, true} where {var"#s16"<:DenseVector{UInt8}, I<:Union{Tuple{Vararg{Real}}, Tuple{AbstractUnitRange, Vararg{Any}}}})}, SubString{<:StringView{<:Union{DenseVector{UInt8}, var"#s18"} where var"#s18"<:(SubArray{UInt8, 1, var"#s16", I, true} where {var"#s16"<:DenseVector{UInt8}, I<:Union{Tuple{Vararg{Real}}, Tuple{AbstractUnitRange, Vararg{Any}}}})}}}.
By laying extra complexity on a not-so-consistent mess of dispatch and half-abstractions, this PR makes the situations somewhat worse. This problem is possibly temporary, and could be solved, but it does mean that the existence of string views in Base makes existing Base code more complex, and therefore increases the risk of bugs.

Finally, adding any new stuff to Base has obvious downsides: It makes it impossible to change without breaking Base Julia, makes Base CI take longer, contributes to 'feature overload' and so on.

Differences from StringViews.jl to this PR

The StringView type in this PR is different from that in StringViews.jl in several respects, listed below. The purpose of the changes I've made is to make the implementation more 'conservative', such that the semantics of string views are more in line with existing Base code.

I have also removed a bunch of auxiliary methods that are slightly off, semantically, mixing up strings, vectors and IOs.

  • Vector{UInt8}(s::StringView{Vector{UInt8}}) now copies, instead of accessing the wrapped vector, because it is confusing that the Vector{UInt8}(::StringView) constructor sometimes and sometimes not aliases the string. The codeunits function can be used to unambiguously get a vector that aliases the string.
  • The constructor StringView(::String) has been removed. String views are - by definition - built from AbstractVector{UInt8}, not strings. If you for (some weird) reason wants a string view wrapping a string, do StringView(codeunits(::String))
  • Similarly, StringView(::StringView) is also removed. We construct string views from vectors, and nothing else.
  • The StringView(buf::IOBuffer) constructor has been removed. The purpose of this function was to wrap the buffer of IOBuffer. However, this should be done properly with a function that gets the buffer of IOBuffer as an AbstractVector{UInt8}, then wrapping that in a string view.
  • Slicing into a string view, e.g. s[1:2] now copies, instead of returning a view. It has also been type-stabilized.
  • SVRegexMatch has been removed. Regex matching over a string view just returns a RegexMatch. This is objectively better, but was not done in StringViews.jl because RegexMatch used to not be generic over the target string.
  • reverse(::StringView) now returns a StringView{Memory{UInt8}}. It would be nicer to return the same type as the input, but that is not possible generically.
  • For now, StringView can only be constructed with a one-based indexed vector. This restriction may be lifted in the future. This check has been added because some existing StringView code assumed that and I didn't want to re-invent StringViews.jl too much in this PR.
  • It is not permitted to construct a StringView{T1} wrapping a T2 where T2 <: T1 and T1 !== T2. This just complicates the code for no good reason.

Future work

More work could be done to refactor the "dispatch funnels" for AbstractString. In particular, routing more methods through codeunits, such that the functions operate on AbstractVector could simplify quite a bit. However, this is best left for a future PR.

@jakobnissen jakobnissen marked this pull request as ready for review January 2, 2026 14:19
@jakobnissen jakobnissen changed the title WIP: Port StringView to Base Port StringView to Base Jan 2, 2026
@jakobnissen jakobnissen added strings "Strings!" feature Indicates new feature / enhancement requests status: waiting for PR reviewer labels Jan 2, 2026
@jakobnissen
Copy link
Member Author

Test and build failures appear unrelated.

return StringView(cu[i:j])
end

function chomp(s::StringViewAndSub)
Copy link
Member

@stevengj stevengj Jan 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like we could have a unified method here for UTF8String?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not quite - the String method can e.g. use inbounds and assume_effects, but StringView needs to be more careful as the codeunits are generic


* `IOContext` supports a new boolean `hexunsigned` option that allows for
printing unsigned integers in decimal instead of hexadecimal ([#60267]).
* The `StringView` type wraps an `AbstractVector{UInt8}` and interprets it as a UTF-8 encoded string.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* The `StringView` type wraps an `AbstractVector{UInt8}` and interprets it as a UTF-8 encoded string.
* The `StringView` type wraps an `AbstractVector{UInt8}` and interprets it as a UTF-8 encoded string, superseding the [StringViews.jl](https://github.com/JuliaStrings/StringViews.jl) package ([#60526]).

@stevengj
Copy link
Member

stevengj commented Jan 6, 2026

Did you not copy over the optimized hash method that calls hash_bytes, or am I missing something? Seems like you could just broaden hash(::String, ::UInt) to hash(::DenseUTF8String, ::UInt).

@jakobnissen
Copy link
Member Author

The recent generic string hashing function is quite fast - a StringView hashes almost as fast as a String. So I don't think there is a need to specialize

@stevengj
Copy link
Member

stevengj commented Jan 7, 2026

Why not make it exactly as fast, when all it requires is a 1-line change to broaden this method:

@assume_effects :total hash(data::String, h::UInt) =

Contrariwise, if there is no performance advantage to the specialized hash(::String) method, why not delete that method?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

feature Indicates new feature / enhancement requests status: waiting for PR reviewer strings "Strings!"

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Have a StringView in Base

3 participants