Commit b4a780d

string doc clarifications
Clarify that `firstindex(str)` should always be `1` for any `AbstractString`, as mentioned by @StefanKarpinski in a comment on #26133. Also reference `prevind` and `eachindex`. Also introduce the "code unit" terminology and mention the `codeunit` functions.
1 parent 74f5328 commit b4a780d

1 file changed (+29, -7)

doc/src/manual/strings.md

Lines changed: 29 additions & 7 deletions
@@ -179,12 +179,12 @@ julia> str[end]
179179
```
180180

181181
Many Julia objects, including strings, can be indexed with integers. The index of the first
182-
element is returned by [`firstindex(str)`](@ref), and the index of the last element
182+
element (the first character of a string) is returned by [`firstindex(str)`](@ref), and the index of the last element (character)
183183
with [`lastindex(str)`](@ref). The keyword `end` can be used inside an indexing
184184
operation as shorthand for the last index along the given dimension.
185-
Most indexing in Julia is 1-based: the first element of many integer-indexed objects is found at
186-
index 1. (As we will see below, this does not necessarily mean that the last element is found
187-
at index `n`, where `n` is the length of the string.)
185+
String indexing, like most indexing in Julia, is 1-based: `firstindex` always returns `1` for any `AbstractString`.
186+
As we will see below, however, `lastindex(str)` is *not* in general the same as `length(str)` for a string,
187+
because some Unicode characters can occupy multiple "code units".
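
As a small illustration of the point above, the following sketch compares `firstindex`, `lastindex` and `length` on a string that mixes ASCII and non-ASCII characters. The sample string is the one used later in this section; the values in the comments assume the default UTF-8 `String` type.

```julia
# Two of the seven characters here take three bytes each in UTF-8.
s = "\u2200 x \u2203 y"

firstindex(s)  # 1  (always 1 for an AbstractString)
lastindex(s)   # 11 (index of the last character, counted in code units)
length(s)      # 7  (number of characters)
```

Because `lastindex(s)` and `length(s)` differ here, a loop over `1:length(s)` would not visit every character of `s`.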
188188

189189
You can perform arithmetic and other operations with `end`, just like
190190
a normal value:
@@ -264,10 +264,13 @@ julia> s = "\u2200 x \u2203 y"
264264
Whether these Unicode characters are displayed as escapes or shown as special characters depends
265265
on your terminal's locale settings and its support for Unicode. String literals are encoded using
266266
the UTF-8 encoding. UTF-8 is a variable-width encoding, meaning that not all characters are encoded
267-
in the same number of bytes. In UTF-8, ASCII characters -- i.e. those with code points less than
267+
in the same number of bytes ("code units"). In UTF-8, ASCII characters -- i.e. those with code points less than
268268
0x80 (128) -- are encoded as they are in ASCII, using a single byte, while code points 0x80 and
269-
above are encoded using multiple bytes -- up to four per character. This means that not every
270-
byte index into a UTF-8 string is necessarily a valid index for a character. If you index into
269+
above are encoded using multiple bytes — up to four per character.
270+
271+
String indices in Julia refer to code units (= bytes for UTF-8), the fixed-width building blocks that
272+
are used to encode arbitrary characters (code points). This means that not every
273+
index into a `String` is necessarily a valid index for a character. If you index into
271274
a string at such an invalid byte index, an error is thrown:
272275

273276
```jldoctest unicodestring
@@ -345,6 +348,25 @@ x
345348
y
346349
```
347350

351+
If you need to obtain valid indices for a string, you can use the [`nextind`](@ref) and
352+
[`prevind`](@ref) functions to increment/decrement to the next/previous valid index, as mentioned above.
353+
You can also use the [`eachindex`](@ref) function to iterate over the valid character indices:
354+
```jldoctest unicodestring
355+
julia> collect(eachindex(s))
356+
7-element Array{Int64,1}:
357+
1
358+
4
359+
5
360+
6
361+
7
362+
10
363+
11
364+
```
365+
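The relationship between `nextind`, `prevind` and `eachindex` can be sketched as follows. The helper name `valid_indices` is hypothetical and only serves this illustration; it steps through a string with `nextind` and should yield the same indices that `eachindex` produces.

```julia
s = "\u2200 x \u2203 y"

isvalid(s, 1)  # true: index 1 starts a character
isvalid(s, 2)  # false: index 2 falls inside the first three-byte character
nextind(s, 1)  # 4
prevind(s, 4)  # 1

# Hypothetical helper: collect valid indices by repeated calls to nextind.
function valid_indices(str::AbstractString)
    idxs = Int[]
    i = firstindex(str)
    while i <= lastindex(str)
        push!(idxs, i)
        i = nextind(str, i)
    end
    return idxs
end

valid_indices(s) == collect(eachindex(s))  # true
```

In practice `eachindex(s)` is the more direct choice; the manual loop is shown only to make the stepping explicit.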
366+
To access the raw code units (bytes for UTF-8) of the encoding, you can use the [`codeunit(s,i)`](@ref)
367+
function, where the index `i` runs consecutively from `1` to [`ncodeunits(s)`](@ref). The [`codeunits(s)`](@ref)
368+
function returns an `AbstractVector{UInt8}` wrapper that lets you access these raw code units (bytes) as an array.
369+
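A brief sketch of the code unit functions described above, again assuming the default UTF-8 `String` type and the same sample string. The byte values in the comments are the UTF-8 encoding of U+2200.

```julia
s = "\u2200 x \u2203 y"

ncodeunits(s)    # 11: total number of code units (bytes)
codeunit(s, 1)   # 0xe2: first byte of the three-byte encoding of U+2200
codeunit(s, 2)   # 0x88: valid here even though s[2] would throw an error

# codeunits(s) wraps the string as a vector of bytes without copying it.
cu = codeunits(s)
length(cu)       # 11
first_char_bytes = collect(cu)[1:3]  # [0xe2, 0x88, 0x80]
```

Unlike `s[i]`, `codeunit(s, i)` accepts every `i` from `1` to `ncodeunits(s)`, since it reads raw bytes rather than characters.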
348370
Julia uses the UTF-8 encoding by default, and support for new encodings can be added by packages.
349371
For example, the [LegacyStrings.jl](https://github.com/JuliaArchive/LegacyStrings.jl) package
350372
implements `UTF16String` and `UTF32String` types. Additional discussion of other encodings and