@@ -179,12 +179,12 @@ julia> str[end]
```

Many Julia objects, including strings, can be indexed with integers. The index of the first
- element is returned by [`firstindex(str)`](@ref), and the index of the last element
+ element (the first character of a string) is returned by [`firstindex(str)`](@ref), and the index of the last element (character)
with [`lastindex(str)`](@ref). The keyword `end` can be used inside an indexing
operation as shorthand for the last index along the given dimension.
- Most indexing in Julia is 1-based: the first element of many integer-indexed objects is found at
- index 1. (As we will see below, this does not necessarily mean that the last element is found
- at index `n`, where `n` is the length of the string.)
+ String indexing, like most indexing in Julia, is 1-based: `firstindex` always returns `1` for any `AbstractString`.
+ As we will see below, however, `lastindex(str)` is *not* in general the same as `length(str)` for a string,
+ because some Unicode characters can occupy multiple "code units".

You can perform arithmetic and other operations with `end`, just like
a normal value:
@@ -264,10 +264,13 @@ julia> s = "\u2200 x \u2203 y"
Whether these Unicode characters are displayed as escapes or shown as special characters depends
on your terminal's locale settings and its support for Unicode. String literals are encoded using
the UTF-8 encoding. UTF-8 is a variable-width encoding, meaning that not all characters are encoded
- in the same number of bytes. In UTF-8, ASCII characters -- i.e. those with code points less than
+ in the same number of bytes ("code units"). In UTF-8, ASCII characters -- i.e. those with code points less than
0x80 (128) -- are encoded as they are in ASCII, using a single byte, while code points 0x80 and
- above are encoded using multiple bytes -- up to four per character. This means that not every
- byte index into a UTF-8 string is necessarily a valid index for a character. If you index into
+ above are encoded using multiple bytes -- up to four per character.
+
+ String indices in Julia refer to code units (= bytes for UTF-8), the fixed-width building blocks that
+ are used to encode arbitrary characters (code points). This means that not every
+ index into a `String` is necessarily a valid index for a character. If you index into
a string at such an invalid byte index, an error is thrown:

```jldoctest unicodestring
y
```

+ If you need to obtain valid indices for a string, you can use the [`nextind`](@ref) and
+ [`prevind`](@ref) functions to increment/decrement to the next/previous valid index, as mentioned above.
+ You can also use the [`eachindex`](@ref) function to iterate over the valid character indices:
+ ```jldoctest unicodestring
+ julia> collect(eachindex(s))
+ 7-element Array{Int64,1}:
+   1
+   4
+   5
+   6
+   7
+  10
+  11
+ ```
+
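As a rough illustration of the index functions named in the added paragraph, the sketch below walks the valid indices of the same example string by hand. The helper name `valid_indices` is hypothetical and exists only for this example; it is meant to give the same result as `collect(eachindex(s))`.

```julia
# Same example string as in the doctests: "∀ x ∃ y"
s = "\u2200 x \u2203 y"

# Hypothetical helper (not part of Base): collect the valid character
# indices by stepping with nextind, starting from firstindex.
function valid_indices(str::AbstractString)
    idxs = Int[]
    i = firstindex(str)
    while i <= lastindex(str)
        push!(idxs, i)
        i = nextind(str, i)   # jump to the start of the next character
    end
    return idxs
end

valid_indices(s) == collect(eachindex(s))   # true
nextind(s, 1)               # 4, since ∀ occupies code units 1 through 3
prevind(s, lastindex(s))    # 10
isvalid(s, 2)               # false: index 2 falls inside ∀
```

In practice `eachindex(s)` is the simpler choice; the manual loop only shows what `nextind` does at each step.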
+ To access the raw code units (bytes for UTF-8) of the encoding, you can use the [`codeunit(s,i)`](@ref)
+ function, where the index `i` runs consecutively from `1` to [`ncodeunits(s)`](@ref). The [`codeunits(s)`](@ref)
+ function returns an `AbstractVector{UInt8}` wrapper that lets you access these raw code units (bytes) as an array.
+
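A minimal sketch of the code-unit accessors described in the added paragraph, assuming the same example string `s` used in the doctests. The variable name `bytes` is chosen here for illustration only.

```julia
# Same example string as in the doctests: "∀ x ∃ y"
s = "\u2200 x \u2203 y"

ncodeunits(s)     # 11 code units (bytes), versus length(s) == 7 characters
codeunit(s, 1)    # 0xe2, the first byte of the three-byte encoding of ∀

# codeunits(s) returns a wrapper; collect turns it into a plain Vector{UInt8}
bytes = collect(codeunits(s))
bytes[1:3] == [0xe2, 0x88, 0x80]   # true: the UTF-8 bytes of ∀ (U+2200)

# Every index from 1 to ncodeunits(s) is valid for codeunit, unlike s[i]
all(codeunit(s, i) == bytes[i] for i in 1:ncodeunits(s))   # true
```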
Julia uses the UTF-8 encoding by default, and support for new encodings can be added by packages.
For example, the [LegacyStrings.jl](https://github.com/JuliaArchive/LegacyStrings.jl) package
implements `UTF16String` and `UTF32String` types. Additional discussion of other encodings and