Unicode
The overall idea here is:
- Make sure we have a distinction between text and a sequence of bytes. Text is a `<string>` or `<unicode-string>`. A sequence of bytes is a `<byte-vector>` or a `<buffer>`.
- `<unicode-string>` should have a single, defined encoding, either `UTF-8` or `UCS4` (to be determined).
- Sequences of bytes have an encoding associated with them (even if only through intent and not structurally). They must be decoded to get a `<string>`.
- `<byte-string>` should probably die.
- You can currently copy between a `<byte-vector>` and a `<string>` (using `copy-bytes`), but it isn't clear that this should be possible.
- Some things are specialized on `<byte-string>`, which ideally is going away.
- Is the I/O functionality (streams and files) sufficient to deal with text and byte buffers being distinct things?
- What impact will this have on the C-FFI? (It deals with `<C-string>` classes now.)
- Source files and LID files should be defined to be `UTF-8` encoded. We should not support alternate encodings for source text.
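The text/bytes distinction described above can be illustrated with Python, whose `str` and `bytes` types already follow this model. This is only an analogy and not Dylan code: `str` stands in for `<string>` and `bytes` for `<byte-vector>`, and nothing here reflects a decided Dylan API.

```python
# Bytes carry no encoding of their own; the encoding is known only by intent.
raw = b"na\xc3\xafve"           # six bytes that happen to be UTF-8

# An explicit decode step is required to obtain text.
text = raw.decode("utf-8")
assert text == "na\u00efve"     # five characters
assert (len(raw), len(text)) == (6, 5)

# Decoding the same bytes under a different assumed encoding yields
# different text, so the intended encoding has to be known.
mistaken = raw.decode("latin-1")
assert mistaken != text

# Text and bytes do not mix implicitly; combining them is a type error.
try:
    text + raw
except TypeError:
    mixed_ok = False
else:
    mixed_ok = True
assert mixed_ok is False
```

Whether Dylan should likewise reject mixing the two (for example by dropping `copy-bytes` between `<byte-vector>` and `<string>`) is exactly the open question raised in the list.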
While we work out the above, there's a good bit of work that can be done initially.
- `<unicode-integer>` should become an `<integer>` (or at least something of the right size rather than `<double-byte>`).
- We are using tag 3 for `<unicode-character>`. Each of the compiler backends and runtimes needs to be aware of this and be double-checked for correctness. (This includes verifying things like the implementation of `primitive-unicode-character-as-raw` and `primitive-raw-as-unicode-character`.)
- Evaluate the impact of the compiler not being aware of `<unicode-string>` in the way that it is aware of `<byte-string>`. What optimizations are missing due to this?
- Work on the `unicode-data-generator`, in particular the issues identified in `sources/app/unicode-data-generator/TODO`.
- Determine what Unicode functionality needs to be present in the core runtime and libraries to implement the functionality required by the DRM. (Things like uppercase, lowercase.)
- Figure out what to do about improved case handling, like having title case alongside the existing uppercase and lowercase code.
- Make `<unicode-character>` be limited to a 32-bit value where we can, rather than word-sized. (This is important for 64-bit platforms, but is less important in the short term than just getting the algorithms working.)
- Implement the Unicode algorithms in the `strings` library and the core runtime as appropriate. We can look at some code from Common Lisp that is being done for the GSOC this year. See notes below about this.
- Improve our test coverage of Unicode functionality. (This can partially borrow from the GSOC work below.)
- Figure out what encoders and decoders should look like. Write a UTF-8 encoder / decoder. Write other encoders (like UTF-16). See additional notes below.
- Make streams work well with encoders. (Not sure what that means.)
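As a starting point for the UTF-8 encoder / decoder item above, here is a minimal sketch in Python of the bit manipulation involved. The function names `utf8_encode` and `utf8_decode` are hypothetical and are not proposed Dylan names; the byte ranges follow the standard UTF-8 definition (no overlong forms, no surrogates, nothing above U+10FFFF).

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode scalar value as UTF-8 (hypothetical helper)."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if code_point < 0x80:
        return bytes([code_point])
    if code_point < 0x800:
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


def utf8_decode(data: bytes) -> list:
    """Decode UTF-8 into a list of code points, rejecting malformed input."""
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        # Pick sequence length, payload bits of the lead byte, and the
        # smallest code point that legitimately needs this length.
        if lead < 0x80:
            length, cp, minimum = 1, lead, 0
        elif 0xC2 <= lead <= 0xDF:
            length, cp, minimum = 2, lead & 0x1F, 0x80
        elif 0xE0 <= lead <= 0xEF:
            length, cp, minimum = 3, lead & 0x0F, 0x800
        elif 0xF0 <= lead <= 0xF4:
            length, cp, minimum = 4, lead & 0x07, 0x10000
        else:
            raise ValueError("invalid lead byte")
        if i + length > len(data):
            raise ValueError("truncated sequence")
        for byte in data[i + 1:i + length]:
            if byte & 0xC0 != 0x80:
                raise ValueError("invalid continuation byte")
            cp = (cp << 6) | (byte & 0x3F)
        if cp < minimum or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            raise ValueError("overlong, surrogate, or out-of-range value")
        out.append(cp)
        i += length
    return out
```

A Dylan version would presumably produce a `<byte-vector>` and consume one to yield a `<string>`, but that interface is still to be worked out.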
The work being done for the SBCL / Unicode project for GSOC (2014) is currently in https://github.com/krzysz00/sbcl/tree/unicode-algorithms. The important files are `tools-for-build/ucd.lisp`, `src/code/target-{char,unicode}.lisp` and `tests/unicode*`.
When we limit the size of `<unicode-character>` to 32 bits, we'll have to revisit some code that deals with repeated slots and limited vectors.
In the HARP backend, there is some code like this:

    let op--slot-element =
      select (repeated-representation-size(type))
        1 => op--byte-element;
        2 => op--double-byte-element;
        otherwise => op--repeated-slot-element;
      end select;

We'll need to fix that and look for similar code and issues in the generic DFMC code as well as the C and LLVM backends.
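To make the gap concrete, the following Python sketch models a size-based dispatch like the one above with an added 4-byte case. It is not HARP code: the table, the function names, and the assumption of a 64-bit word for the fallback are all hypothetical, and `struct` format codes merely stand in for the element accessors.

```python
import struct

MAX_CODE_POINT = 0x10FFFF

# Hypothetical element-size table: 1- and 2-byte elements as today,
# plus a 4-byte case for a 32-bit <unicode-character>.
ELEMENT_FORMATS = {1: "<B", 2: "<H", 4: "<I"}
WORD_FORMAT = "<Q"  # assumed 64-bit word-sized slot (the "otherwise" case)


def element_format(size_in_bytes: int) -> str:
    """Return the storage format for a repeated element of this size."""
    return ELEMENT_FORMATS.get(size_in_bytes, WORD_FORMAT)


def pack_element(value: int, size_in_bytes: int) -> bytes:
    """Store one element using the format chosen for its size."""
    return struct.pack(element_format(size_in_bytes), value)


# Every code point fits in 21 bits, so 2 bytes are too few and 4 suffice.
assert MAX_CODE_POINT.bit_length() == 21
assert len(pack_element(MAX_CODE_POINT, 4)) == 4
assert len(pack_element(MAX_CODE_POINT, 8)) == 8   # word-sized fallback
```

Without the 4-byte entry, a size of 4 would fall through to the word-sized case, which is the behavior that needs checking in each backend.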
We may also have to update the implementations of `primitive-unicode-character-as-raw` to extend the value to a word-sized value for the raw object. (See the LLVM implementation of `primitive-byte-character-as-raw`.)
Encoders and decoders should support translating between `<character>` / `<string>` and `<byte-vector>`. However, there are some other concerns:
- Some things use `<buffer>` rather than `<byte-vector>`. Does that matter?
- We may (eventually?) need `-into!` variants to reduce data copying.
- Streams should have a single encoding.
- We should default many things to UTF-8.
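The last three concerns can be sketched in Python. `encode_into` is a hypothetical stand-in for an `-into!` variant: it shows only the interface shape (the caller owns the destination buffer), since Python's `str.encode` still allocates a temporary internally. The stream half uses the standard `io.TextIOWrapper`, which fixes one encoding per stream at construction time.

```python
import io


def encode_into(text: str, target: bytearray, start: int = 0,
                encoding: str = "utf-8") -> int:
    """Write the encoded form of text into target beginning at start.

    Returns the index just past the last byte written. Hypothetical helper;
    UTF-8 is the default encoding.
    """
    encoded = text.encode(encoding)   # temporary; a real -into! would avoid it
    end = start + len(encoded)
    if end > len(target):
        raise ValueError("target buffer too small")
    target[start:end] = encoded       # same-length slice assignment, in place
    return end


# A text stream layered over a byte sink, with a single fixed encoding.
sink = io.BytesIO()
stream = io.TextIOWrapper(sink, encoding="utf-8", newline="")
stream.write("caf\u00e9")
stream.flush()
assert sink.getvalue() == b"caf\xc3\xa9"
```

For example, `encode_into("h\u00e9", bytearray(8), 2)` writes three bytes at offsets 2 through 4 and returns 5.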