Unicode
Strict UTF-8 and UTF-16LE validation, decoding, encoding, and transcoding.
std.unicode provides strict Unicode primitives. The current public modules
are std.unicode.utf8 and std.unicode.utf16le.
String and str text APIs such as String.length() and str.length remain
non-fallible for trusted text and count Unicode scalar values. String and
str reject embedded NUL bytes. Use std.unicode when invalid input or
byte-boundary failures should be reported with explicit errors.
UTF-8 functions accept bytes.Bytes, so inputs are length-aware and can
represent arbitrary byte buffers, including buffers that contain NUL bytes.
Owned text conversion still rejects malformed UTF-8 and embedded NUL bytes.
import std.bytes;
import std.unicode.utf8;
fn main() {
let data = bytes.Bytes("A\u{e9}\u{20ac}\u{10348}");
assert utf8.valid(data);
assert (try utf8.count(data)) == 4;
let first = try utf8.decode(data, 0);
assert first.value == 65 as u32;
assert first.size == 1;
}
utf8.decode(data, offset) returns Decode throws(UnicodeError), where
Decode.value is the Unicode scalar value as u32 and Decode.size is the
number of UTF-8 bytes consumed.
Use utf8.encode(codepoint, out) to encode one Unicode scalar into a
caller-owned byte buffer.
import std.mem;
import std.unicode.utf8;
fn main() {
let out = [0 as u8, 0 as u8, 0 as u8, 0 as u8];
let n = try utf8.encode(0x1f600 as u32,
mem.slice_of<u8>(out.ptr, out.length));
assert n == 4;
}
Invalid scalar values return UnicodeError.Invalid. Too-small output buffers
return UnicodeError.Buffer.
utf8.to_utf16le(data) validates and transcodes to an owned
Array<u16> throws(UnicodeError | AllocError).
Use utf8.to_utf16le_into(data, out) when the destination buffer is owned by
the caller.
import std.bytes;
import std.unicode.utf8;
fn main() {
let units = try utf8.to_utf16le(bytes.Bytes("A\u{10348}"));
let view = units.slice();
assert view.length == 3;
assert view[0] == 0x0041 as u16;
assert view[1] == 0xd800 as u16;
assert view[2] == 0xdf48 as u16;
}
utf16le.valid(data) checks that a [u16] view contains valid UTF-16LE.
utf16le.to_utf8(data) validates and returns an owned
String throws(UnicodeError | AllocError), so input that would transcode to a
NUL-containing Zynx text value reports UnicodeError.Invalid. Use
utf16le.to_utf8_into(data, out) when arbitrary Unicode byte output, including
NUL, should remain at the byte boundary.
import std.mem;
import std.unicode.utf16le;
fn main() {
let raw = [65 as u16, 0x00e9 as u16];
let text = try utf16le.to_utf8(mem.slice_of<u16>(raw.ptr, raw.length));
assert text.size() == 3;
}
std.unicode re-exports utf8, utf16le, Decode, and UnicodeError.
| Item | Purpose |
|---|---|
UnicodeError.Invalid | Invalid UTF-8, invalid UTF-16LE, or invalid scalar value. |
UnicodeError.Buffer | Caller-provided output storage is too small. |
Decode { value, size } | One decoded UTF-8 scalar and byte length. |
utf8.valid(data) | Return whether a byte view is valid UTF-8. |
utf8.count(data) | Count Unicode scalar values in valid UTF-8. |
utf8.decode(data, offset) | Decode one UTF-8 scalar at a byte offset. |
utf8.encode(codepoint, out) | Encode one scalar into a byte buffer. |
utf8.to_utf16le(data) | Transcode UTF-8 to an owned Array<u16> throws(UnicodeError | AllocError). |
utf8.to_utf16le_into(data, out) | Transcode UTF-8 into caller-owned [u16]. |
utf16le.valid(data) | Return whether a [u16] view is valid UTF-16LE. |
utf16le.to_utf8(data) | Transcode UTF-16LE to an owned String throws(UnicodeError | AllocError). |
utf16le.to_utf8_into(data, out) | Transcode UTF-16LE into caller-owned [u8]. |