Unicode

Strict UTF-8 and UTF-16LE validation, decoding, encoding, and transcoding.

std.unicode provides strict Unicode primitives. The current public modules are std.unicode.utf8 and std.unicode.utf16le.

String and str text APIs such as String.length() and str.length remain non-fallible for trusted text and count Unicode scalar values. String and str reject embedded NUL bytes. Use std.unicode when invalid input or byte-boundary failures should be reported with explicit errors.

UTF-8

UTF-8 functions accept bytes.Bytes, so inputs are length-aware and can represent arbitrary byte buffers, including buffers that contain NUL bytes. Owned text conversion still rejects malformed UTF-8 and embedded NUL bytes.

import std.bytes;
import std.unicode.utf8;

fn main() {
    let data = bytes.Bytes("A\u{e9}\u{20ac}\u{10348}");

    assert utf8.valid(data);
    assert (try utf8.count(data)) == 4;

    let first = try utf8.decode(data, 0);
    assert first.value == 65 as u32;
    assert first.size == 1;
}

utf8.decode(data, offset) returns Decode throws(UnicodeError), where Decode.value is the Unicode scalar value as u32 and Decode.size is the number of UTF-8 bytes consumed.
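
The relationship between offset, value, and size can be illustrated outside Zynx. The following is a minimal Python sketch, not the std.unicode implementation: walk_utf8 is a hypothetical helper that mimics calling decode repeatedly and advancing the offset by the reported size, and it assumes the sequence length is taken from the lead byte as in standard UTF-8.

```python
def walk_utf8(data: bytes) -> list[tuple[int, int, int]]:
    """Return (offset, value, size) for each scalar in UTF-8 data.

    Hypothetical helper for illustration only. Raises ValueError
    (UnicodeDecodeError is a subclass) on malformed input.
    """
    out = []
    offset = 0
    while offset < len(data):
        lead = data[offset]
        # The lead byte determines how many bytes the sequence occupies.
        if lead < 0x80:
            size = 1
        elif lead >> 5 == 0b110:
            size = 2
        elif lead >> 4 == 0b1110:
            size = 3
        elif lead >> 3 == 0b11110:
            size = 4
        else:
            raise ValueError(f"invalid lead byte at offset {offset}")
        # Python's strict decoder rejects truncated or overlong sequences.
        value = ord(data[offset:offset + size].decode("utf-8"))
        out.append((offset, value, size))
        offset += size  # the next scalar starts right after this one
    return out


sample = "A\u00e9\u20ac\U00010348".encode("utf-8")
for offset, value, size in walk_utf8(sample):
    print(offset, hex(value), size)
```

For the sample string used above, the four scalars start at byte offsets 0, 1, 3, and 6 and occupy 1, 2, 3, and 4 bytes respectively.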

Encode UTF-8

Use utf8.encode(codepoint, out) to encode one Unicode scalar into a caller-owned byte buffer. It returns the number of bytes written.

import std.mem;
import std.unicode.utf8;

fn main() {
    let out = [0 as u8, 0 as u8, 0 as u8, 0 as u8];
    let n = try utf8.encode(0x1f600 as u32,
                            mem.slice_of<u8>(out.ptr, out.length));

    assert n == 4;
}

Invalid scalar values are reported as UnicodeError.Invalid. Output buffers that are too small are reported as UnicodeError.Buffer.
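
As a rough guide to how large the output buffer needs to be, the Python sketch below computes the UTF-8 encoded size of a code point. It is an illustration of the standard UTF-8 size rules, not Zynx code: utf8_encoded_size is a hypothetical name, and it assumes that "invalid scalar value" means what the Unicode Standard defines, namely a surrogate code point (U+D800 to U+DFFF) or a value above U+10FFFF.

```python
def utf8_encoded_size(codepoint: int) -> int:
    """Return how many UTF-8 bytes a Unicode scalar value needs (1 to 4).

    Hypothetical helper for illustration only. Raises ValueError for
    values that are not Unicode scalar values.
    """
    if codepoint < 0 or codepoint > 0x10FFFF:
        raise ValueError("outside the Unicode code space")
    if 0xD800 <= codepoint <= 0xDFFF:
        raise ValueError("surrogate code points are not scalar values")
    if codepoint < 0x80:
        return 1
    if codepoint < 0x800:
        return 2
    if codepoint < 0x10000:
        return 3
    return 4


print(utf8_encoded_size(0x1F600))
```

A four-byte buffer, as in the example above, is therefore always large enough for a single scalar.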

UTF-8 To UTF-16LE

utf8.to_utf16le(data) validates and transcodes to an owned Array<u16> throws(UnicodeError | AllocError). Use utf8.to_utf16le_into(data, out) when the destination buffer is owned by the caller.

import std.bytes;
import std.unicode.utf8;

fn main() {
    let units = try utf8.to_utf16le(bytes.Bytes("A\u{10348}"));
    let view = units.slice();

    assert view.length == 3;
    assert view[0] == 0x0041 as u16;
    assert view[1] == 0xd800 as u16;
    assert view[2] == 0xdf48 as u16;
}
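
The values 0xd800 and 0xdf48 asserted above follow from the standard UTF-16 surrogate-pair arithmetic. A minimal Python sketch of that arithmetic is shown here for reference; to_utf16_units is a hypothetical helper, not part of std.unicode, and it assumes its argument is already a valid Unicode scalar value.

```python
def to_utf16_units(codepoint: int) -> list[int]:
    """Return the UTF-16 code units for one valid Unicode scalar value.

    Hypothetical helper for illustration only.
    """
    if codepoint < 0x10000:
        # Basic Multilingual Plane: one code unit, same numeric value.
        return [codepoint]
    v = codepoint - 0x10000          # 20-bit offset into the upper planes
    high = 0xD800 + (v >> 10)        # top 10 bits go in the high surrogate
    low = 0xDC00 + (v & 0x3FF)       # bottom 10 bits go in the low surrogate
    return [high, low]


print([hex(u) for u in to_utf16_units(0x10348)])
```

For U+10348 the offset is 0x348, so the high surrogate is 0xD800 + 0 and the low surrogate is 0xDC00 + 0x348 = 0xDF48.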

UTF-16LE To UTF-8

utf16le.valid(data) checks that a [u16] view contains valid UTF-16LE. utf16le.to_utf8(data) validates and returns an owned String throws(UnicodeError | AllocError). Because Zynx text values reject embedded NUL, input that would transcode to NUL-containing text reports UnicodeError.Invalid. Use utf16le.to_utf8_into(data, out) when the output may contain arbitrary Unicode, including NUL, and should stay in a caller-owned byte buffer.

import std.mem;
import std.unicode.utf16le;

fn main() {
    let raw = [65 as u16, 0x00e9 as u16];
    let text = try utf16le.to_utf8(mem.slice_of<u16>(raw.ptr, raw.length));

    assert text.size() == 3;
}
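
What counts as valid UTF-16 comes down to surrogate pairing. The Python sketch below shows the usual rule under the assumption that utf16le.valid follows the Unicode definition of well-formed UTF-16; utf16_valid is a hypothetical stand-in, not the library function, and it does not model the separate NUL check that owned text conversion applies.

```python
def utf16_valid(units: list[int]) -> bool:
    """Return True if the 16-bit code units form well-formed UTF-16.

    Hypothetical helper for illustration only.
    """
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # A high surrogate must be followed by a low surrogate.
            if i + 1 >= len(units) or not (0xDC00 <= units[i + 1] <= 0xDFFF):
                return False
            i += 2
        elif 0xDC00 <= u <= 0xDFFF:
            # A low surrogate without a preceding high surrogate.
            return False
        else:
            i += 1
    return True


print(utf16_valid([0x0041, 0xD800, 0xDF48]))
print(utf16_valid([0xD800, 0x0041]))
```

The first call describes a well-formed sequence (one BMP unit plus one surrogate pair); the second contains an unpaired high surrogate.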

API Reference

std.unicode re-exports utf8, utf16le, Decode, and UnicodeError.

Item                             Purpose
UnicodeError.Invalid             Invalid UTF-8, invalid UTF-16LE, or invalid scalar value.
UnicodeError.Buffer              Caller-provided output storage is too small.
Decode { value, size }           One decoded UTF-8 scalar and byte length.
utf8.valid(data)                 Return whether a byte view is valid UTF-8.
utf8.count(data)                 Count Unicode scalar values in valid UTF-8.
utf8.decode(data, offset)        Decode one UTF-8 scalar at a byte offset.
utf8.encode(codepoint, out)      Encode one scalar into a byte buffer.
utf8.to_utf16le(data)            Transcode UTF-8 to an owned Array<u16> throws(UnicodeError | AllocError).
utf8.to_utf16le_into(data, out)  Transcode UTF-8 into caller-owned [u16].
utf16le.valid(data)              Return whether a [u16] view is valid UTF-16LE.
utf16le.to_utf8(data)            Transcode UTF-16LE to an owned String throws(UnicodeError | AllocError).
utf16le.to_utf8_into(data, out)  Transcode UTF-16LE into caller-owned [u8].