C++ char32_t - SyntBlaze

char32_t is a fundamental, distinct character type introduced in C++11 and designed to hold a single UTF-32 code unit. Because UTF-32 is a fixed-width encoding with one code unit per code point, a single char32_t value in well-formed text corresponds directly to one Unicode scalar value (a code point outside the surrogate range).

Type Specifications

Size and Signedness: char32_t has the exact same size, signedness, and alignment requirements as std::uint_least32_t. It is an unsigned integral type guaranteed to be at least 32 bits wide. Note that sizeof(char32_t) returns the size in C++ bytes, which depends on the platform’s CHAR_BIT macro; therefore, it is not strictly guaranteed to be 4 bytes on platforms where a byte is larger than 8 bits.
Type Distinctness: Despite its underlying representation, char32_t is a distinct type. It is not a typedef for std::uint_least32_t, meaning it resolves uniquely during function overloading and template specialization.
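
The properties above can be checked at compile time. The following is a minimal sketch, not a definitive test suite; the overload pair is_char32_overload is a hypothetical name introduced only for illustration.

```cpp
#include <climits>
#include <cstdint>
#include <type_traits>

// Same size and alignment as std::uint_least32_t, unsigned, at least 32 bits wide.
static_assert(sizeof(char32_t) == sizeof(std::uint_least32_t), "size matches uint_least32_t");
static_assert(alignof(char32_t) == alignof(std::uint_least32_t), "alignment matches uint_least32_t");
static_assert(sizeof(char32_t) * CHAR_BIT >= 32, "at least 32 bits wide");
static_assert(std::is_unsigned<char32_t>::value, "char32_t is unsigned");

// A distinct type, not an alias of std::uint_least32_t.
static_assert(!std::is_same<char32_t, std::uint_least32_t>::value, "distinct type");

// Hypothetical overload pair: the two declarations can coexist only because the types differ.
constexpr bool is_char32_overload(char32_t) { return true; }
constexpr bool is_char32_overload(std::uint_least32_t) { return false; }

static_assert(is_char32_overload(U'A'), "a U'...' literal selects the char32_t overload");
static_assert(!is_char32_overload(std::uint_least32_t{65}), "an integer selects the integer overload");
```

If char32_t were only a typedef, the second is_char32_overload declaration would be a redefinition and the snippet would not compile.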

Literal Syntax

To declare a char32_t character or string literal, prefix the literal with a capital U.

// char32_t character literals
char32_t ascii_char = U'A';
char32_t unicode_cp = U'\U0001F600'; // \U universal character name: exactly eight hex digits (U+1F600)

// Pointer to a string literal, which is a null-terminated array of const char32_t
const char32_t* raw_str = U"UTF-32 String Literal";
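
As a rough illustration of what these literals produce, the sketch below checks their types and values at compile time. The helper literal_length is a hypothetical name, not a standard facility.

```cpp
#include <cstddef>
#include <type_traits>

// A U'...' literal has type char32_t, and its value is the code point number.
static_assert(std::is_same<decltype(U'A'), char32_t>::value, "U'A' is a char32_t");
static_assert(U'A' == 0x41, "U+0041");
static_assert(U'\u00E9' == 0xE9, "four-digit escape, U+00E9");
static_assert(U'\U0001F600' == 0x1F600, "eight-digit escape, U+1F600");

// Hypothetical helper: number of elements in a U"..." literal,
// excluding the terminating null that the compiler appends.
template <std::size_t N>
constexpr std::size_t literal_length(const char32_t (&)[N]) { return N - 1; }

static_assert(literal_length(U"UTF-32") == 6, "six elements plus the terminating null");
```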

Standard Library Integration

The C++ Standard Library provides specialized aliases and templates to handle char32_t sequences:

#include <string>
#include <string_view>

// std::u32string is a typedef for std::basic_string<char32_t>
std::u32string str32 = U"Standard Library String";

// std::u32string_view is a typedef for std::basic_string_view<char32_t> (C++17)
std::u32string_view view32 = U"String View";

Memory Representation and Code Points

A char32_t array or std::u32string stores a sequence of integers that are each at least 32 bits wide. Unlike char8_t (UTF-8), which needs multi-unit sequences for every code point above U+007F, or char16_t (UTF-16), which needs surrogate pairs for code points outside the Basic Multilingual Plane (BMP), every valid Unicode code point fits into exactly one char32_t element.

It is critical to distinguish between a Unicode code point and a user-perceived character (grapheme cluster). While char32_t avoids surrogate pairs, grapheme clusters containing combining diacritical marks, skin tone modifiers, or zero-width joiners (ZWJ) consist of multiple code points and therefore require multiple char32_t elements.

// Single code point representation
std::u32string s = U"A"; 
// s.length() == 1
// The element is >= 32 bits wide (sizeof(s[0]) * CHAR_BIT >= 32)

std::u32string emoji = U"😀"; 
// emoji.length() == 1 (No surrogate pairs required for this code point)

// Multi-code point grapheme cluster (Family emoji: 👨‍👩‍👧‍👦)
std::u32string family = U"👨‍👩‍👧‍👦";
// family.length() == 7 (4 base emojis + 3 Zero-Width Joiners)
// Requires 7 char32_t elements, despite rendering as one visual character
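
The element count of the family emoji can be verified without depending on the source file's encoding by spelling the sequence with escapes. This is a minimal sketch assuming C++17 (for constexpr std::u32string_view); count_code_point, kZeroWidthJoiner, and kFamily are hypothetical names. It only counts code points and does not perform real grapheme cluster segmentation.

```cpp
#include <cstddef>
#include <string_view>

constexpr char32_t kZeroWidthJoiner = U'\u200D';

// Counts how many elements of `text` are equal to the code point `cp`.
constexpr std::size_t count_code_point(std::u32string_view text, char32_t cp) {
    std::size_t n = 0;
    for (char32_t c : text) {
        if (c == cp) ++n;
    }
    return n;
}

// Man, ZWJ, woman, ZWJ, girl, ZWJ, boy: one rendered glyph, seven code points.
constexpr std::u32string_view kFamily =
    U"\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466";

static_assert(kFamily.size() == 7, "seven char32_t elements");
static_assert(count_code_point(kFamily, kZeroWidthJoiner) == 3, "three joiners");
```

Counting user-perceived characters correctly requires the Unicode text segmentation rules, which the C++ standard library does not provide for char32_t strings.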

Standard I/O Limitations

Standard C++ I/O streams do not provide built-in support for char32_t. There is no std::u32cout stream, and std::ostream has no operator<< that writes char32_t text. Inserting a std::u32string fails to compile in every standard version. Before C++20, inserting a char32_t prints its integer value and inserting a const char32_t* prints a pointer address; since C++20 those overloads are explicitly deleted, so both fail to compile.

#include <iostream>
#include <string>

std::u32string str = U"Test";
const char32_t* raw_str = U"Test";

// std::cout << str;      // ERROR: Fails to compile (no operator<< for std::u32string)
// std::cout << raw_str;  // Before C++20: compiles, but prints the pointer's address, not the string
                          // Since C++20: ERROR, the overload is deleted
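
One common workaround is to transcode to UTF-8 and write the resulting bytes to a narrow stream. The sketch below is one possible hand-written encoder, not a standard library facility; to_utf8 and print_smiley are hypothetical names. It assumes C++17, assumes the terminal interprets output as UTF-8, and replaces invalid values (surrogates or anything above U+10FFFF) with U+FFFD.

```cpp
#include <cassert>
#include <iostream>
#include <string>
#include <string_view>

// Encodes a UTF-32 sequence as UTF-8 bytes.
inline std::string to_utf8(std::u32string_view text) {
    std::string out;
    for (char32_t cp : text) {
        // Replace surrogates and out-of-range values with U+FFFD.
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) cp = 0xFFFD;
        if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                               // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Usage sketch: convert first, then write the bytes to std::cout.
inline void print_smiley() {
    const std::string bytes = to_utf8(U"\U0001F600");
    assert(bytes.size() == 4);   // U+1F600 needs four UTF-8 code units
    std::cout << bytes << '\n';
}
```

Taking std::u32string_view as the parameter lets the same function accept a std::u32string, a U"..." literal, or a const char32_t* without extra overloads.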