HCF.LOL - Unicode encoding in C from scratch

This post walks through printing Unicode characters in C by encoding code points into UTF-8 by hand, using the fire emoji (U+1F525) as our running example.

Really short history of Unicode and UTF-8

Before Unicode, various encoding systems were used for different languages and regions, leading to compatibility issues and confusion.

Unicode was developed in the late 1980s and early 1990s as a universal character encoding standard. Its goal was to provide a unique number for every character, regardless of platform, program, or language. This ambitious project aimed to unify all the world's writing systems under one standard.

UTF-8, created by Ken Thompson and Rob Pike in 1992, is a variable-width encoding that can represent every character in the Unicode standard. It has become the dominant encoding for the World Wide Web and is backward compatible with ASCII.

Byte Representation

Let's start our journey with the fire emoji 🔥, represented by the Unicode code point U+1F525.

In UTF-8, the fire emoji (U+1F525) is represented by four bytes: F0 9F 94 A5. Let's break this down:

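F0 (11110000): Lead byte. The 11110 prefix marks the start of a 4-byte sequence in UTF-8
9F (10011111): Continuation byte (every continuation byte starts with 10)
94 (10010100): Continuation byte
A5 (10100101): Final continuation byte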

This sequence encodes the code point 0x1F525 in UTF-8 format.
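To make the breakdown concrete, here is a minimal sketch that runs it in reverse: it masks off the prefix bits of each byte and glues the remaining payload bits back together. The name decode_utf8_4 is made up for this post, and the sketch assumes a well-formed 4-byte sequence, so it does no validation.

```c
#include <stdint.h>

/* Hypothetical helper: decode one well-formed 4-byte UTF-8 sequence.
 * No validation is performed; this only illustrates the bit layout. */
uint32_t decode_utf8_4(const unsigned char b[4]) {
    return ((uint32_t)(b[0] & 0x07) << 18)   /* 11110xxx: 3 payload bits */
         | ((uint32_t)(b[1] & 0x3F) << 12)   /* 10xxxxxx: 6 payload bits */
         | ((uint32_t)(b[2] & 0x3F) << 6)    /* 10xxxxxx: 6 payload bits */
         |  (uint32_t)(b[3] & 0x3F);         /* 10xxxxxx: 6 payload bits */
}
```

Fed the bytes F0 9F 94 A5, the payload bits are 0x00, 0x1F, 0x14 and 0x25, which combine to 0x1F525.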

If we run the bash command printf "\xF0\x9F\x94\xA5\n" in a Unix-like terminal, we are directly printing the UTF-8 byte sequence for the fire emoji 🔥. Here's why it works:

\x is an escape sequence in C-style strings that allows you to specify a byte using its hexadecimal value.
\xF0\x9F\x94\xA5 represents the four bytes of the UTF-8 encoded fire emoji.
The terminal, if configured to use UTF-8, interprets these bytes as the fire emoji and displays it.
\n adds a newline character at the end.

This method works because we are directly instructing the terminal to output specific byte values, which it then interprets as UTF-8.
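The same trick carries over to C almost unchanged, because C string literals accept the same \x escapes. A minimal sketch, assuming the terminal is configured for UTF-8; the names FIRE_UTF8 and print_fire are invented for illustration:

```c
#include <stdio.h>

/* The four UTF-8 bytes of U+1F525, written with hex escapes.
 * The array holds 5 bytes: the 4 encoded bytes plus the terminating '\0'. */
static const char FIRE_UTF8[] = "\xF0\x9F\x94\xA5";

/* Hypothetical helper: writes the raw bytes followed by a newline to stdout.
 * Returns a non-negative value on success and EOF on failure, like puts. */
int print_fire(void) {
    return puts(FIRE_UTF8);
}
```

One thing to watch with \x in C: the escape keeps consuming hex digits, so a literal like "\xA5e" would not mean the byte A5 followed by the letter e. Splitting the literal ("\xA5" "e") avoids that.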

Printing Unicode UTF-8 with C

When working with Unicode in C, we need to be more explicit about our intentions. C predates Unicode, and an ordinary C string is just a sequence of bytes with no encoding attached to it. That is all we need, though: if we produce the right bytes ourselves, we can work with UTF-8 encoded strings.

To encode a Unicode code point into UTF-8, we can use a 5-byte buffer. Why 5 bytes? Because:

The maximum number of bytes needed for a UTF-8 encoded character is 4.
We need an extra byte for the null terminator '\0' to mark the end of the string in C.
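The 4-byte maximum follows from how many payload bits each sequence length can carry. A small sketch of that mapping, with a made-up function name (utf8_length); it only looks at the numeric range and does not check for surrogates:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: number of UTF-8 bytes needed for a code point,
 * or 0 if the value is above U+10FFFF. Surrogates are not checked here. */
size_t utf8_length(uint32_t codepoint) {
    if (codepoint <= 0x7F)     return 1;  /* fits in 7 bits  */
    if (codepoint <= 0x7FF)    return 2;  /* fits in 11 bits */
    if (codepoint <= 0xFFFF)   return 3;  /* fits in 16 bits */
    if (codepoint <= 0x10FFFF) return 4;  /* fits in 21 bits */
    return 0;                             /* outside the Unicode range */
}
```

Since the result never exceeds 4, a char[5] always has room for the encoded bytes and the terminating '\0'.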

Here's a simple function to demonstrate this encoding:

#include <stdint.h>  // for uint32_t

void encode_utf8(uint32_t codepoint, char buffer[5]) {
    if (codepoint <= 0x7F) {
        buffer[0] = codepoint;
        buffer[1] = '\0';
    } else if (codepoint <= 0x7FF) {
        buffer[0] = 0xC0 | (codepoint >> 6);
        buffer[1] = 0x80 | (codepoint & 0x3F);
        buffer[2] = '\0';
    } else if (codepoint <= 0xFFFF) {
        buffer[0] = 0xE0 | (codepoint >> 12);
        buffer[1] = 0x80 | ((codepoint >> 6) & 0x3F);
        buffer[2] = 0x80 | (codepoint & 0x3F);
        buffer[3] = '\0';
    } else if (codepoint <= 0x10FFFF) {
        buffer[0] = 0xF0 | (codepoint >> 18);
        buffer[1] = 0x80 | ((codepoint >> 12) & 0x3F);
        buffer[2] = 0x80 | ((codepoint >> 6) & 0x3F);
        buffer[3] = 0x80 | (codepoint & 0x3F);
        buffer[4] = '\0';
    } else {
        // Invalid codepoint
        buffer[0] = '\0';
    }
}

This function takes a Unicode code point and encodes it into UTF-8 in the provided buffer.

The ASCII Case

ASCII characters (0-127) are a subset of UTF-8 and are encoded using a single byte. In our encoding function, this is handled by the first condition:

if (codepoint <= 0x7F) {
    buffer[0] = codepoint;
    buffer[1] = '\0';
}

This simplicity is one of the reasons UTF-8 gained widespread adoption – it's fully compatible with ASCII.
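A practical consequence of this design: a byte below 0x80 in a UTF-8 string is always a plain ASCII character, and continuation bytes always start with the bits 10. That makes it possible to count characters without fully decoding them. A rough sketch, assuming the input is valid UTF-8 (the function name is hypothetical):

```c
#include <stddef.h>

/* Hypothetical helper: counts code points in a NUL-terminated string,
 * assuming it is valid UTF-8. Continuation bytes (10xxxxxx) are skipped,
 * so only ASCII bytes and lead bytes are counted. */
size_t utf8_count_codepoints(const char *s) {
    size_t count = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80) {
            count++;
        }
    }
    return count;
}
```

For a pure ASCII string this returns the same value as strlen, while the four bytes of the fire emoji count as a single code point.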

The Surrogate Pairs for UTF-16

UTF-16, another Unicode encoding, uses surrogate pairs to represent code points above U+FFFF. Each pair is built from two 16-bit values taken from the range U+D800 to U+DFFF, which Unicode reserves for this purpose. UTF-8 doesn't use surrogate pairs, and the code points in that range are not valid characters on their own, so we need to reject them to avoid producing invalid sequences.

In our encoding function, we should add a check for surrogate code points:

if (codepoint >= 0xD800 && codepoint <= 0xDFFF) {
    // Invalid codepoint, use replacement character
    encode_utf8(0xFFFD, buffer);
    return;
}
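For contrast, here is roughly what UTF-16 does with that reserved range. This is a sketch with an invented function name (to_surrogate_pair), and it assumes the caller only passes code points from U+10000 to U+10FFFF:

```c
#include <stdint.h>

/* Hypothetical helper: splits a code point in U+10000..U+10FFFF into
 * a UTF-16 surrogate pair. The caller must ensure the range. */
void to_surrogate_pair(uint32_t codepoint, uint16_t *high, uint16_t *low) {
    uint32_t v = codepoint - 0x10000;          /* 20 bits remain */
    *high = (uint16_t)(0xD800 + (v >> 10));    /* top 10 bits    */
    *low  = (uint16_t)(0xDC00 + (v & 0x3FF));  /* bottom 10 bits */
}
```

For the fire emoji this gives the pair D83D DD25. Neither half means anything alone, which is why a lone value in U+D800 to U+DFFF has no valid UTF-8 encoding.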

The Replacement Character U+FFFD

The replacement character � (U+FFFD) is used to indicate problems when rendering or displaying text. It's commonly used when an unknown, unrecognized, or unrepresentable character is encountered.

In our encoding function, we can use this character when we encounter invalid code points:

if (codepoint > 0x10FFFF) {
    // Invalid codepoint, use replacement character
    encode_utf8(0xFFFD, buffer);
    return;
}
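Both checks (surrogates and out-of-range values) can also live in one small function that runs before encoding. This is one possible arrangement rather than the approach used in the rest of this post, and the names are hypothetical:

```c
#include <stdint.h>

#define REPLACEMENT_CHAR 0xFFFDu

/* Hypothetical helper: maps anything that cannot be encoded as UTF-8
 * (surrogates, values above U+10FFFF) to U+FFFD and passes everything
 * else through unchanged. */
uint32_t sanitize_codepoint(uint32_t codepoint) {
    if (codepoint >= 0xD800 && codepoint <= 0xDFFF) {
        return REPLACEMENT_CHAR;  /* surrogate range */
    }
    if (codepoint > 0x10FFFF) {
        return REPLACEMENT_CHAR;  /* beyond the last Unicode code point */
    }
    return codepoint;
}
```

Calling this once up front means the encoder itself never has to call itself again for the fallback. U+FFFD falls in the 3-byte range and encodes to EF BF BD.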

Putting It All Together

Now, let's look at a complete implementation that handles all these cases:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

char* codepoint_to_utf8(uint32_t codepoint, char buffer[5]) {
    // UTF-8 can encode Unicode codepoints in the range 0x0 to 0x10FFFF
    // We'll use the replacement character U+FFFD for any invalid input

    if (codepoint <= 0x7F) {
        // 1-byte sequence: ASCII codepoints 0-127
        // Format: 0xxxxxxx
        buffer[0] = (char)codepoint;
        buffer[1] = '\0';
    } else if (codepoint <= 0x7FF) {
        // 2-byte sequence
        // Format: 110xxxxx 10xxxxxx
        buffer[0] = (char)(0xC0 | (codepoint >> 6));
        buffer[1] = (char)(0x80 | (codepoint & 0x3F));
        buffer[2] = '\0';
    } else if (codepoint <= 0xFFFF) {
        // Check for surrogate pair range (U+D800 to U+DFFF)
        // These are reserved for UTF-16 and invalid in UTF-8
        if (codepoint >= 0xD800 && codepoint <= 0xDFFF) {
            // Use replacement character U+FFFD
            return codepoint_to_utf8(0xFFFD, buffer);
        }
        // 3-byte sequence
        // Format: 1110xxxx 10xxxxxx 10xxxxxx
        buffer[0] = (char)(0xE0 | (codepoint >> 12));
        buffer[1] = (char)(0x80 | ((codepoint >> 6) & 0x3F));
        buffer[2] = (char)(0x80 | (codepoint & 0x3F));
        buffer[3] = '\0';
    } else if (codepoint <= 0x10FFFF) {
        // 4-byte sequence
        // Format: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        buffer[0] = (char)(0xF0 | (codepoint >> 18));
        buffer[1] = (char)(0x80 | ((codepoint >> 12) & 0x3F));
        buffer[2] = (char)(0x80 | ((codepoint >> 6) & 0x3F));
        buffer[3] = (char)(0x80 | (codepoint & 0x3F));
        buffer[4] = '\0';
    } else {
        // Invalid codepoint (above U+10FFFF)
        // Use replacement character U+FFFD
        return codepoint_to_utf8(0xFFFD, buffer);
    }

    return buffer;
}

void print_sokoban_map(const char* map[], int height, int width) {
    char utf8_buffer[5];

    // Define Unicode codepoints for Sokoban elements
    const uint32_t WALL = 0x1F332;     // Evergreen tree (0x1F9F1, brick, is an alternative)
    const uint32_t PLAYER = 0x1F6B6;   // Person walking
    const uint32_t BOX = 0x1F4E6;      // Package
    const uint32_t TARGET = 0x1F6A9;   // Triangular flag on post
    const uint32_t FLOOR = 0x3000;     // Ideographic space (full-width blank)

    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            uint32_t codepoint;
            switch (map[y][x]) {
                case '#': codepoint = WALL; break;
                case '@': codepoint = PLAYER; break;
                case '$': codepoint = BOX; break;
                case '.': codepoint = TARGET; break;
                case ' ': codepoint = FLOOR; break;
                default: codepoint = FLOOR; break;
            }

            codepoint_to_utf8(codepoint, utf8_buffer);
            printf("%s", utf8_buffer);

        }
        printf("\n");
    }
}

int main() {
    // Test cases including edge cases and invalid inputs
    uint32_t test_cases[] = {
        0x24,       // U+0024 DOLLAR SIGN
        0xA3,       // U+00A3 POUND SIGN
        0x939,      // U+0939 DEVANAGARI LETTER HA
        0xD800,     // Invalid: surrogate code point
        0x1F525,    // U+1F525 FIRE
        0x10FFFF,   // Highest valid codepoint
        0x110000    // Invalid: above U+10FFFF
    };
    int num_cases = sizeof(test_cases) / sizeof(test_cases[0]);
    char utf8_buffer[5];

    for (int i = 0; i < num_cases; i++) {
        uint32_t codepoint = test_cases[i];
        codepoint_to_utf8(codepoint, utf8_buffer);

        printf("UTF-8 for U+%04X: ", (unsigned int)codepoint);

        // Print each byte in hexadecimal
        for (int j = 0; utf8_buffer[j] != '\0'; j++) {
            printf("%02X ", (unsigned char)utf8_buffer[j]);
        }

        printf("(");
        // Print the actual UTF-8 string
        printf("%s", utf8_buffer);
        printf(")\n");
    }

    printf("\n\n\n");

    const char* sokoban_map[] = {
        "    #####          ",
        "    #   #          ",
        "    #$  #          ",
        "  ###  $##         ",
        "  #  $ $ #         ",
        "### # ## #   ######",
        "#   # ## #####  ..#",
        "# $  $          ..#",
        "##### ### #@##  ..#",
        "    #     #########",
        "    #######        "
    };

    int height = sizeof(sokoban_map) / sizeof(sokoban_map[0]);
    int width = (int)strlen(sokoban_map[0]);  // all rows have the same length

    print_sokoban_map(sokoban_map, height, width);

    return 0;
}

This code demonstrates how to handle various Unicode code points, including edge cases and invalid inputs. It encodes ASCII characters and multi-byte sequences, and it substitutes the replacement character for surrogate code points and out-of-range values.

TL;DR

Printing Unicode characters in C requires understanding UTF-8 encoding. Here are the key points:

Unicode assigns a unique number (code point) to every character.
UTF-8 is a variable-width encoding that can represent all Unicode code points.
In C, we can encode Unicode code points into UTF-8 using bitwise operations.
ASCII characters (0-127) are encoded as single bytes in UTF-8.
Multi-byte sequences in UTF-8 follow a specific pattern with leading and continuation bytes.
Surrogate code points (U+D800 to U+DFFF) are reserved for UTF-16, are invalid in UTF-8, and should be replaced.
The replacement character (U+FFFD) is used for invalid or unrepresentable code points.
A 5-byte buffer is sufficient to encode any valid Unicode code point in UTF-8 (4 bytes max + null terminator).

Special thanks to @zoriya_dev for the kind feedback and support.
