Unicode in POSIX C

David 2020-06-21

The other day I was watching an old video of Brian Kernighan explaining UNIX and there's a bit where he talks about building a "spell-checker" with a shell pipeline. This made me wonder if there were a pipeline-able UNIX-y utility to determine the closest match to a given, possibly misspelled, word among a list of words. Basically a UNIX-style "filter" where one can pipe-in or write-in a list of words and also a mis-typed word and determine the closest match. I figured this would use the edit distance to figure it out. After a bit of searching, I couldn't really find something like this, so I decided to try and write it. The result that I came up with was a little utility called dym which stands for "did you mean...?"

The dym program takes a list of strings and an input string and tries to find the closest match to that input string in the list of strings, based on the Levenshtein distance. I have since added the ability for it to use the Damerau-Levenshtein distance. I wanted to add a command-line switch (specifically: -i) to make the matching case-insensitive, ala grep -i. Initially I used the C library function tolower() to convert everything to lowercase, but I realized this would only work for ASCII text and, ideally, I wanted dym to work with any text, even if it were Unicode.

Trying to figure out how to do that, I searched for things like "Unicode in C" or "tolower Unicode C" and such. What I found was either not very helpful, or recommended that I simply use the ICU library. For most projects, using ICU is probably "the right thing to do" but I wanted to see what I could get done without resorting to depending upon a new library. I was fairly sure that there was a more difficult but possibly more readily available way to work with Unicode in C, so I went looking for an example that I could use to learn from.

Since grep -i had inspired this idea, I figured that grep itself was a good candidate to use as the example. I also knew that the NetBSD operating system was famously portable and likely used only C that would work on most UNIX-like systems, so I decided to look at NetBSD's grep. First, I went and found out how to get the source code to NetBSD, which is explained in the NetBSD guide. I've since discovered that there is also a mirror of the NetBSD source on GitHub.

Once I got the source code, I had to find the grep source. I knew from having looked at BSD source code before that it was probably in /usr/src/usr.bin/grep, which it was. Then, I used ripgrep to search for wchar_t which I knew was a C Unicode ("wide") character. Sure enough, I found a hit in fastgrep.c.

Here I discovered the mbstowcs function, for which a man page existed on my system. Using the man page and the NetBSD source, I was able to figure out how to convert a string of bytes—i.e. a char*— to a string of wchar_t and back again (via wcstombs). Through trial-and-error I played around with things until I figured out how to get it to work and how to change the case of the Unicode string via towlower.

The end result of all of this is in dym/lowercase.c and now dym words (fairly well) with Unicode strings just as it does with ASCII strings. Here is an annotated version of some of the code:

/* Calling mbstowcs on NULL and copying 0 bytes will just return the number of
 * wchar_t characters that would be used, minus the trailing NULL whcar_t. */
size_t wstr_size = mbstowcs(NULL, str, 0);
wstr = malloc(sizeof(wchar_t) * (wstr_size + 1));

/* Convert the "wide" characters to their lowercase "equivalent" */
for (i = 0; i <= wstr_size; i++) {
	wstr[i] = towlower(wstr[i]);
}

/* Convert the wide-character string back to a byte/char string */
wcstombs(str, wstr, len);
str[len] = '\0';
free(wstr);
}

The next time you're trying to figure out how to do something in C, I highly recommend looking at how any of the BSDs (NetBSD, OpenBSD, or FreeBSD) did something similar because the code tends to be very readable.