Unicode in POSIX C
David 2020-06-21
The other day I was watching an old video of Brian Kernighan explaining UNIX, and there's a bit where he talks about building a "spell-checker" with a shell pipeline. This made me wonder whether there was a pipeline-able, UNIX-y utility to determine the closest match to a given, possibly misspelled, word among a list of words: basically a UNIX-style "filter" where one can pipe in or type in a list of words along with a mistyped word and get back the closest match. I figured this would use the edit distance. After a bit of searching, I couldn't really find such a tool, so I decided to try to write it. The result was a little utility called dym, which stands for "did you mean...?"
The dym
program takes a list of strings and an input string and tries to find
the closest match to that input string in the list of strings, based on the
Levenshtein distance. I
have since added the ability for it to use the Damerau-Levenshtein distance.
I wanted to add a command-line switch (specifically: -i
) to make the matching
case-insensitive, à la grep -i
. Initially I used the C library function
tolower()
to convert everything to lowercase, but I realized this would only
work for ASCII text and, ideally, I wanted dym
to work with any text, even if
it were Unicode.
Trying to figure out how to do that, I searched for things like "Unicode in C" or "tolower Unicode C" and such. What I found was either not very helpful, or recommended that I simply use the ICU library. For most projects, using ICU is probably "the right thing to do" but I wanted to see what I could get done without resorting to depending upon a new library. I was fairly sure that there was a more difficult but possibly more readily available way to work with Unicode in C, so I went looking for an example that I could use to learn from.
Since grep -i
had inspired this idea, I figured that grep
itself was a good
candidate to use as the example. I also knew that the NetBSD
operating system was famously portable and likely used only C that would work
on most UNIX-like systems, so I decided to look at NetBSD's grep
. First, I
went and found out how to get the source code to NetBSD, which is explained
in the NetBSD guide.
I've since discovered that there is also a mirror
of the NetBSD source on GitHub.
Once I got the source code, I had to find the grep
source. I knew from having
looked at BSD source code before that it was probably in
/usr/src/usr.bin/grep
, which it was. Then, I used ripgrep
to search for wchar_t
which I knew was C's "wide" character type. Sure
enough, I found a hit in fastgrep.c.
Here I discovered the mbstowcs
function, for which a man page
existed on my system. Using the man page and the NetBSD source, I was able to
figure out how to convert a string of bytes—i.e. a char*
— to a string of
wchar_t
and back again (via wcstombs
). Through trial and error I figured out how to make it work and how to change
the case of the Unicode string via towlower.
The end result of all of this is in dym/lowercase.c
and now dym
works (fairly well) with Unicode strings just as it does with
ASCII strings. Here is an annotated version of some of the code:
/* Calling mbstowcs with a NULL destination and a size of 0 just returns the
 * number of wchar_t characters that would be needed, not counting the
 * terminating NUL wchar_t. */
size_t wstr_size = mbstowcs(NULL, str, 0);
wchar_t *wstr = malloc(sizeof(wchar_t) * (wstr_size + 1));

/* Actually convert the byte string into the wide-character buffer */
mbstowcs(wstr, str, wstr_size + 1);

/* Convert the "wide" characters to their lowercase "equivalents" */
for (size_t i = 0; i < wstr_size; i++) {
    wstr[i] = towlower(wstr[i]);
}

/* Convert the wide-character string back to a byte/char string; len is the
 * length of the original string, supplied by the enclosing function */
wcstombs(str, wstr, len);
str[len] = '\0';
free(wstr);
The next time you're trying to figure out how to do something in C, I highly recommend looking at how any of the BSDs (NetBSD, OpenBSD, or FreeBSD) did something similar because the code tends to be very readable.