These are a few of my favorite things…

I like ASCII. Do I like ASCII because of all the wonderful things one can do with its extraordinarily large repertoire of 94 printable characters? Actually, yes. Before I defend that answer, I’d like to point out that ASCII has three important strengths: simplicity, robustness, and ubiquity. In other words, ASCII is simple in that it has a relatively small number of characters; it is robust in that it forms a subset of virtually every encoding, Unicode or otherwise; and it is ubiquitous in that it is supported everywhere. In fact, ASCII can be used to represent Unicode through the use of notations. Richard Ishida’s Unicode Code Converter is an excellent way to explore the various notations that are currently in use.
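As a rough illustration of what "representing Unicode in ASCII through notations" can look like, here is a minimal Python sketch. The function name `ascii_notations` and the particular choice of four notations are assumptions made for this example; they are not a list taken from the Unicode Code Converter.

```python
def ascii_notations(ch):
    """Return several ASCII-only spellings of a single Unicode character."""
    cp = ord(ch)  # the character's code point as an integer
    return {
        "U+ notation": f"U+{cp:04X}",
        "HTML hex reference": f"&#x{cp:X};",
        "HTML decimal reference": f"&#{cp};",
        # Python's own escape syntax, produced by the built-in codec
        "Python escape": ch.encode("unicode_escape").decode("ascii"),
    }


notations = ascii_notations("剣")
for name, text in notations.items():
    print(f"{name}: {text}")
```

Every string the sketch produces consists solely of ASCII characters, which is what lets such notations pass through ASCII-only channels.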

I really like Unicode. Compared with legacy encodings, Unicode covers a much broader collection of scripts for our world’s languages, and also defines properties that allow implementations to support its 100K+ characters more intelligently.

Speaking of Unicode, I actually like all three of its encoding forms, meaning UTF-8, UTF-16, and UTF-32. It’s difficult to select a favorite from among them, because all three are useful. UTF-8 is a superset of ASCII in that its one-byte portion is the same as ASCII. UTF-32 is the most human-readable in hexadecimal form, because its code units correspond directly to Unicode scalar values, such as U+5263 for 剣.
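To make the comparison concrete, the following sketch encodes 剣 (U+5263) in each of the three encoding forms and shows the bytes in hexadecimal. The helper name `encoded_hex` is made up for this example, and big-endian byte order is assumed so that no byte order mark is emitted.

```python
ENCODINGS = ("utf-8", "utf-16-be", "utf-32-be")


def encoded_hex(ch):
    """Map each encoding form to the hexadecimal bytes that encode ch."""
    return {enc: ch.encode(enc).hex() for enc in ENCODINGS}


ken = encoded_hex("剣")
for enc, hex_bytes in ken.items():
    print(f"{enc}: {hex_bytes}")

# The one-byte portion of UTF-8 is identical to ASCII.
ascii_same = "A".encode("utf-8") == "A".encode("ascii")
print(ascii_same)
```

The UTF-32 result, 00005263, is simply the scalar value padded to eight hexadecimal digits, while the UTF-8 result, e589a3, bears no obvious resemblance to it.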

I like Unicode’s 16 Supplementary Planes. The BMP (Basic Multilingual Plane) is nearly full, and any new large repertoires must be added outside of it, meaning in one of the Supplementary Planes. As of Unicode Version 6.0, the number of characters outside the BMP exceeds the number in the BMP. Speaking of Supplementary Planes, Plane 2 is nearly full with CJK Unified Ideographs Extensions B through D, with the nearly 6,000 characters of Extension E in the process of being added, so I predict that Plane 3 will serve as the next Supplementary Ideographic Plane. And, regardless of the Unicode encoding form, the number of bytes required to represent any of the 16 Supplementary Planes’ 1,048,544 ((65,536 − 2) × 16) characters is four. My Unicode Beyond-BMP Top Ten List is also worth checking out.
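The four-byte claim and the arithmetic can be checked with a short sketch. The helper name `byte_lengths` and the two sample code points (U+20000, the first code point of Plane 2, and U+10FFFD, the last code point of Plane 16 that is not a noncharacter) are choices made for this example. The "− 2" is taken here to account for the two noncharacters at the end of each plane.

```python
ENCODINGS = ("utf-8", "utf-16-be", "utf-32-be")


def byte_lengths(code_point):
    """Return the encoded length in bytes of one code point per encoding form."""
    ch = chr(code_point)
    return {enc: len(ch.encode(enc)) for enc in ENCODINGS}


first_plane2 = byte_lengths(0x20000)    # first code point of Plane 2
last_plane16 = byte_lengths(0x10FFFD)   # last usable code point of Plane 16

# Each plane has 65,536 code points, two of which are noncharacters.
per_plane = 65536 - 2
total = per_plane * 16

print(first_plane2)
print(last_plane16)
print(total)
```

UTF-8 uses its four-byte form, UTF-16 uses a surrogate pair of two 16-bit code units, and UTF-32 always uses a single 32-bit code unit, so all three arrive at four bytes.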

I like fonts. ☺

CJK Type Blog

CJK Fonts, Character Sets & Encodings.

By Dr. Ken Lunde
