“Bahts is [not] parts”

—Mistakes happen—

—Humans make mistakes—

—Anything made by humans has the potential to include mistakes—

The most important things about mistakes are that 1) we recognize them, lest they propagate; 2) we learn from them; 3) we make an effort not to repeat them; and 4) we try to fix them, if possible.

Some mistakes are more easily fixed than others. Mistakes that cannot be fixed must be worked around.

With that said, an interesting event of historical significance occurred in June of 2000:

A former co-worker of mine in Type QE (Quality Engineering), David Kelly, observed that Unicode’s representative glyph for U+332C ㌬ SQUARE PAATU didn’t match the Adobe-Japan1-4 glyph (CID+11914) to which it was mapped in that the first character of this square katakana ligature was U+30D0 (バ), not U+30D1 (パ). After some discussion with Unicode principals, such as Ken Whistler and Joe Becker, this was eventually determined to be an error from the very beginning of Unicode, as U+0E3F ฿ THAI CURRENCY SYMBOL BAHT and U+332C ㌬ SQUARE PAATU were both included in Version 1.0 (1991). Unfortunately, it was too late to make a correction, due to stability guarantees that are still in place today. Neither the character’s name nor its decomposition (U+30D1 U+30FC U+30C4) could be changed.

How do we know that this was an error? If one looks at the purpose of the square katakana ligatures, they generally fall into one of two categories: currency symbols or units of measurement. If one considers “parts” as a unit of measurement, it would be represented as パート (pāto), not パーツ (pātsu). Besides, a square katakana ligature for the baht currency symbol is strangely absent, though it was added in Adobe-Japan1-4 (February of 2000).

This explains why I nuked the mapping for U+332C (Adobe-Japan1-4 CID+11914) from the UniJIS-UCS2-H CMap resource in July of 2000. That CMap resource was eventually deprecated at the end of 2002 in favor of the synchronized UTF-8, UTF-16, and UTF-32 CMap resources, and its mappings haven’t been modified since.

The glyph that corresponds to the intended form of U+332C (CID+11914), along with its vertical version (CID+11998), are present in OpenType/CFF Japanese fonts whose glyph set is Adobe-Japan1-4 or higher, and is accessible via the 'dlig' (Discretionary Ligatures) GSUB feature by entering the sequence バーツ (U+30D0 U+30FC U+30C4), and the vertical version is accessible via the 'vert' (Vertical Writing) GSUB feature. Both glyphs are shown below, rendered using Kozuka Gothic Pr6N Medium (小塚ゴシック Pr6N M):

The end result is that Adobe-Japan1-4 and greater fonts lack a glyph for U+332C, as shown in the excerpt below that is rendered using Kozuka Mincho Pr6N Regular (小塚明朝 Pr6N R):

The same is true for Source Han Sans and Noto Sans CJK, which lack a glyph for U+332C, and also lack glyphs that correspond to Adobe-Japan1-6 CIDs 11914 and 11998:

Interestingly, the only fonts on my computer that include a glyph for U+332C are Arial Unicode MS, Meiryo, Meiryo Italic, Meiryo Bold, Meiryo Bold Italic, MS Gothic, MS PGothic, MS Mincho, and MS PMincho. I am also told that GNU Unifont and Code2000 are fonts that also include a glyph for this character.

I am sometimes asked about the absence of a glyph for U+332C, and the purpose of this article is to provide a definitive answer. In other words, U+332C ㌬ SQUARE PAATU is a so-called ghost character (幽霊文字 yūrei moji in Japanese), and is treated as a mistake that cannot be corrected, which led to a workaround that involved removing the mapping for U+332C and to instead make the glyph for the intended character accessible through other means, specifically via the 'dlig' GSUB feature.

🐡

P.S. Click on the image at the top of this article to get an explanation for its title.

CJK Type Blog

CJK Fonts, Character Sets & Encodings.

“Bahts is [not] parts”

By Dr. Ken Lunde

Comments (0)

Created