In Part 1 and Part 2 of this series, we examined the ideographs that are tagged “K” (for ROK or South Korea), “P” (for DPRK or North Korea), and “J” (for Japan) in the kIICore property. In Part 3, which is today’s article, we will explore the 5,825 ideographs that are tagged “G” (for PRC or China).
The good news is that all of the ideographs in the most common sets for China, namely the first 3,500 ideographs in 通用规范汉字表 (Tōngyòng Guīfàn Hànzìbiǎo or TGH 2013) and the 3,755 ideographs of GB 2312 Level 1, are tagged “G” in IICore. Merging these two sets yields 3,874 unique ideographs, which leaves 1,951 of the 5,825 ideographs unaccounted for.
When I explored the next most important sets of ideographs for China, I found that 1,787 of the remaining 1,951 ideographs are in the second set of 通用规范汉字表, which has 3,000 ideographs, and that 1,771 of them are among the 3,008 ideographs of GB 2312 Level 2. Together, these two sets account for 1,847 of the remaining 1,951 ideographs, meaning that 104 are still unaccounted for.
Finally, I found that 75 of the remaining 104 ideographs are in the third set of 通用规范汉字表, which has 1,605 ideographs, leaving a mere 29 unaccounted for. The tables below list these 29 remaining ideographs, separated by kIRG_GSource source prefix:
|kIRG_GSource—GB 7589 unsimplified forms
|kIRG_GSource—GB 7590 unsimplified forms
* = There is an issue with U+6673 晳 and U+6FEC 濬 in that the actual GB 8565.2 standard does not include characters at code points 0x2D72 (13-82) or 0x2D59 (13-57). These ideographs are actually present in ISO-IR-165 at those code points. See Jaemin Chung’s IRG N2276 for more details.
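The running totals that lead to these 29 ideographs are simple subtraction, and can be double-checked with a short Python sketch. The counts are the ones quoted in this article; the function name and the step labels are illustrative assumptions, and no actual Unihan data is read.

```python
# Counts quoted in the article; nothing here reads the Unihan database.
TOTAL_G = 5825  # ideographs tagged "G" in kIICore

# (label, number of still-unaccounted ideographs covered at this step)
STEPS = [
    ("TGH 2013 first set + GB 2312 Level 1", 3874),
    ("TGH 2013 second set + GB 2312 Level 2", 1847),
    ("TGH 2013 third set", 75),
]


def remaining_after_each_step(total, steps):
    """Return (label, remaining) pairs after subtracting each step's coverage."""
    remaining = total
    result = []
    for label, covered in steps:
        remaining -= covered
        result.append((label, remaining))
    return result


if __name__ == "__main__":
    for label, remaining in remaining_after_each_step(TOTAL_G, STEPS):
        print(f"{label}: {remaining} unaccounted for")
```

With real sets of code points in hand, the same check could be written as successive set differences rather than subtraction of counts.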
Below is a modified version of the fifth table, which includes the five ideographs whose source references use the “GE” prefix, and which adds other source references from other properties. GB/T 16500 is interesting in a couple of ways. First and foremost, its 3,778 ideographs are simply meant to “fill in” URO (Unified Repertoire & Ordering) code points that otherwise lacked a kIRG_GSource property value, so they are effectively GBK characters. Second, as this tweet reports, the first two hexadecimal digits of all 3,778 source references are low by exactly 0x0F, and the source references in the table below reflect the corrections.
|Other Source References
|HB1-AB6F, J0-524A, KP1-38C9, K1-5730, T1-5033, V1-4D7A
|HB2-CBFA, J14-2468, KP0-D0EB, K0-4F26, T2-257A, V0-3438
|HB1-B841, J14-7227, KP1-5E72, K2-4B4C, T1-6548
|HB2-F0F9, J0-6D28, KP0-EDA4, K0-7432, T2-6364
|HB1-B6A2, KP0-F2D8, K0-7959, T1-6267, V2-907C
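The correction described above for the GB/T 16500 source references, whose first two hexadecimal digits are low by 0x0F, can be sketched in Python as follows. This is only a minimal illustration: the function name is hypothetical, it assumes values of the form “GE-” followed by four hexadecimal digits, and the sample value is made up rather than taken from the Unihan database.

```python
def correct_ge_source(value: str, offset: int = 0x0F) -> str:
    """Add `offset` to the first byte (first two hex digits) of a GE source reference.

    Assumes the form "GE-XXYY", where XXYY are four hexadecimal digits.
    """
    prefix, sep, code = value.partition("-")
    if prefix != "GE" or sep != "-" or len(code) != 4:
        raise ValueError(f"not a GE source reference: {value!r}")
    first_byte = int(code[:2], 16) + offset  # raises ValueError if not hex
    int(code[2:], 16)                        # validate the second byte too
    if first_byte > 0xFF:
        raise ValueError(f"corrected first byte out of range: {value!r}")
    return f"{prefix}-{first_byte:02X}{code[2:].upper()}"


if __name__ == "__main__":
    # Hypothetical input, chosen only to show the arithmetic: 0x21 + 0x0F = 0x30.
    print(correct_ge_source("GE-2121"))
```

Only the first byte is adjusted; the second byte is carried over unchanged, which matches the description that only the first two hexadecimal digits are affected.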
The fact that these five ideographs are tagged “G” in IICore is interesting. On one hand, their presence in the GB/T 16500 standard may suggest that they are not actually used in China. On the other hand, they may be used in some specific contexts. At the very least, each of them carries not only the “G” tag, but also at least one additional tag.
Stay tuned for Part 4 of this series…