Building UTF-32 CMap Resources

When using AFDKO to develop CID-keyed OpenType/CFF fonts, the most important CMap resources are the UTF-32 ones, for the following reasons:

  1. Unicode has become the de facto character encoding for today’s OSes and applications.
  2. When the font includes mappings outside the BMP (Basic Multilingual Plane), the Format 12 (UTF-32) ‘cmap’ subtable is included. When a font includes only BMP mappings, the AFDKO makeotf tool is smart enough not to create a Format 12 ‘cmap’ subtable, and instead creates only a Format 4 (BMP-only UTF-16) one.
  3. UTF-32 is arguably the most human-readable of the Unicode Encoding Forms, because its big-endian hexadecimal representation is simply the Unicode Scalar Value without the “U+” prefix and zero-padded to eight digits.
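
To make the second and third points concrete, here is a minimal Python sketch. The function names utf32be_hex and needs_format12 are made up for this illustration and are not part of AFDKO; the second one only mirrors the rule described above (any mapping above U+FFFF calls for a Format 12 subtable), not makeotf's actual implementation.

```python
def utf32be_hex(code_point: int) -> str:
    """Return the big-endian UTF-32 hexadecimal form of a Unicode scalar value."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {code_point:#x}")
    # The scalar value itself, zero-padded to eight hexadecimal digits.
    return f"{code_point:08X}"


def needs_format12(code_points) -> bool:
    """True when at least one code point lies outside the BMP (above U+FFFF)."""
    return any(cp > 0xFFFF for cp in code_points)


print(utf32be_hex(0x304B))               # U+304B (か)
print(utf32be_hex(0x2000B))              # U+2000B, outside the BMP
print(needs_format12([0x20, 0x304B]))    # BMP-only mappings
print(needs_format12([0x20, 0x2000B]))   # includes a non-BMP mapping
```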

The AFDKO makeotf tool is used to build a fully-functional font, and a UTF-32 CMap resource is specified as the argument of its “-ch” command-line option.
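
As a rough illustration, the following Python sketch assembles such a makeotf command line without running it. The helper name makeotf_command and the file names (cidfont.ps, features, CJKTypeBlogTest.otf) are assumptions for the example; “-f”, “-ff”, “-o”, and “-ch” are the makeotf options for the input font, the “features” file, the output font, and the horizontal CMap resource, respectively.

```python
import shlex


def makeotf_command(cid_font, cmap_h, features=None, output=None):
    """Build (but do not run) a makeotf argument list for a CID-keyed font."""
    cmd = ["makeotf", "-f", cid_font, "-ch", cmap_h]
    if features is not None:
        cmd += ["-ff", features]
    if output is not None:
        cmd += ["-o", output]
    return cmd


# Hypothetical file names, used only for illustration.
cmd = makeotf_command(
    "cidfont.ps",
    "CJKTypeBlogTest-UTF32-H",
    features="features",
    output="CJKTypeBlogTest.otf",
)
print(shlex.join(cmd))
```

The resulting list can be handed to subprocess.run once the files actually exist.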

When developing fonts that are based on one of the public ROSes, such as Adobe-CNS1-6, Adobe-GB1-5, Adobe-Japan1-6, or Adobe-Korea1-2, you simply use the appropriate UTF-32 CMap resources that are made available in the CMap Resources open source project that is hosted at Open @ Adobe. AFDKO includes the UTF-32 CMap resources for the public ROSes, and the makeotf tool invokes them automatically, but the latest versions are always available at Open @ Adobe.

Most of the public ROSes include only one UTF-32 CMap resource, so the choice is always clear, because there is no choice. And, the makeotf tool makes this non-choice choice for you. ☺

There are, however, several UTF-32 CMap resources associated with the Adobe-Japan1-6 ROS, so the appropriate choice depends on the purpose of the font:

  • The UniJIS-UTF32-H CMap resource is recommended for JIS90-savvy fonts.
  • The UniJIS2004-UTF32-H CMap resource is recommended for JIS2004-savvy fonts.
  • The UniJISX0213-UTF32-H and UniJISX02132004-UTF32-H CMap resources correspond to what is used for the Hiragino (ヒラギノ) fonts that are bundled with Mac OS X: these differ from UniJIS-UTF32-H and UniJIS2004-UTF32-H in that the code points for 65 symbols map to proportional glyphs instead of full-width ones.

When developing CID-keyed OpenType/CFF fonts that are based on an ROS other than a public one, including the special-purpose Adobe-Identity-0 ROS, you must build your own UTF-32 CMap resource. That is the topic of this particular CJK Type Blog article.

When building your own UTF-32 CMap resource, the most important data is a mapping from Unicode code points to CIDs. As long as you have that, the process is relatively simple. As a very simple and minimal example, let’s assume that the font includes the following four glyphs:

CID   Unicode Scalar Value
0     n/a (.notdef)
1     U+0020 (space)
2     U+304B (か)
3     U+304C (が)

 
Thus, U+0020 maps to CID+1, U+304B maps to CID+2, and U+304C maps to CID+3. The mappings are specified between the begincidchar and endcidchar operators, as shown in the complete CMap resource below (the non-boilerplate portions are in bold):

%!PS-Adobe-3.0 Resource-CMap
%%DocumentNeededResources: ProcSet (CIDInit)
%%IncludeResource: ProcSet (CIDInit)
%%BeginResource: CMap (CJKTypeBlogTest-UTF32-H)
%%Title: (CJKTypeBlogTest-UTF32-H Adobe Identity 0)
%%Version: 1.000
%%EndComments
 
/CIDInit /ProcSet findresource begin
 
12 dict begin
 
begincmap
 
/CIDSystemInfo 3 dict dup begin
  /Registry (Adobe) def
  /Ordering (Identity) def
  /Supplement 0 def
end def
 
/CMapName /CJKTypeBlogTest-UTF32-H def
/CMapVersion 1.000 def
/CMapType 1 def
 
/WMode 0 def
 
1 begincodespacerange
  <00000000> <0010FFFF>
endcodespacerange
 
1 beginnotdefrange
<00000000> <0000001f> 1
endnotdefrange
 
3 begincidchar
<00000020> 1
<0000304b> 2
<0000304c> 3

endcidchar
 
endcmap
CMapName currentdict /CMap defineresource pop
end
end
 
%%EndResource
%%EOF
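
Because everything except the resource name, the ROS, and the mappings is boilerplate, a resource like the one above can be generated from a table of Unicode-to-CID mappings. The Python sketch below is one way to do that; the function build_utf32_cmap is hypothetical and not part of AFDKO. It assumes that CID+1 is the space glyph (for the notdef range) and splits the mappings into blocks of at most 100 entries, which is the per-block limit for CMap operators such as begincidchar.

```python
def build_utf32_cmap(name, registry, ordering, supplement, mapping,
                     version="1.000"):
    """Return the text of a horizontal UTF-32 CMap resource.

    mapping is a dict of Unicode scalar value -> CID.
    """
    entries = sorted(mapping.items())
    out = [
        "%!PS-Adobe-3.0 Resource-CMap",
        "%%DocumentNeededResources: ProcSet (CIDInit)",
        "%%IncludeResource: ProcSet (CIDInit)",
        f"%%BeginResource: CMap ({name})",
        f"%%Title: ({name} {registry} {ordering} {supplement})",
        f"%%Version: {version}",
        "%%EndComments",
        "",
        "/CIDInit /ProcSet findresource begin",
        "",
        "12 dict begin",
        "",
        "begincmap",
        "",
        "/CIDSystemInfo 3 dict dup begin",
        f"  /Registry ({registry}) def",
        f"  /Ordering ({ordering}) def",
        f"  /Supplement {supplement} def",
        "end def",
        "",
        f"/CMapName /{name} def",
        f"/CMapVersion {version} def",
        "/CMapType 1 def",
        "",
        "/WMode 0 def",
        "",
        "1 begincodespacerange",
        "  <00000000> <0010FFFF>",
        "endcodespacerange",
        "",
        # Assumption: CID+1 is the space glyph, as in the example above.
        "1 beginnotdefrange",
        "<00000000> <0000001f> 1",
        "endnotdefrange",
        "",
    ]
    # Emit the mappings in blocks of at most 100 entries each.
    for start in range(0, len(entries), 100):
        chunk = entries[start:start + 100]
        out.append(f"{len(chunk)} begincidchar")
        out.extend(f"<{cp:08x}> {cid}" for cp, cid in chunk)
        out.append("endcidchar")
        out.append("")
    out += [
        "endcmap",
        "CMapName currentdict /CMap defineresource pop",
        "end",
        "end",
        "",
        "%%EndResource",
        "%%EOF",
    ]
    return "\n".join(out) + "\n"


cmap_text = build_utf32_cmap(
    "CJKTypeBlogTest-UTF32-H", "Adobe", "Identity", 0,
    {0x0020: 1, 0x304B: 2, 0x304C: 3},
)
print(cmap_text)
```

Writing cmap_text to a file named CJKTypeBlogTest-UTF32-H gives a resource that can be passed to the makeotf “-ch” option.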

 
Of course, the mappings could be more efficient, by using the begincidrange and endcidrange operators for the contiguous Unicode code points whose CIDs are also contiguous:

1 begincidchar
<00000020> 1
endcidchar
 
1 begincidrange
<0000304b> <0000304c> 2
endcidrange

 
However, the makeotf tool does not require the mappings in the CMap resource to be efficient, because it makes them efficient in the resulting ‘cmap’ table anyway. Therefore, there is no reason to go through this effort when building a UTF-32 CMap resource.
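
For those who still want compact CMap resources, the grouping is easy to automate. The Python sketch below uses a hypothetical helper, compact, that separates the mappings into single entries (for begincidchar) and ranges (for begincidrange). It also ends a range whenever anything other than the last byte of the code point changes, which is a conservative assumption that matches how ranges appear in the published CMap resources.

```python
def compact(mapping):
    """Split code point -> CID mappings into single entries and ranges.

    Returns (chars, ranges): chars holds (code_point, cid) pairs and
    ranges holds (first_code_point, last_code_point, first_cid) triples.
    """
    chars, ranges = [], []
    entries = sorted(mapping.items())
    i = 0
    while i < len(entries):
        first_cp, first_cid = entries[i]
        j = i
        # Extend the run while code points and CIDs both increase by one
        # and only the last byte of the code point varies.
        while (j + 1 < len(entries)
               and entries[j + 1][0] == entries[j][0] + 1
               and entries[j + 1][1] == entries[j][1] + 1
               and entries[j + 1][0] >> 8 == first_cp >> 8):
            j += 1
        if j == i:
            chars.append((first_cp, first_cid))
        else:
            ranges.append((first_cp, entries[j][0], first_cid))
        i = j + 1
    return chars, ranges


chars, ranges = compact({0x0020: 1, 0x304B: 2, 0x304C: 3})

print(f"{len(chars)} begincidchar")
for cp, cid in chars:
    print(f"<{cp:08x}> {cid}")
print("endcidchar")
print()
print(f"{len(ranges)} begincidrange")
for first, last, cid in ranges:
    print(f"<{first:08x}> <{last:08x}> {cid}")
print("endcidrange")
```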

I always recommend using a UTF-32 CMap resource regardless of whether it includes mappings outside the BMP. The Adobe-Korea1-2 UTF-32 CMap resource, UniKS-UTF32-H, includes only BMP mappings, and is used as the basis for all of our OpenType Korean fonts.

As a final note, not all CIDs must be mapped from UTF-32 code points. For example, vertical variants are accessed through the use of the ‘vert’ or ‘vrt2’ GSUB (Glyph SUBstitution) feature. It is possible to define these mappings in a corresponding vertical CMap resource whose final designator is “V” instead of “H,” and to specify it as the argument for the makeotf “-cv” command-line option, but it is considered better practice to define the appropriate GSUB features in the “features” file.
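
As a small illustration of the “features” file approach, the Python sketch below writes out a ‘vert’ feature definition from a table of horizontal-to-vertical CID pairs. The helper vert_feature and the CID pair (CID+4 to CID+5) are invented for the example and do not belong to the four-glyph font above. In the feature file syntax, a CID is written as a backslash followed by its number.

```python
def vert_feature(pairs):
    """Return a 'vert' feature block for horizontal CID -> vertical CID pairs."""
    lines = ["feature vert {"]
    lines += [f"  substitute \\{h} by \\{v};" for h, v in sorted(pairs.items())]
    lines.append("} vert;")
    return "\n".join(lines)


# Hypothetical CIDs: CID+4 is a horizontal form, CID+5 its vertical variant.
print(vert_feature({4: 5}))
```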

3 Responses to Building UTF-32 CMap Resources

  1. Simon Birtwistle says:

    Ken, you recommend vertical variants be encoded through a GSUB feature rather than explicitly stated in a CMap, but this then produces ‘implementation defined’ behaviour when such a font is used in a PDF but not embedded and subsetted (ie, without those replacements being made permanent in the repertoire and the GSUB table removed).

    The question of which GSUB features one must select when using an installed font to render a PDF is not answered in any version of the PDF spec, despite Acrobat ‘apparently’ using vert/vrt2, or some emulation of it. If one were to implicitly expect those features ‘switched on by default’ in creation software, then the necessary features would vary depending on locale, market and OS. Not to mention optional features and those that (according to Microsoft) should be switched on by default and ‘for which no UI indication should be given’!

    So, should vertical variants still be encoded in the GSUB if the font is to be used for PDFs? Should all OpenTypes with GSUBs be embedded and subsetted regardless of permissions? Should certain GSUB features be mandated in the spec when rendering PDF? Should the PDF FontDescriptor list the GSUB features used?

    Take your pick! Employing non-embedded OpenTypes in PDFs – which is often done in CJK documents because of the size of those fonts – is unreliable otherwise. This is visible now with vertical variants but the principle applies to any GSUB feature.

    • Good question, Simon.

      The behavior of AFDKO’s makeotf tool is that if a ‘vert’ GSUB feature is defined in the “features” file, the vertical CMap resource, regardless of whether it is explicitly specified on the command line as the argument of the “-cv” option, is ignored. If a ‘vert’ GSUB feature is not defined in the “features” file, the vertical CMap resource is used only for the purpose of synthesizing a ‘vert’ GSUB feature.

      Keep in mind that CMap resources perform vertical substitution at the “character code” level. In other words, a vertical variant can be specified only if its horizontal is encoded in the corresponding horizontal CMap resource. The ‘vert’ GSUB feature performs glyph substitution at the “glyph” level, which does not depend on the corresponding horizontal glyph being encoded. This is why defining a ‘vert’ GSUB feature is better.

      I believe that PDF uses the Adobe-Identity-0 CMap resources, Identity-H and Identity-V, for most of its internal rendering. I will point someone in “PDF Land” to your comment so that I can provide to you a more authoritative answer.

  2. Leonard Rosenthol says:

    Simon – the handling of ANY/ALL non-embedded fonts when rendering a PDF is mostly implementation dependent. Some parts are mandated (such as glyph selection and width computation), but layout of a text run that does not have explicit positioning is up to the viewer. Always has been.

    Since PDF is an open standard (ISO 32000), if you would like to see more of this type of thing mandated in the standard, PLEASE PARTICIPATE! There is no cost to join the ISO or your local TAG (Technical Advisory Group).

    Leonard Rosenthol
    PDF Architect
    Adobe Systems