This is a short tutorial about how to contribute to the fonts.
0. Get ready
- I really recommend to do the development on a native Linux system, because all necessary tools come already with most distributions. It doesn't really matter which distribution you use (I personally use Debian Etch and Ubuntu 8.04 Hardy Heron), as long as it is fairly recent.
- The fonts are produced using Fontforge a superb open source font editor. Get a recent copy of it (from 2008-03-30 or later) and install it on your development machine. As the fonts we work on are fairly large in size and therefor use a lot of memory when opened in Fontforge, I recommend to have at least 1GB of RAM.
The following screenshots show my recommended Fontforge preferences: Screenshot 01 Screenshot 02 Screenshot 03 Screenshot 04 Screenshot 05 Screenshot 06 - Get the development tarball of the fonts (they are already in fontforge's native sfd format): http://people.ubuntu.com/~arne/cjk-unifonts/uming_ukai_sfd.tar.bz2 (55MB)
- Untar them into your working directory: tar xvfj umingukai_sfd.tar.bz2_
- Never modify the original sfd files in place! Use the templates instead: http://people.ubuntu.com/~arne/cjk-unifonts/uming/umingTPL.sfd and http://people.ubuntu.com/~arne/cjk-unifonts/ukai/ukaiTPL.sfd
- As the UMing and Ukai fonts are based on the original free Arphic fonts, it's a good idea to have them on the system, too. (On Debian / Ubuntu use apt-get install ttf-arphic-bsmi00lp ttf-arphic-gbsn00lp (for the Ming/Song style fonts) and apt-get install ttf-arphic-bkai00mp ttf-arphic-gkai00mp (for the Kai style fonts). You will find the ttf files in /usr/share/fonts/truetype/arphic/. With other distributions the font packages might have different names.)
- Get agrep. It allows you to grep a text file with an AND search. (On Debian / Ubuntu: apt-get install agrep)
- One very handy text file, which lists all unique Han characters in Unicode 5.0, their radicals, radical index and remaining strokes and components they are built of in IDS (Ideographic Description Sequence) format: ids_rs.tar.gz (tarball) or ids_rs.zip (zipfile) Caveat: This ids_rs.txt file is not final, nor is it official or error free. It is not intended to be redistributed. If you want to have better data, use the CHISE project and the Unihan.txt file provided by Unicode.
- Untar the text file in your working directory: tar xvfz idsrs.tar.gz_
- Han characters can have different glyph shapes, depending on the region they are used in. Glyphs usually look different in China, Hong Kong, Taiwan, Japan, Korea and Vietnam, even if they share the same codepoint. PDF files, which show these different shapes for each codepoint in the Unicode CJK Unified Ideographs and CJK Unified Ideographs Extension A blocks are available here: https://specifications.iso.org/ittf/PubliclyAvailableStandards/c039921_ISO_IEC_10646_2003(E).zip (Please note that the glyph shapes have been submitted to Unicode some time ago. Japan has revised some glyph shapes in its latest JIS X0213-2004 standard.)
In this PDF, the glyphs for Hong Kong are missing. Therefor I produced a similar PDF for Hong Kong glyphs by myself, using the Ming font which is provided by the HKSAR government: http://people.ubuntu.com/~arne/cjk-unifonts/CJK_Glyphs_HK.tar.bz2 (Untar them with: tar xvfj CJKGlyphs_HK.tar.bz2_)
1. Basics
Now that you've downloaded all this stuff, we are ready to begin.
1.1. ids_rs.txt
If you have downloaded the ids_rs.txt file, now it's the time to take a look at it. You can open it with any text editor. However, as this text file contains all Han characters from Unicode 5.0, including those of Extension B, you would need to have a font installed to actually display the characters. I found the Hannom fonts (they are free to download and use, but not free to modify or redistribute) to be perfect for this task. Install them on your system, on Linux, just copy them to your ~/.fonts/ directory and call fc-cache ~/.fonts/ to update the cache.
The text file has multiple columns:
- Codepoint
- Character
- Radical
- Radical Index / Additional Strokes
- IDS Sequences (can be more than one column)
- Remarks (optional)
1.2. IDS (Ideographic Description Sequence)
The IDS describes how a Han character looks like. That is, if a specific Han character is not encoded in Unicode yet, you could use IDS to describe how the character looks like. To do this, IDS consists of description characters and Han characters. The description characters are:
- U+2FF0 [[!img http://people.ubuntu.com/~arne/cjk-unifonts/png/2FF0.png] IDEOGRAPHIC DESCRIPTION CHARACTER LEFT TO RIGHT
- U+2FF1 [[!img http://people.ubuntu.com/~arne/cjk-unifonts/png/2FF1.png] IDEOGRAPHIC DESCRIPTION CHARACTER ABOVE TO BELOW
- U+2FF2 [[!img http://people.ubuntu.com/~arne/cjk-unifonts/png/2FF2.png] IDEOGRAPHIC DESCRIPTION CHARACTER LEFT TO MIDDLE AND RIGHT
- U+2FF3 [[!img http://people.ubuntu.com/~arne/cjk-unifonts/png/2FF3.png] IDEOGRAPHIC DESCRIPTION CHARACTER ABOVE TO MIDDLE AND BELOW
- U+2FF4 [[!img http://people.ubuntu.com/~arne/cjk-unifonts/png/2FF4.png] IDEOGRAPHIC DESCRIPTION CHARACTER FULL SURROUND
- U+2FF5 [[!img http://people.ubuntu.com/~arne/cjk-unifonts/png/2FF5.png] IDEOGRAPHIC DESCRIPTION CHARACTER SURROUND FROM ABOVE
- U+2FF6 [[!img http://people.ubuntu.com/~arne/cjk-unifonts/png/2FF6.png] IDEOGRAPHIC DESCRIPTION CHARACTER SURROUND FROM BELOW
- U+2FF7 [[!img http://people.ubuntu.com/~arne/cjk-unifonts/png/2FF7.png] IDEOGRAPHIC DESCRIPTION CHARACTER SURROUND FROM LEFT
- U+2FF8 [[!img http://people.ubuntu.com/~arne/cjk-unifonts/png/2FF8.png] IDEOGRAPHIC DESCRIPTION CHARACTER SURROUND FROM UPPER LEFT
- U+2FF9 [[!img http://people.ubuntu.com/~arne/cjk-unifonts/png/2FF9.png] IDEOGRAPHIC DESCRIPTION CHARACTER SURROUND FROM UPPER RIGHT
- U+2FFA [[!img http://people.ubuntu.com/~arne/cjk-unifonts/png/2FFA.png] IDEOGRAPHIC DESCRIPTION CHARACTER SURROUND FROM LOWER LEFT
- U+2FFB [[!img http://people.ubuntu.com/~arne/cjk-unifonts/png/2FFB.png] IDEOGRAPHIC DESCRIPTION CHARACTER OVERLAID The Han characters are the components the desired character is made of. The syntax is: First the IDC followed by the components (from outside to inside, from left to right, from top to bottom).
Examples:
- U+4E69 乩 -> [[!img http://people.ubuntu.com/~arne/cjk-unifonts/png/2FF0.png]占乚
- U+4E6A 乪 -> [[!img http://people.ubuntu.com/~arne/cjk-unifonts/png/2FFA.png]乙田
- U+4E6B 乫 -> [[!img http://people.ubuntu.com/~arne/cjk-unifonts/png/2FF1.png]加乙
- U+55C0 嗀 -> [[!img http://people.ubuntu.com/~arne/cjk-unifonts/png/2FF0.png][[!img http://people.ubuntu.com/~arne/cjk-unifonts/png/2FF1.png][[!img http://people.ubuntu.com/~arne/cjk-unifonts/png/2FF3.png]士冖一口殳 The description for U+55C0 嗀 reads like: ([[!img http://people.ubuntu.com/~arne/cjk-unifonts/png/2FF0.png]([[!img http://people.ubuntu.com/~arne/cjk-unifonts/png/2FF1.png]([[!img http://people.ubuntu.com/~arne/cjk-unifonts/png/2FF3.png]士冖一)口)殳 ).
1.3. How to use this?
Most of the Han characters in Unicode can be composed out of other Han characters. By far the most cases use a LEFT TO RIGHT composition ([[!img http://people.ubuntu.com/~arne/cjk-unifonts/png/2FF0.png]), the second most common is ABOVE TO BOTTOM ([[!img http://people.ubuntu.com/~arne/cjk-unifonts/png/2FF1.png]). Almost all of these use a radical in either the LEFT, RIGHT, TOP or BOTTOM position (e.g. LEFT: 亻 冫 口 土 女 山 彳 忄 扌 日 月 氵 ..., RIGHT: 刂 支 攵 阝 ..., TOP: 入 八 冖 宀 艹 ..., BOTTOM: 乙 灬 ...).
- Example: We want to add a new glyph to the font. Let's take U+8281 芁.
This character is composed of 几 and the 艹 radical component. The 艹 radical component sits on TOP of 几, so the IDS for this character would be http://people.ubuntu.com/~arne/cjk-unifonts/png/2FF1.png艹几. The 艹 component in this case does not take much space in the glyph, that means the BOTTOM part 几 has to use more space in the resulting glyph than the radical component. Our goal is now to find an existing glyph in the font with a similar arrangement (radical on top, 几 on bottom, the radical does not use much space). Therefor, we can use agrep to filter our ids_rs.txt file: agrep '⿱;几' idsrs.txt | less_ . This means we search all lines which have [[!img http://people.ubuntu.com/~arne/cjk-unifonts/png/2FF1.png] AND 几 and display them with the pager less.
The result in this case is quite long, so we can filter some more... as we are looking for a 几 with a radical component on TOP, we know that the additional strokes (means in addition to the radical component) is 2. Let's put this into our agrep search string: agrep '.2;⿱;几' idsrs.txt_ . Et voilà: the list is much shorter now. (Screenshot)
From this search result the character U+5197 冗 冖 014.2 ⿱冖几 jumps right into sight and seems to be a perfect candidate. Now, assume we have already loaded the font (e.g. UKai) and the matching template (e.g. ukaiTPL) in Fontforge, we can take a look if U+5197 already exists in our font. View -> Goto -> U+5197 reveals that we are lucky (Screenshot). Now we open the glyph with a double click and select the 几 part by carefully double clicking on the spline (Screenshot). Then we copy the selection into the clipboard by pressing CRTL+C. In the template file window we go to U+8281, our missing character: View -> Goto -> U+8281 (Screenshot).
Double click on the empty character and paste the 几 component into it (CRTL+V) (Screenshot). You can see that we have 3 layers available in the editing mode: Front, Back and Guide. Now we have to find a suitable radical component, which fits in size and slant to the 几 we already pasted. For this, we can now switch back to our main font window (e.g. UKai) and go to the same codepoint like in the template: View -> Goto -> U+8281 (which is empty of course) (Screenshot). Now we look around this position if any of the surrounding glyphs has a promising radical components, which we could borrow. In this case U+8293 芓 looks like a good candidate (Screenshot).
Double click on that character, select the radical component by double clicking on it and copy it to the clipboard (CRTL+C) (Screenshot). Now switch back to the template file, double click on the character we want to edit (if you have closed that window before), select the Back layer and paste the radical component into it (CRTL+V) (Screenshot). We can now see if the radical component matches our BOTTOM component. As we used two layers, they won't conflict with each other. Means moving one of the parts around won't disturb the other. In this case it's a perfect match and we don't need to do any further modification. Let's move the radical component onto the Front layer into it's final position: select the radical component, CRTL+X, switch to the Front layer and paste with CRTL+V (Screenshot). If the result looks good, we can take care of the next glyph (Screenshot).
Now, what if we would need to move a component around to make it fit? For this, I found it to be a good idea not to use the mouse to drag the component around, but the arrow keys on the keyboard. Pressing the arrow keys will move the selection one decimal point, holding down the ALT key while pressing the arrow keys will move the selection 10 decimal points.
2. More advanced stuff
For an exhaustive tutorial about how to use Fontforge, please see the Fontforge website.
2.1. Resizing
In this example, we create the character U+2A6A5 𪚥, which simply consists of 4 "dragons" (龍) using the UMing font.?BR This glyph could be divided into two horizontal parts ([[!img http://people.ubuntu.com/~arne/cjk-unifonts/png/2FF0.png]龍龍 stacked on top) or two vertical parts ([[!img http://people.ubuntu.com/~arne/cjk-unifonts/png/2FF1.png]龍龍 side by side). If we scan through our ids_rs.txt file (grep 龍 idsrs.txt_), we will find the perfect candidate: U+9F98 龘, which consists of 3 "dragons", one on top and two side by side below. This constellation ensures, that the height of the upper and lower parts are about the same, the ideal situation for our target glyph.
- We open our source font (UMing in this case) and find our candidate glyph: View -> Goto -> U+9F98
- We open the editing window by double clicking on that character (Screenshot 1)
- As we only want to copy the lower half of the glyph, we select the parts we want to copy. The bottom right "dragon" is conjoined with the "dragon" on top: select a rectangle around the area we want to copy. Then press CRTL+C. (Screenshot 2)
- In our template file, we find our target character: View -> Goto -> U+2A6A5
- We open the editing window of that character by double clicking on it.
- Now, we can paste our selection into it. CTRL+V (Screenshot 3)
- With the pasted stuff still selected, we zoom in a bit to get a better look on the splines and points involved. Press CRTL+ALT+SHIFT+= .
- Now we can remove those points and splines we don't need. Select them and press Delete. (Screenshot 4)
- If you compare the right "dragon" with the left one, you will notice that at those points, where the bottom right "dragon" was conjoined with the top "dragon", the splines are not closed now (Screenshot 5). We need to fix this:
- Copy the missing parts from the left "dragon" one by one: CRTL+C. (Screenshot 6)
- Paste the selection at the same position: CRTL+V
- Move the pasted part across to the right dragon by using the arrow keys. The arrow keys on your keyboard move the selection exactly one point at the time into the desired direction. Using the mouse to drag would probably distort the glyph when you connect the parts. So, be careful! Holding the ALT key pressed while using the arrow keys, will move the selection 10 points at a time into the desired direction. (Screenshot 7)
- Connect the pasted selection with the right "dragon" by using the arrow keys on your keyboard. (Screenshot 8)
- Do the same for the other missing part. (Screenshot 9) (Screenshot 10) (Screenshot 11)
- Now, the bottom part of our target glyph is complete. We can try to duplicate it and stack it on top: select the whole thing by drawing a rectangle around the glyph with your mouse, press CRTL+C and CRTL+V to paste it at the same position (Screenshot 12). Use ALT+ Arrow Up to move the whole thing on top (Screenshot 13). Zoom out to get a better overview how the result looks like. (View -> Fit).
- You will notice, that our result is now too high and sticks out of our bounding box (Screenshot 14). We need to fix this. The best way to do this is to "resize" the bottom part and then duplicate it again.
- remove the upper half of the glyph by selecting the upper part and press the Delete key. If that part is still selected from our moving step, just press the Delete key.
- Zoom in to get a better view: CRTL+ALT+SHIFT+=
- Now we need to skew the two "dragons" vertically. Don't use any of fontforge's builtin Transformation functions, we want to keep the stem widths intact! Instead find a good position to do the resizing manually. I suggest to make the space between the horizontal strokes of the "meat" part ⺼ and the corresponding horizontal strokes of the right part of the "dragon" smaller.
- Select the upper part of the two "dragons" carefully, then press the Arrow Down key 5 times (Screenshot 15).
- You'll see that the "meat" part has two points on the spline we want to resize. As they have no function here, we can delete them. Select the two points in each "meat" part and press Delete (Screenshot 16). Now the spline is open (Screenshot 17). Select the lower point of the spline, select the "Add corner point" tool [[!img http://people.ubuntu.com/~arne/cjk-unifonts/png/ff_corner_point.png] from the toolbar and click on the upper point to close the spline (Screenshot 18). Select the "Pointer" tool [[!img http://people.ubuntu.com/~arne/cjk-unifonts/png/ff_pointer.png] to finish the task. Repeat for the other "meat" part on the right side.
- Now include one more horizontal stroke in each part and repeat. Also 5 points down (Screenshot 19).
- In total, we skewed our bottom half of the character by 10 points now. Let's see if it fits now.
- Select the whole thing, press CRTL+C followed by CRTL+V, move the pasted selection up by using ALT+ Arrow Up a few times. Et Voilà, it fits (Screenshot 20). Now, why not just use the single "dragon" U+9F8D 龍, scale it by 50% and duplicate it three times and move the parts into their correct position? Would be a lot easier, wouldn't it?
Well, let's compare the result: (Screenshot 21)
On the left hand side the proper glyph, on the right hand side the "scaled" one: as we can easily see, the "scaled" glyph is much thinner than the rest of the glyphs in the font. When reading a text containing this character, this glyph will stick out as being too thin compared to the other ones and will probably look blurry because of that. In other words: it looks ugly!
2.2. Editing splines
2.3. Glyph variants
CJK characters are used in different regions and can look different in each region, although the share the same codepoint. If you have downloaded the PDFs I mentioned in Section 0, item 10, you can see which glyph has been submitted to Unicode by each national body for each codepoint used in that region.
The regions are:
Region | Internal abbrevation | UMing font flavor | UKai font flavor |
China | C | AR PL UMing CN | AR PL UKai CN |
Taiwan | T | AR PL UMing TW | AR PL UKai TW |
Japan | J | AR PL UMing JP | AR PL UKai JP |
Korea | K | AR PL UMing KR | AR PL UKai KR |
Vietnam | V | AR PL UMing VN | AR PL UKai VN |
Hong Kong | H | AR PL UMing HK | AR PL UKai HK |
To produce these font flavors, we collect all regional flavored glyphs in the same font file and assign tags to them, depending on for which region they should be used. For example: U+4EE4 令 has 3 different glyph shapes: C, TVH and JK. Therefor, we have three glyphs: uni4EE4.C, uni4EE4.TVH and uni4EE4.JK (Screenshot).
Short Excursion: Codepoints, Encodings, Slots, Glyphs and more
Codepoints and Encodings: In an application (e.g. a text editor) you type some bytes. The encoding maps these bytes to internal index numbers, the codepoints. In recent years ISO 10646-1, commonly known as "Unicode", has become the de-facto standard to represent texts. For example the codepoint 4EE4 in ISO 10646-1 has been assigned the character 令. To use this character on the computer, we need to use a byte sequence which maps to this ISO 10646-1 codepoint. The most common encoding for this on to-date Linux systems is "UTF-8". In UTF-8 the byte sequence, which represents the ISO 10646-1 codepoint 4EE4 is: 0xE4 0xBB 0xA4.
- Slots and Glyphs: To be able to read an "encoded" text (e.g. with the UTF-8 encoding), we need some graphical representation to map the byte sequence to something human readable. (You can open a non-Latin UTF-8 encoded text file in a hex editor and try to read the text. You will see what I mean. ;-) ) To do this mapping, we use fonts. A font is a collection of drawings (glyphs), which can map to one or more ISO 10646-1 codepoints. These glyphs are organized into slots. A TrueType font can have a maximum of 65536 slots and therefor 65536 glyphs. The glyphs in a font can have a mapping to one or more codepoints, but don't need to have one. Glyphs, which don't have a mapping to a codepoint are usually invisible to the user. Now why would we want to have invisible (unassigned) glyphs?
Well, in most cases the reason is, that we don't want them to be visible for the general public all the time, but only in some specific situations. In TrueType fonts this is done by using "features". The probably most common case for this is "contextual shaping" used in many writing systems, or "ligatures". Here, having a specific codepoint combination in a text would result in a different (unencoded) glyph than they would normally map to individually. - Variation Selectors: Unicode has assigned a codeblock to contain "Variation Selectors". That is: if a Han character gets followed by a Variation Selector, a different glyph (the variant shape) is selected from the font instead of the usual one. The details are described here: http://www.unicode.org/ivd/. There is currently one set of Variants registered. This comes from Adobe and is for Japanese Kanji. Certain Kanji can have a different shape, depending on the context as used in the text. Nevertheless, it's still the same codepoint. To display the desired shapes, one can use the registered Variation Selector sequences with a font capable of displaying the differences.
It is my aim to include as many of these variants as possible. - Stylistic TrueType Features: TrueType knows several "features" which can be used to select a different glyph from a font than the one assigned with that codepoint. These can be combined with a language (zh, ja, ko, ...) and scripting (Hans, Hant, Hani, ...) tag to make a specific variant be selected automatically in a given language environment. However, since this relies heavily on the applications the user uses and the vast majority of available applications don't support this, it is usually not used in fonts either.
- TrueType Collections: TrueType Collections are basically multiple TrueType fonts bundled in a single .ttc file. If we have multiple single .ttf fonts, which share a large amount of glyphs and they are assigned to the same codepoints (like in this project), we can combine them into a single .ttc file. The advantage is that the glyphs which are shared between these fonts are only stored once. This saves a lot of space.
Technically, each single unique glyph is stored only once in the .ttc file. It then has multiple font headers and cmap tables. Each cmap table contains the mapping between glyph and codepoint. This way the font AR PL UKai CN can have a mapping to glyph uni4EE4.C for the codepoint U+4EE4, while the fonts AR PL UKai HK, AR PL UKai TW and AR PL UKai VN have a mapping to glyph uni4EE4.TVH instead.
This technology requires:- a specific naming scheme throughout all variant glyph,
- all glyphs (including all variants) to be stored in the same font
- only one of the variants can be assigned to a codepoint at any time. The others are "unassigned".
- some magic behind the scenes when building the fonts, so that the appropriate glyphs get assigned the codepoint they should be used with, while all other variants for the same codepoint must be "unassigned".
- Example:
We want to add a new glyph, which is a variant of U+8281 芁 we have created earlier in this tutorial. In this case the radical component 艹 has different shapes in China and Japan on the one hand and in Taiwan on the other hand. When we created the first glyph of this character, I chose to pick the Traditional Chinese form of the 艹 radical. Now, we are going to add the variant with the Simplified Chinese version of the 艹 radical. - If you haven't done so already at the end of the first example, save the file to a new filename.
- Change the font encoding to be "Glyph Order" (Encoding -> Reencode -> Glyph Order). Now it only displays one glyph in Slot 1.
- Open the glyph (double click), select the 几 part and copy it into the clipboard (CRTL+C).
- Close the glyph editing window.
- Add another encoding slot: Encoding -> Add Encoding Slots -> 1 (Screenshot)
- Open the new empty slot by double clicking on it.
- Paste the 几 part into the editing window (Screenshot).
- Now we need the simplified radical version. If you have the original Arphic fonts lying around, it's a good idea to open the gkai00mp.ttf font file and look for a matching radical component there.
- I found the character U+82AC 芬 to have a suitable radical component (Screenshot). Copy that one and paste it into the glyph editing window in the Back layer (Screenshot). If it fits, move it over to the Front layer.
- Close the editing window (Screenshot).
- Now we need to rename the traditional variant. Right mouse click on the glyph and choose "Glyph Info" (Screenshot).
- In the "Unicode Name" field, we append .T, since this form is used only in Taiwan (Screenshot). Then click OK.
- Now the same game for the simplified variant. Right mouse click on the glyph and choose "Glyph Info".
- In the "Unicode Name" field type uni8281.CJ, since this variant is used both in China and Japan (Screenshot).
- Make sure that the "Unicode Value" field is -1, this glyph is unassigned, because the traditional form has already the codepoint assigned and only one of them can have the codepoint at any time.
- Now we are done with the font part, we should have two glyphs now, uni8281.T and uni8281.CJ (Screenshot). Save your work! :)
- In order for the build script to do the right thing, we need to create a small text file for each font flavor. Just name them "vars_cn.txt", "vars_jp.txt" and "vars_tw.txt" in this case. In each of these text files type the Unicode codepoint, followed by a tab space and the glyph name which should be used in each font flavor. In this case, "vars_cn.txt" and "vars_jp.txt" would each get the line 8281
uni8281.CJ and "vars_tw.txt" would get a line 8281uni8281.T . When you add more glyphs, just append the entries to these files (Screenshot). - Later, when you are done, zip or tar the .sfd and .txt files up and send them to me for inclusion into the main fonts. Currently it's best to send me this stuff via e-mail. I'm working on setting up a revision control system for this project, but because of the sheer size of the fonts, I'm currently hitting the wall. But I hope that over time this issue will get solved.