[Gnome-print] Unicode, Japanese and other issues

From: Lauris Kaplinski <lauris helixcode com>
To: gnome-print helixcode com
Subject: [Gnome-print] Unicode, Japanese and other issues
Date: 25 Nov 2000 01:17:39 -0200
Hello!

I think that it is time to start constructive work now, and to find together
the best way to solve the problems. I think nobody is lacking good will; people
are just lacking knowledge of other locales, fonts etc.

> OK, I see your (and gnome-print's?) plan for the legacy but, for another
> example, does Evolution development team in HelixCode also think so?
> Evolution seems to send/receive only UTF-8 mails, but almost all mailers
> in Japan only handle JIS code, so at least it needs an optional feature to
> read/write a JIS code mail. UTF-8 mails are still unacceptable in Japan.
> Our staff Tagoh have sent some patches already, but it seems to be difficult
> for Evolution to accept them until its release.

Hmmm...
1. Receiving mail. Evolution has to be able to receive mail in whatever
encoding, as long as it conforms to RFC-xyz, i.e. specifies charset correctly.
2. Sending mail. Sending UTF-8 is correct per the RFC, if properly marked. But
if dominant Japanese mailreaders cannot understand it, and there is (hopefully)
a potential market for Evolution in Japan, I am quite sure we can figure
something out.
I think that maybe a simple selection of preferred incoming/outgoing
charsets would be OK.
Incoming will be used if a message violates the RFC, i.e. it uses an 8-bit
charset without proper charset specification. If we accept the RFC violation,
which we do, we can as well expect it to be JIS instead of iso-8859-1.
Outgoing will be used if the text can be encoded to it without loss (i.e.
we can encode text in JIS if it contains basic latin + japanese, but if
there are also cyrillic characters, utf-8 is the only way).
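
A minimal sketch of that selection rule, in Python rather than Evolution's
C, just to make the idea concrete. The function names, the default charset
and the use of Python's "iso2022_jp" codec as a stand-in for "JIS" are all
my assumptions, not existing Evolution code. The second sample uses Hangul
as an example of text the Japanese charset cannot represent.

```python
def pick_outgoing_charset(text, preferred="iso2022_jp", fallback="utf-8"):
    """Return (charset, payload).

    The preferred charset is used only if the text encodes without loss;
    otherwise we fall back to utf-8. Names and defaults are hypothetical.
    """
    try:
        return preferred, text.encode(preferred)
    except UnicodeEncodeError:
        return fallback, text.encode(fallback)


def decode_incoming(raw, declared=None, preferred="iso2022_jp"):
    """Decode a message body.

    A correctly labelled message uses its declared charset; an unlabelled
    one (the RFC violation case) is assumed to be in the preferred charset.
    """
    return raw.decode(declared or preferred)


if __name__ == "__main__":
    print(pick_outgoing_charset("Hello 日本語")[0])    # preferred charset fits
    print(pick_outgoing_charset("日本語 한국어")[0])   # needs the fallback
```

A real implementation would also have to set the Content-Type header to
match whichever charset was picked.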

> Evolution will be introduced with GNOME 1.4 as one of the important GNOME
> features, but our people will never find it useful without JIS code and
> XIM support. I'm very afraid of that, and of such cases. (Japanese don't
> use pan, balsa, xchat, gnome-pilot, gtkhtml for same reason.)

I still think that Evolution can be adapted to Japanese much more easily,
because almost all of its input/output is collected in a few places,
namely:
gal (e-font, e-unicode): ETable, EText font handling. If we have to make
modifications to support other input methods, it would be reasonable to
write a wrapper there as well (currently we have a trivial wrapper to support
simple 8-bit xkb keyboards).
gtkhtml input/output handling also happens through gal.

The current logic is the following:

1. Input:
GdkEventKey -> translated by gal to utf-8 string -> inserted into utf-8 text

2.1. Output to EFont based text (ETable, EText, GtkHTML):
utf-8 text -> efont output wrapper
2.2. Output to Gtk Widgets:
utf-8 text -> e-unicode wrapper -> Gtk+ native encoding

EFont is a wrapper that contains pointers to 2 GdkFonts (plain and bold),
and transcoders (unicode_iconv) for converting utf-8 text into the native
encoding of those fonts.

E-unicode text wrapping does basically the same. It looks at the font
encoding, and tries to get utf8->native & native->utf8 transcoders. One or
the other is then used, depending on whether the user is writing
(e_utf8_gtk_entry_set_text) or reading (e_utf8_gtk_entry_get_text) text.
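
To illustrate the pair of transcoders, here is a rough Python sketch. It is
not the gal code: the class and method names are made up, and Python's
built-in codecs stand in for unicode_iconv. Characters the native encoding
cannot represent are replaced with '?' here, which is only one possible
policy.

```python
class FontTranscoder:
    """Hypothetical stand-in for the utf8<->native transcoder pair of a font."""

    def __init__(self, native_encoding):
        # Encoding of the underlying font, e.g. "iso-8859-1" (assumed to be
        # discovered from the font name in the real code).
        self.native_encoding = native_encoding

    def to_native(self, utf8_bytes):
        """Writing direction: utf-8 text -> bytes in the font's encoding."""
        text = utf8_bytes.decode("utf-8")
        return text.encode(self.native_encoding, errors="replace")

    def from_native(self, native_bytes):
        """Reading direction: bytes in the font's encoding -> utf-8 text."""
        return native_bytes.decode(self.native_encoding).encode("utf-8")


if __name__ == "__main__":
    latin1 = FontTranscoder("iso-8859-1")
    print(latin1.to_native("café".encode("utf-8")))
    print(latin1.from_native(b"caf\xe9").decode("utf-8"))
```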

Now, there are several problems:
1. I do not want to make it overcomplex. It is meant as a temporary wrapper
around fonts that will be replaced by native utf8 in gtk2.0. One of the main
reasons I wanted to use those was the availability of iso-10646 fonts (which
are 2-byte fonts and cannot be used natively).
2. While writing EFont/EUnicode, I had only a vague idea of what gdk_fontmaps
exactly do. So the code is designed with only a single font in mind. But that
is fixable.

How to adapt it to Japanese?
The most important point is probably that we have to accept fontsets
transparently for EFont (currently only the 1st font of a fontset is analyzed).
It should simply mean duplicating all font analyzing data.
Issues I do not know about:
Are there reasonable fontsets with a high number of fonts (> 2)?
How is font selection (from a fontset) done while printing JIS? I suppose
that happens transparently in xlib? So if we know that an EFont is built
from an iso-8859-1 + JIS font, can we simply transcode utf8 to JIS, and
use gdk wchar text drawing?

Another issue is selecting the proper font to be used. But that is a UI issue,
and hopefully can be addressed with a little good will.

>> PS. Gnome-print is usable for CJK languages with very minimal changes. The
>> only thing you have to guarantee is correct translating from UTF-8 to
>> font native glyph mapping. Please note, that it is NOT 1:1 mapping for
>> European languages at the moment, but instead it is a font-specific one, and
>> constructed during font loading from PostScript glyph names. So supporting
>> CJK (in trivial case) means simply a standard way for populating CJK
>> character block in font unicode mapping, with correct glyph codes.
>
> Really? 
>
>> Btw: Are there C/J/K PostScript fonts for free download somewhere? I would
>> be very interested to test them out, but I have not found any.
>
> Sorry I can't find downloadable CJK PostScript font anywhere.
> Some printers have their own built-in fonts, and we usually test with our printer
> or our Japanized version of ghostscript. (We still need to make patches for every
> ghostscript. It's also a headache...)
>
> But there are many ttf font and some Adobe CID fonts.
>
> CIDFont: ftp://ftp-pac.adobe.com/pub/adobe/acrobatreader/unix/4.x/
>  jpnfont for Japanese
>  korfont for Korean

OK. At the moment gnome-print can use only type1 fonts (there is experimental
TTF support in gnome-font, but I am not sure whether it is production
stable for 1.4).

I outline the current logic used by gnome-print:

1. When a font is loaded, its afm file is scanned. All glyph definitions found
there are saved in the GnomeFont structure sequentially. So for most western
fonts space is glyph 1 (glyph 0 is always an empty square), '!' is glyph 2 and
so on.

2. While building the glyph map, we try to find the unicode value from the
glyph's PostScript name. If the unicode value is not known, one will be
assigned in the unicode private space (U-E000 upwards). A unicode->glyph map
is constructed for every font.

3. Now, while actually printing, we either download the font or reencode the
resident font into a 16-bit composite font, with glyph mapping identical to
the GnomeFont one (i.e. space is usually 1 etc.).

4. Utf-8 characters are converted to font glyph values, using the given font's
unicode->glyph mapping.

5. Those glyph values are then used for printing (as the font is encoded into
a 16-bit composite, we use 2 bytes per glyph). So our PostScript output does
not contain any readable text.
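
Steps 1, 2, 4 and 5 can be sketched roughly as follows. This is Python for
illustration only, not gnome-print code: the function names, the tiny
name->unicode table and the big-endian byte order of the 2-byte glyph values
are my assumptions. A real table would cover the full list of PostScript
glyph names.

```python
import struct

# Illustrative subset only; a real table would be far larger.
GLYPH_NAME_TO_UNICODE = {"space": 0x20, "exclam": 0x21, "A": 0x41, "eacute": 0xE9}
PRIVATE_USE_START = 0xE000


def build_unicode_to_glyph(afm_glyph_names):
    """Number glyphs sequentially from 1 (0 = empty square) and map
    unicode values to those glyph numbers."""
    mapping = {}
    next_private = PRIVATE_USE_START
    for glyph_index, name in enumerate(afm_glyph_names, start=1):
        code = GLYPH_NAME_TO_UNICODE.get(name)
        if code is None:
            # Unknown glyph name: hand out the next private-use value.
            code = next_private
            next_private += 1
        # Keep the first glyph if two names resolve to the same value.
        mapping.setdefault(code, glyph_index)
    return mapping


def encode_for_composite_font(utf8_bytes, unicode_to_glyph):
    """Convert utf-8 text to 2 bytes per glyph; unmapped characters
    become glyph 0 (the empty square)."""
    glyphs = [unicode_to_glyph.get(ord(ch), 0) for ch in utf8_bytes.decode("utf-8")]
    return struct.pack(">%dH" % len(glyphs), *glyphs)


if __name__ == "__main__":
    glyph_map = build_unicode_to_glyph(["space", "exclam", "A", "customlogo"])
    print(encode_for_composite_font("A!".encode("utf-8"), glyph_map).hex())
```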

Now, as you can see, 16-bit reencoded fonts are used even for printing
basic latin languages. AFAIK, it should be quite easy, to add support
for Japanese/Chinese too (but as I said, current gnome-print only uses
Type1 fonts).

I could do that myself, but I have no idea how to handle Japanese
Type1 fonts (and I do not have any here either). These have to be composite
fonts already, because a base Type 1 font can encode only 256 glyphs at a time.

There are basically 2 ways:
1. Reencode those fonts, build a unicode->glyph map, translate utf8 to
glyph values, print those in 16-bit (exactly as is done with western
fonts).
2. Add a tag to the font, indicating that it uses JIS or BIG5 or whatever
encoding. While printing, translate utf8 to that encoding, and print
using whatever is the standard subfont selection for those encodings.
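
The second way could look roughly like this. Again a Python sketch under
assumptions: the TaggedFont type, its fields and the font names are
invented, and Python codec names ("euc_jp", "big5") stand in for whatever
tag the font would really carry. Subfont selection is left out entirely.

```python
from dataclasses import dataclass


@dataclass
class TaggedFont:
    """Hypothetical font record carrying an encoding tag."""
    name: str
    encoding_tag: str  # a Python codec name used as a stand-in for the tag


def bytes_for_printing(font, utf8_bytes):
    """Translate utf-8 text into the encoding the font is tagged with.

    Raises UnicodeEncodeError if the text does not fit that encoding.
    """
    return utf8_bytes.decode("utf-8").encode(font.encoding_tag)


if __name__ == "__main__":
    japanese = TaggedFont("HypotheticalMincho", "euc_jp")
    print(bytes_for_printing(japanese, "あ".encode("utf-8")).hex())
```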

The whole reencoding stuff was added because many latin fonts contain
more glyphs than are present in the standard (iso-8859-1) encoding. So
instead of limiting the possible glyph space to whatever is the default
for the system locale, we can now always use ALL glyphs in a font.
Btw. that allows us to use the Symbol font for printing greek text (without
accents, of course) ;)

On the other hand, there probably is no such problem with eastern
encodings: they have to contain a huge number of glyphs anyway, and
there is not a set of slightly different encodings (like iso-8859-x
are for European languages). But I do not know much about that stuff.
So if reencoding eastern fonts is too complex or too memory hungry,
we can always preserve their encoding (i.e. build the GnomeFont using a
map identical to that encoding), and add small modifications to the
output routines.
Or we need to know how to build the unicode->glyph mapping, reencode the font
to a 16-bit font-specific composite, and output using the present methods.

I cannot estimate which way is simpler. But both should be doable.

Regards,
Lauris