Devanagari abstract glyphs



Hi Owen and friends,

   I've been playing with Devanagari. You'll find below the results of
my transcribing a passage from Daniels & Bright into Unicode, as well
as a handy little Perl script for converting hex Unicode code points
into UTF-8.

   So I think it will be reasonably straightforward to define an
abstract glyph repertoire for Devanagari. As I see it, there is one
major design issue.

   In Unicode, a consonant code point includes the inherent "a" vowel.
Thus, consonant clusters are represented as C + virama + C, etc.
Personally, I think the abstract glyphs should not include the
inherent "a". Thus, the abstract glyph for "k" should correspond to
U+0915 Ka U+094D Virama U+200D Zero width joiner. The Unicode
character U+0915 Ka should correspond to the abstract glyphs "k" and
"a".

   Since virtually all other vowels are drawn as diacritics or extra
glyphs added to the inherent "a" form, the "a" glyph will be present
at the end of almost all consonant clusters, even if the vowel is
something else.

   Thus, for reasonable display performance, ligatures between all the
consonants and the "a" will be essential. However, if any such
ligature is missing, or a ligature between two halanta consonants is
missing, the result should still be legible.
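
   To make the fallback behaviour concrete, here is a minimal Python
sketch of greedy longest-match ligature substitution over a sequence of
abstract glyphs. The LIGATURES table, its entries, and the function name
shape are hypothetical and only illustrate the idea; a real font would
supply its own table. Any run with no ligature simply falls through to
the individual glyphs.

```python
# Hypothetical ligature table: tuple of abstract glyphs -> one drawn shape.
LIGATURES = {
    ("n", "a"): "na",
    ("t", "a"): "ta",
    ("n", "t", "a"): "nta",
}

def shape(glyphs, ligatures=LIGATURES):
    """Replace the longest matching glyph runs with ligatures.

    Glyphs that start no known ligature are passed through unchanged,
    so a missing ligature degrades to the separate glyphs.
    """
    longest = max(map(len, ligatures), default=1)
    out, i = [], 0
    while i < len(glyphs):
        # Try the longest candidate run first, down to length 2.
        for n in range(min(longest, len(glyphs) - i), 1, -1):
            key = tuple(glyphs[i:i + n])
            if key in ligatures:
                out.append(ligatures[key])
                i += n
                break
        else:
            # No ligature starts here: emit the single glyph as-is.
            out.append(glyphs[i])
            i += 1
    return out

print(shape(["i", "n", "t", "a"]))  # ['i', 'nta']
print(shape(["d", "a"]))            # ['d', 'a']
```

   In the second call there is no "d" + "a" entry, so the two glyphs
are drawn separately, which is the legible-but-not-ideal case.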

   The virama abstract glyph is still present, but is only used when
the virama is to be explicitly displayed. In general, it does not
participate in ligation.

   For illustration, the first line of the text sample below would
correspond roughly to this sequence of abstract glyphs:

n a ai n a anusvara             (naina\.m)
i ch a n d a i n t a            (chindanti)
sh a s t r a aa i nn a          (shastr\-a\.ni)
n a ai n a anusvara             (naina\.m)
d a h a i t a                   (dahati)
p a aa v a k a visarga          (p\-avaka\.h)
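
   The mapping above can be sketched in Python roughly as follows.
This is only an illustration under some assumptions of mine: the tables
cover just the characters in these six words, the glyph names follow
the romanization used above, the "i" vowel sign is moved in front of
its consonant cluster (visual order), and an explicit virama is emitted
as the halanta forms followed by a "virama" glyph.

```python
VIRAMA = 0x094D

# Halanta glyph names for the consonants used in the six words above.
CONSONANTS = {
    0x0915: "k", 0x091B: "ch", 0x0923: "nn", 0x0924: "t", 0x0926: "d",
    0x0928: "n", 0x092A: "p", 0x0930: "r", 0x0935: "v", 0x0936: "sh",
    0x0938: "s", 0x0939: "h",
}
VOWEL_SIGNS = {0x093E: "aa", 0x093F: "i", 0x0948: "ai"}
PREBASE = {0x093F}  # signs drawn to the left of their cluster
MARKS = {0x0902: "anusvara", 0x0903: "visarga"}

def abstract_glyphs(codepoints):
    """Turn Unicode code points into a list of abstract glyph names."""
    cps = list(codepoints)
    out, cluster = [], []  # cluster: halanta consonants of the open syllable
    for i, cp in enumerate(cps):
        nxt = cps[i + 1] if i + 1 < len(cps) else None
        if cp in CONSONANTS:
            # A consonant not joined by a virama starts a new cluster,
            # so the previous one is closed with its inherent "a".
            if cluster and cps[i - 1] != VIRAMA:
                out += cluster + ["a"]
                cluster = []
            cluster.append(CONSONANTS[cp])
        elif cp == VIRAMA:
            if nxt not in CONSONANTS:
                # Explicit virama (assumption: halanta forms + virama glyph).
                out += cluster + ["virama"]
                cluster = []
        elif cp in VOWEL_SIGNS:
            body = cluster + ["a"]
            sign = [VOWEL_SIGNS[cp]]
            out += sign + body if cp in PREBASE else body + sign
            cluster = []
        else:
            if cluster:
                out += cluster + ["a"]
                cluster = []
            out.append(MARKS.get(cp, "U+%04X" % cp))
    if cluster:
        out += cluster + ["a"]
    return out

# chindanti: cha, i, na, virama, da, na, virama, ta, i
word = [0x091B, 0x093F, 0x0928, 0x094D, 0x0926, 0x0928, 0x094D, 0x0924, 0x093F]
print(" ".join(abstract_glyphs(word)))  # i ch a n d a i n t a
```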

   Thus, assembling the abstract glyph repertoire should be relatively
straightforward - it should be nearly analogous to Unicode, except
with the implicit "a" forms of the consonants replaced with the
corresponding halanta consonants. I haven't gone through the code
chart carefully to find exceptions (or considered languages other than
Sanskrit at this point), but I think it's a reasonable and good
starting point.

   Your thoughts are welcome.

Raph


Sample of Sanskrit, from Daniels & Bright, section 31.

U+0928 48 28 02 0020
091B 3F 28 4D 26 28 4D 24 3F 0020
0936 38 4D 24 4D 30 3E 23 3F 0020
0928 48 28 02 0020
0926 39 24 3F 0020
092A 3E 35 15 03 64 0020
0928 0020
091A 48 28 02 0020
0915 4D 32 47 26 2F 28 4D 24 4D 2F 3E 2A 4B 0020
0928 0020
0936 4B 37 2F 24 3F 0020
092E 3E 30 41 24 03 65 0020
0905 1A 4D 1B 47 26 4D 2F 4B 0020
093D 2F 2E 26 3E 39 4D 2F 4B
093D 2F 2E 15 4D 32 47 26 4D 2F 4B 0020
093D 36 4B 37 4D 2F 0020
090F 35 0020
091A 64 0020
0928 3F 24 4D 2F 03 0020
0938 30 4D 35 17 24 03 0020
0938 4D 25 3E 23 41 30 1A 32 4B 0020
093D 2F 02 0020
0938 28 3E 24 28 03 65

`Weapons do not cut it [the soul], fire does not burn it.
Waters do not wet it, wind does not dry it.
It cannot be cut, or burned, or wetted, and cannot be dried.
It is eternal, all-pervading, fixed, immovable, primeval.'
 -- Bhagavadg\-it\-a 2:23-24.

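#!/usr/bin/perl
# Convert lines of hex code points to UTF-8.  A block starts at a line
# beginning with "U+" and ends at the first line that is not pure hex.
# Codes shorter than four digits take their leading digits from the
# previous code, so "48" after "0928" means 0948.
use strict;
use warnings;

my $go = 0;          # true while inside a block of hex codes
my $lastcode = '';   # previous full four-digit code
while (<>) {
    chomp;
    my $line;
    if (/^U\+([0-9a-fA-F\s]+)$/) {
        $go = 1;
        $line = $1;
    } elsif ($go && /^\s*[0-9a-fA-F][0-9a-fA-F\s]*$/) {
        # Only hex digits and whitespace; prose lines end the block.
        $line = $_;
    } else {
        $go = 0;
    }
    next unless $go;
    foreach my $code (split ' ', $line) {
        my $fullcode = substr($lastcode, 0, 4 - length($code)) . $code;
        my $unicode = hex $fullcode;
        if ($unicode < 0x80) {
            # One byte: 0xxxxxxx
            print pack 'C', $unicode;
        } elsif ($unicode < 0x800) {
            # Two bytes: 110xxxxx 10xxxxxx
            print pack 'CC', 0xC0 | ($unicode >> 6), 0x80 | ($unicode & 0x3f);
        } elsif ($unicode < 0x10000) {
            # Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
            print pack 'CCC',
                0xE0 | ($unicode >> 12),
                0x80 | (($unicode >> 6) & 0x3f),
                0x80 | ($unicode & 0x3f);
        }
        $lastcode = $fullcode;
    }
}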
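
   As a cross-check on the script's bit-packing, here is a small Python
sketch that expands the same shorthand and compares a hand-rolled
encoder against Python's built-in UTF-8 codec. The function names
expand and utf8_by_hand are mine, and unlike the Perl script this
version handles one line at a time rather than carrying the previous
code across lines.

```python
def expand(line, last="0000"):
    """Expand shorthand hex codes into integers.

    A code shorter than four digits borrows its leading digits from
    the previous full code, as in the sample data.
    """
    full = []
    for code in line.replace("U+", "").split():
        last = last[: 4 - len(code)] + code
        full.append(int(last, 16))
    return full

def utf8_by_hand(cp):
    """Encode one code point below U+10000 using explicit bit-packing."""
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    return bytes([
        0xE0 | (cp >> 12),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])

cps = expand("U+0928 48 28 02 0020")
print(["%04X" % cp for cp in cps])  # ['0928', '0948', '0928', '0902', '0020']

# The hand-rolled bytes should agree with the built-in encoder.
for cp in cps:
    assert utf8_by_hand(cp) == chr(cp).encode("utf-8")

print(utf8_by_hand(0x0928).hex())   # e0a4a8
```

   For U+0928 the three bytes work out as 0xE0 | 0x0 = E0,
0x80 | 0x24 = A4 and 0x80 | 0x28 = A8.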


