Re: [gnome-cy] po file tools: msgconv



Hi Alan

OK, I now owe you two beers for making me read up on more stuff, but don't 
push it - there's no way you're getting more than three!  Sorry to keep on at 
this - you probably have better things to do.  Feel free to tell me to clear 
off if you like!

On Friday 04 April 2003 3:51 pm, Alan Cox wrote:
>> So presumably the revised files I sent you earlier were lossy, then?  
> Hard to tell because the originals were a little mangled

Anything specific, mayhap, perchance?

> > (2) read each file through Kartouche to give a MySQL table
> You want to turn it UTF8 first, or remember the string language in the
> table

But if I set my PC (and Kartouche - see below) to UTF-8, this should happen 
automatically, no?  That is, the encoding will be UTF-8 throughout, so there 
will be no danger of information loss along the way?

Setting SuSE8.1 to UTF-8:
- install package glibc-i18ndata (not installed by default) to get locales and 
charmaps;
- (as root) localedef -i cy_GB -f UTF-8 cy_GB.utf8 (stored in /usr/lib/locale)
- (as root) pico /etc/sysconfig/language; amend RC_LANG="cy_GB.utf8"; run 
SuSEconfig; reboot for good measure
- locale charmap gives UTF-8

BUT - major woe!  Under 8859-1, I could use Shift+Right Ctrl to compose ô from 
sequential o and ^.  Under UTF-8 I can't.  Is there something I'm missing 
that needs to be done to re-enable this?  /usr/X11/lib/X11/locale/en_US.UTF-8 
refers to a deadkey.
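
(A guess on my part, entirely untested: the compose sequences for a locale seem 
to come from whichever Compose file /usr/X11/lib/X11/locale/compose.dir maps 
that locale to, so it may simply be that cy_GB.utf8 isn't yet mapped to the 
UTF-8 Compose file.  The entries themselves look like

  <Multi_key> <o> <asciicircum>	: "ô"	ocircumflex

so if the mapping is there, Compose + o + ^ ought to come back.)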

> > (3) upload that to the Web and present it in a browser interface
> And generate the right Character set header (right now you dont)

OK - now I get PHP to send a UTF-8 header.  Ctrl+I in Moz shows the encoding 
as UTF-8.  I've also redone the suggestion page preface and the Remember! 
page to display the characters properly in this encoding.
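
(For the record, the relevant line is just something like this, right at the 
top of the shared header include, before any output at all goes to the browser:

  <?php
  // must be sent before any HTML or even stray whitespace is output
  header('Content-Type: text/html; charset=UTF-8');
  ?>

- file and include names aside, that is all there is to it.)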

Bizarrely, PHP sending the header works fine for the Kartouche dir, but not 
for the Kyfieithu dir above it, even though they both have an identical 
header file structure.  To get UTF-8 to appear in Kyfieithu, I also had to 
use the less appealing HTTP-EQUIV metatag - I spent a couple of hours 
investigating various possibilities, but I have as yet no explanation for 
this weirdness.  
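
(The metatag in question, for the record, is just

  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

and one shot-in-the-dark guess at the weirdness: header() can only take effect 
before any output has been sent, so if the Kyfieithu pages emit so much as a 
blank line before the shared header file runs, the charset header would be 
lost there but not in Kartouche.)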

Various strings in the files now display oddly, so I'll change those as I meet 
them, unless I come up with a cunning plan.  

> > (4) user inputs suggestions to the table via a browser
> Fine. Make the form UTF8

From here (AJ Flavell's very interesting site):
http://ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html
and here:
http://www.unicode.org/faq/unicode_web.html
I read that "the browser will return the data in the encoding of the original 
form".  So if I have sent the UTF-8 header, there should be nothing more to 
do - right?  This certainly seems to be the case - entering data into the 
UTF-8 page from an 8859-1 Linux PC and then a Windows PC kept the circumflexes 
intact, and didn't turn the characters into ? or worse.  But is this just an 
illusion of success?

I have also looked at wiring ACCEPT-CHARSET into the form:
http://www.w3.org/TR/html401/interact/forms.html#h-17.13.4
but that may not be necessary if the above is OK.
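
If it does turn out to be needed, it would only mean something like this (form 
and field names here are placeholders, not the real Kartouche ones):

  <form method="post" action="suggest.php" accept-charset="UTF-8">
    <input type="text" name="msgstr">
    <input type="submit" value="Send">
  </form>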

> > (6) on completion read each table through Kartouche to give a po file
> > (7) run msgconv on the file to convert it to UTF-8
> You need to keep it in UTF-8

Above refers - if the PC is set to UTF-8, the issue should not arise, no?

> > Also at (2), the table currently stores msgids and msgstrs in a text
> > field, but this can be changed to a BLOB format easily, which is the only
> > way MySQL can currently store UTF-8.  This would then ensure no loss in
> > the db store.
> UTF-8 is just a byte stream, there are no embedded \0 strings so mysql
> may not get the comparisons right always but its ok storing it. See the
> mysql manuals btw

So you're saying that a BLOB field is *not* necessary?  That would certainly 
have other benefits.  Experiments on Win and Lin (with 8859-1 encoding) 
putting circumflexed vowels into the db do seem to show no difference 
whatever the field type.  But again, is this just an illusion of success?
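
For anyone curious, the sort of thing I mean is roughly this (connection 
details, table and column names all invented for the example):

  <?php
  $db = mysql_connect('localhost', 'user', 'password');   // placeholders
  mysql_select_db('kartouche', $db);

  // The value arrives from the UTF-8 form, so it is already a UTF-8 byte
  // string; MySQL just stores those bytes as-is whether the column is TEXT
  // or BLOB - the difference is only in how comparisons and sorting treat it.
  $msgstr = $_POST['msgstr'];
  $sql = sprintf("UPDATE translations SET msgstr = '%s' WHERE id = %d",
                 mysql_escape_string($msgstr), (int) $_POST['id']);
  mysql_query($sql, $db);
  ?>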

> Output Character Set of UTF-8 and the web browser will interpret it
> right
> Yes. IE knows about UTF8 even if the underlying OS doesn't. As does
> netscape etc. You might end up seeing Ty not T^y, that is all

OK - above refers.

> > Presumably (7) can then still be used to convert any lingering 8859
> > encodings in the file (eg input from a browser on a PC using the 8859
> > encoding) into the proper UTF-8 ones.
> Yes

OK, though presumably if everything is in UTF-8, msgconv is redundant (except 
for those strings which were previously entered in 8859-1 encoding, and which 
need to be converted to UTF-8?).
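
That is, a final belt-and-braces pass along the lines of

  msgconv --to-code=UTF-8 cy.po -o cy.utf8.po

(file names made up), which is harmless for a file that is already UTF-8.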

> Basically the rule is
> UTF8 -> anything 8bit is lossy
> anything 8bit to UTF8 is not lossy

OK - it does make sense to keep the workflow in one format all the way 
through.  (ŵ and ŷ being the obvious example for Welsh: they exist in UTF-8 
but not in 8859-1, so a UTF-8 -> 8859-1 conversion has nowhere to put them.)

With your much greater knowledge of this area, do you think the revised system 
should cover all bases?  (In theory, I mean - obviously from a practical 
viewpoint the state of the actual output files may determine the need for 
further work.)

> There is a whole separate story about upper/lower case converting that
> may bite you with other languages (notably Turkish) but are safe on
> Welsh/English

Welsh and English are enough for the moment, thanks :-)  I will, however, wish 
to come back to a detailed discussion of Turkish some time in 2006 ....

Best wishes

Kevin





