Re: [gnome-cy] po file tools: msgconv
- From: Kevin Donnelly <kevin@dotmon.com>
- To: Alan Cox <alan@lxorguk.ukuu.org.uk>
- Cc: Gnome Welsh List <gnome-cy@www.linux.org.uk>
- Subject: Re: [gnome-cy] po file tools: msgconv
- Date: Mon, 7 Apr 2003 12:19:30 +0000
Hi Alan
OK, I now owe you two beers for making me read up on more stuff, but don't
push it - there's no way you're getting more than three! Sorry to keep on at
this - you probably have better things to do. Feel free to tell me to clear
off if you like!
On Friday 04 April 2003 3:51 pm, Alan Cox wrote:
>> So presumably the revised files I sent you earlier were lossy, then?
> Hard to tell because the originals were a little mangled
Anything specific, mayhap, perchance?
> > (2) read each file through Kartouche to give a MySQL table
> You want to turn it UTF8 first, or remember the string language in the
> table
But if I set my PC (and Kartouche - see below) to UTF-8, this should happen
automatically, no? That is, the encoding will always be UTF-8, so there
will be no danger of information loss anywhere in the process?
Setting SuSE 8.1 to UTF-8:
- install the glibc-i18ndata package (not installed by default) to get the
locales and charmaps;
- (as root) localedef -i cy_GB -f UTF-8 cy_GB.utf8 (the result is stored in
/usr/lib/locale);
- (as root) edit /etc/sysconfig/language (pico will do) and set
RC_LANG="cy_GB.utf8"; run SuSEconfig; reboot for good measure;
- locale charmap now gives UTF-8.
BUT - major woe! Under 8859-1, I could use Shift+Right Ctrl to compose ô from
sequential o and ^. Under UTF-8 I can't. Is there something I'm missing
that needs to be done to re-enable this? /usr/X11/lib/X11/locale/en_US.UTF-8
refers to a deadkey.
> > (3) upload that to the Web and present it in a browser interface
> And generate the right Character set header (right now you don't)
OK - now I get PHP to send a UTF-8 header. Ctrl+I in Moz shows the encoding
as UTF-8. I've also redone the suggestion page preface and the Remember!
page to display the characters properly in this encoding.
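For the record, the PHP side of it is only a one-liner at the top of the page,
before anything else is output - roughly this (just a sketch, not the actual
Kartouche code):

  <?php
  // Must be called before any HTML is output, or PHP can't set the header
  header('Content-Type: text/html; charset=utf-8');
  ?>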
Bizarrely, PHP sending the header works fine for the Kartouche dir, but not
for the Kyfieithu dir above it, even though they both have an identical
header file structure. To get UTF-8 to appear in Kyfieithu, I also had to
use the less appealing HTTP-EQUIV metatag - I spent a couple of hours
investigating various possibilities, but I have as yet no explanation for
this weirdness.
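(For reference, the metatag is just the usual
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"> in the
page's <head> - nothing exotic, which makes the difference between the two
dirs all the odder.)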
Various strings in the files now display oddly, so I'll change those as I meet
them, unless I come up with a cunning plan.
> > (4) user inputs suggestions to the table via a browser
> Fine. Make the form UTF8
From here (AJ Flavell's very interesting site):
http://ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html
and here:
http://www.unicode.org/faq/unicode_web.html
I read that "the browser will return the data in the encoding of the original
form". So if I have sent the UTF-8 header, there should be nothing more to
do - right? This certainly seems to be the case - entering data into the
UTF-8 page from an 8859-1 Linux PC and then a Windows PC kept the circumflexes
intact, and didn't turn the characters into ? or worse. But is this just an
illusion of success?
I have also looked at wiring ACCEPT-CHARSET into the form:
http://www.w3.org/TR/html401/interact/forms.html#h-17.13.4
but that may not be necessary if the above is OK.
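For anyone else wrestling with this, the belt-and-braces version of the form
would look roughly like this - only a sketch, and the script and field names
here are invented, not Kartouche's real ones:

  <?php
  // Send the page as UTF-8, so the browser should return the form data as UTF-8
  header('Content-Type: text/html; charset=utf-8');
  ?>
  <!-- accept-charset is the belt-and-braces part; names below are made up -->
  <form method="post" action="suggest.php" accept-charset="UTF-8">
    <input type="text" name="msgstr">
    <input type="submit" value="Send suggestion">
  </form>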
> > (6) on completion read each table through Kartouche to give a po file
> > (7) run msgconv on the file to convert it to UTF-8
> You need to keep it in UTF-8
Above refers - if the PC is set to UTF-8, the issue should not arise, no?
> > Also at (2), the table currently stores msgids and msgstrs in a text
> > field, but this can be changed to a BLOB format easily, which is the only
> > way MySQL can currently store UTF-8. This would then ensure no loss in
> > the db store.
> UTF-8 is just a byte stream, there are no embedded \0 strings so mysql
> may not get the comparisons right always but its ok storing it. See the
> mysql manuals btw
So you're saying that a BLOB field is *not* necessary? That would certainly
have other benefits. Experiments on Windows and Linux (both with 8859-1
encoding), putting circumflexed vowels into the db, seem to show no difference
whatever the field type. But again, is this just an illusion of success?
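For what it's worth, the test was along these lines (a rough sketch using the
mysql_* functions; the connection details and the table and column names are
invented, not the real Kartouche schema):

  <?php
  // Invented connection details and schema - msgstr here is a plain TEXT column
  $link = mysql_connect('localhost', 'user', 'password');
  mysql_select_db('kartouche_test');

  // UTF-8 is just a byte stream, so TEXT keeps the bytes as they are;
  // at worst MySQL's byte-wise comparisons and sorting won't be character-aware
  $msgstr = "Tŷ";   // UTF-8 encoded
  mysql_query("INSERT INTO test_strings (msgid, msgstr) VALUES ('House', '"
              . mysql_escape_string($msgstr) . "')");

  // Read it back and send it out with the UTF-8 header - the same bytes come out
  header('Content-Type: text/html; charset=utf-8');
  $res = mysql_query("SELECT msgstr FROM test_strings WHERE msgid = 'House'");
  $row = mysql_fetch_assoc($res);
  echo $row['msgstr'];
  ?>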
> Output Character Set of UTF-8 and the web browser will interpret it
> right
> Yes. IE knows about UTF8 even if the underlying OS doesn't. As does
> netscape etc. You might end up seeing Ty not T^y, that is all
OK - above refers.
> > Presumably (7) can then still be used to convert any lingering 8859
> > encodings in the file (eg input from a browser on a PC using the 8859
> > encoding) into the proper UTF-8 ones.
> Yes
OK, though presumably if everything is in UTF-8, msgconv is redundant (except
for those strings which were previously entered in 8859-1 encoding, and which
need to be converted to UTF-8?).
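(For any stray 8859-1 strings, the conversion itself would just be something
like

  msgconv --to-code=UTF-8 cy.po -o cy-utf8.po

with invented filenames - msgconv works from the charset declared in the po
file's own Content-Type header, so that needs to say ISO-8859-1 for the old
files.)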
> Basically the rule is
> UTF8 -> anything 8bit is lossy
> anything 8bit to UTF8 is not lossy
OK - it does make sense to keep the workflow in one format all the way
through.
With your much greater knowledge of this area, do you think the revised system
should cover all bases? (In theory, I mean - obviously from a practical
viewpoint the state of the actual output files may determine the need for
further work.)
> There is a whole separate story about upper/lower case converting that
> may bite you with other languages (notably Turkish) but is safe on
> Welsh/English
Welsh and English are enough for the moment, thanks :-) I will, however, wish
to come back to a detailed discussion of Turkish some time in 2006 ....
Best wishes
Kevin
_______________________________________________
gnome-cy mailing list
gnome-cy@pengwyn.linux.org.uk
http://pengwyn.linux.org.uk/mailman/listinfo/gnome-cy