Re: [EWMH] _NET_WM_WINDOW_TYPE_AUXILIARY



On 2007-10-18, Russell Shaw <rjshaw netspace net au> wrote:
> I find it hard to see those problems because i rarely handle non-english
> text.

Which problems? The ones with present abstraction implementations
(wchar_t, locale), or the general unknown encoding fuckup?

> In the general-purpose editing applications i've made (like a word processor),
> any non-english text is passed out to a "black box" unicode layout processor
> plugin for things like paragraph formatting, and i can make it UTF-8 or UTF-32
> or whatever data encoding is convenient. I see "all UTF-8" as only applying
> between completely separate applications on the pc.

It applies to any software components trying to communicate. Things
like D-Bus (iirc) and Cairo, in their monoculturism, require the use of
UTF-8 in their APIs. Those are the ones I have studied and become
disappointed with. There are probably many others (everything
GNOME-related?) as well.

> I've done hardly any non-english processing, but iirc, UTF-8 files are supposed
> to start with a magic number. If all text files were UTF-8, the magic number
> wouldn't be needed. I'm probably missing something you mean.

Text files on *nix do not tend to carry any information as to their
character encoding, or their type in any other way either. Depending on
the application, they're more or less arbitrarily assumed to be ASCII
plus random bytes with the high bit set, the locale encoding, or these
days UTF-8. On Windows AFAIK they do have some kind of Unicode markers,
and maybe there's some standard about that, but any random text file on
*nix tends to be in the locale encoding, without indicators, if it was
created on that system (by that user) while the same locale was in use.
But files from elsewhere can use a different encoding, and some formats
stored in plain text files require a particular encoding to be used
without indicating it anywhere in the file (e.g. YAML).
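
To illustrate the kind of guessing described above, here is a minimal
Python sketch that first looks for a Unicode byte order mark (the "magic
number" from the quoted text) and otherwise falls back to trying UTF-8
and then a legacy encoding. The names sniff_bom and decode_with_fallback
are hypothetical, and the latin-1 fallback is only an assumption standing
in for "the locale encoding"; a real application would pick its own.

```python
import codecs

# Longest BOMs first: the UTF-32 LE BOM starts with the UTF-16 LE BOM bytes.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return (encoding, bom_length) if data starts with a BOM, else (None, 0)."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def decode_with_fallback(data: bytes, fallback: str = "latin-1") -> str:
    """Decode using a BOM if present, else try UTF-8, else the fallback.

    The fallback encoding is an assumption for this sketch, not a standard.
    """
    encoding, skip = sniff_bom(data)
    if encoding is not None:
        return data[skip:].decode(encoding)
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode(fallback)
```

Without a marker the result is still a guess: bytes that happen to be
valid UTF-8 are read as UTF-8 even if they were written in another
encoding.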

> I find it hard to see how all kinds of config files in /etc could be made
> non 7-bit ascii without major parsing pain. To me, config file tokens should be 
> in 7-bit latin because the content is more like program code that only 
> programmers should see, and any non-english configuration should be done through
> an i18n-ized gui imo (not having thought of anything better).

A case could probably be made for config file tokens to be 7-bit ASCII.
But the files contain data strings as well, including things like 
translations of menu items and such. Their encoding can be 
application-specific, but wouldn't it be simpler for the file to
specify its encoding in a standard manner? Then arbitrary text editors
can use the right encoding (or convert to whatever encoding they please).

HTML/XML/etc. do, for example, tend to include a Content-Type or similar
encoding specification, but unfortunately few text editors understand
it (and the SGML/XML syntax generally sucks anyway and isn't suitable
for editing in text editors -- yet there's nothing better either -- and
could hence just as well be binary). Arbitrary plain text files could
include the same information in a more easily accessible format. One
rather hacky and ugly option might be to use, on the first line, the
-*- foo: bar; -*- syntax that some text editors already support.
Another, cleaner option could be based on storing MIME types in the
file system.
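
The first-line option can be sketched in a few lines of Python. This
assumes the Emacs-style convention in which the variable naming the
encoding is called "coding"; the function names and the UTF-8 default
are hypothetical choices for this example, not part of any standard.

```python
import re

# Matches the "-*- name: value; name: value -*-" block anywhere in a line.
_FILE_VARS = re.compile(r"-\*-(.*?)-\*-")


def parse_file_variables(first_line: str) -> dict:
    """Return the name/value pairs found in a -*- ... -*- block, if any."""
    match = _FILE_VARS.search(first_line)
    if match is None:
        return {}
    variables = {}
    for part in match.group(1).split(";"):
        name, sep, value = part.partition(":")
        if sep:  # skip empty parts and parts without a "name: value" shape
            variables[name.strip().lower()] = value.strip()
    return variables


def declared_encoding(data: bytes, default: str = "utf-8") -> str:
    """Return the encoding declared on the first line, or the default.

    The first line is read as ASCII so the declaration can be found
    before the real encoding of the file is known.
    """
    first_line = data.split(b"\n", 1)[0].decode("ascii", errors="replace")
    return parse_file_variables(first_line).get("coding", default)
```

For example, a config file whose first line is
"# -*- coding: iso-8859-15 -*-" would be reported as iso-8859-15, and a
file without such a line would get the default.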

...

But this is really drifting away from the topic of this thread
and perhaps even the whole list, and should perhaps be taken
elsewhere.

-- 
Tuomo


