Re: Handling Translations



Joakim Ziegler wrote:
> >> It seems to me that both these methods are somewhat complex, and, more
> >> importantly, slow.
> 
> > Do you have any data on that? I doubt the gettext solution introduces
> > any user-noticeable difference in speed. If in doubt, test it, compare
> > it, produce some data on performance, and then we can have a
> > discussion. Speculating isn't helping.
> 
> The speed hit from gettext is probably trivial. From reading an XML
> file, it's probably a lot bigger. Remember, a whole new file has to be
> read into memory, parsed by the reasonably large chunk of code an XML
> parser represents, and then you have to actually do stuff to merge the
> content with the template. This is most definitely a non-trivial speed
> hit on a machine that at times (around releases) handles 20 hits per
> second at peak.

I trust you on that. My main concern was that you called *both* gettext
and the XML solution slow and complex, which isn't true. Also, gettext
isn't complex compared with what would be needed to replace it with the
same (desperately needed) functionality.


> For gettext, my concern is more with complexity. See below.

I do not agree; see below.


> >> Now, if we think about the fact that if the template
> >> system works as it should, and actual PHP functions are externalized,
> >> then there won't be any common elements between the different language
> >> pages (apart from headers and footers to call templates.)
> 
> >> So why not just use different pages (PHP files) for different languages?
> 
> > Because it sucks. It's very much non-maintainable. How do you get
> > notification of changes? If random hacker X spots an error on the site
> > and commits a fix to cvs, how is translator Y[1-40] to know that spot Z
> > on page W changed, without diffing the entire site periodically,
> > manually inspecting all diffs, and on top of that trying to insert
> > corresponding changes in their translated pages? Translators want to
> > translate, not spend most of their time trying to track changes.
> 
> > gettext/xml-i18n-tools/PO format solves these fundamental problems. We
> > have discussed it repeatedly on this list, but you just continue to
> > ignore the problem.
> 
> Please do not make this out to be me trying to make life difficult for
> the translators. Why would you think I want to do that?

<RANT>
I'm not saying that you are purposely trying to make life difficult for
translators. It's just that every time we discuss translations, your
attitude that you know translation better than the translators annoys
me a lot. I keep spending my time explaining over and over why the PO
format is needed by translators, and I'm getting sick of it.
</RANT>


> Notification of changes could be done in many ways. Trivially, the page
> is changed if the datestamp of the master page (most likely the English
> one) is newer than the translated pages. There are also more advanced
> ways of tracking it, if we want more complexity.

And how do you mark what has changed? How do you merge translations, so
that the same original word occurring twice, three times, four times,
and so on, only needs to be translated once? How do you partially re-use
existing translations (and mark them for inspection) when new text that
is similar to an already translated one is added to the pages?

Any solution that extracts the page strings to PO format, and thereby
allows the gettext tools to be used on top of it, solves all of those
problems. These tools are essential to translators.

Gettext support is already built into PHP, so it's really your
replacement technology that adds complexity.
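
To make that concrete, here is a minimal sketch of how a page could use
the gettext support that PHP already ships with. The text domain
"webtranslations", the locale directory, the de_DE locale and the
example strings are all assumptions made up for the illustration:

	<?php
	// A minimal sketch, not actual site code. The domain name,
	// directory and locale below are invented for the example.
	setlocale(LC_ALL, "de_DE");
	bindtextdomain("webtranslations", "./locale");
	textdomain("webtranslations");

	// Every string marked with _() becomes one msgid in the PO
	// files; the same string used on ten pages is still only one
	// msgid, so it is translated exactly once.
	echo "<h1>" . _("Download") . "</h1>\n";
	echo "<p>" . _("GNOME is a free and easy-to-use desktop environment.") . "</p>\n";
	?>

The marked strings can then be extracted into a POT template (with
xgettext or a small extraction script), and msgmerge folds new and
changed strings into each existing translation, marking the changed
ones as fuzzy for inspection. That is exactly the change tracking and
partial re-use I'm asking about above.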

Moreover, gettext has been around for a long time, it's stable, and
there is a large number of translation tools based on its translation
source format, the PO format. Even if you implemented a replacement
technology that did everything gettext does, it would still be
incompatible with almost every translation tool made in the free
software world to this day.

Why is it so hard to see that reinventing the wheel helps no one?


> >> I believe it's pretty common to use foobar.en.html and so on.
> 
> > And that doesn't mean that we have to do an inferior and very much
> > broken solution like that when doing a brand new site. Listen to
> > translators for once, and help them help you, instead of ignoring them
> > and their plea for the proper translation interface on purpose.
> 
> Please stop erecting strawmen like this. Why do you think I'm "ignoring
> them and their plea for the proper translation interface on purpose"?

<RANT NUMBER="2">
It's just that I get the feeling that you know better every time we
discuss what is needed for translators.
</RANT>


> There are some problems I see with gettext, which I think were brought
> up before as well.

It was also explained why they weren't problems in most cases, and how
they are solved in other cases.


> They might not be unsurmountable problems, and if the
> translators want to deal with them, that will be ok. But I do not want
> to be in a situation where the translators end up with a solution that's
> difficult to work with, and we can't change it because it's too late,
> and the reason we ended up with this solution was that we didn't discuss
> the issues enough.

Fair enough. But trust me, translators are usually perfectly happy with
the PO format and would prefer it over any home-brewed replacement
translation system any day.
What causes trouble in almost all cases is not the technology, but the
policies of the authors when marking text for translation. Things like:

	1) What should be marked for translation?
	2) How large should the messages be?
	3) Don't use slang, dialects, unnecessary acronyms
	or otherwise bad (or not easily understandable) English.

I can give you recommendations for all of these:

	1) Basically everything. The exceptions are very few;
	basically only things that are clearly names, or where it
	is otherwise apparent that they should not be translated.
	Names that contain international characters should be marked
	for translation though, so that they can be encoded in the
	right character set. Also try to avoid including markup
	tags. In most cases this is not possible, but in some
	cases it is; avoid including them when possible.
	In any case, marking too much for translation is better
	than marking too little.

	2) A whole page is too much. Single words are always too
	little, unless they do not belong to a running text but
	rather are headings or link titles or otherwise standalone
	words, in which case they can be marked as they are. A small
	to medium-sized paragraph is ideal (see the sketch after
	this list).

	3) I think I don't need to explain this.
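
To make recommendations 1) and 2) concrete, here is a rough sketch of
how a page could mark its text. The page wording, the file name
screenshots.php and the surrounding markup are all invented for the
illustration, not taken from the actual site:

	<?php
	// Headings and link titles are standalone words: mark them as
	// they are.
	echo "<h2>" . _("Screenshots") . "</h2>\n";

	// One small paragraph is one message: enough context to
	// translate it well, small enough that a change or a typo is
	// easy to spot.
	$intro = _("GNOME is a free desktop environment, built by volunteers from all over the world.");
	echo "<p>" . $intro . "</p>\n";

	// Keep the markup outside the string whenever the sentence
	// allows it, so translators never have to edit HTML tags.
	echo "<p><a href=\"screenshots.php\">" . _("See the screenshots") . "</a></p>\n";
	?>

The translators never see any PHP or HTML here; they only see the plain
strings in the PO file.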



> The main problem can be summarized as such: *The nature of text in
> software is very different from the nature of text on webpages*.
> 
> Text in software consists of short, relatively independent strings. This
> is exactly what gettext is created to manage. When a string changes,
> you'll know, and you can translate that string again.
> 
> On the other hand, text on webpages is prose. It's long passages of
> text, and it's *highly interdependent*.

No, the weak spot of gettext is very, very short strings (lack of
context). It's ideal for short paragraphs (usually more than enough
context).
I also translate documentation (not for GNOME though), and I have yet to
see any documentation whose paragraphs are too long for translation.
Documentation writers know that long paragraphs are problematic for
readers, so the paragraphs happen to be just the right size for
translators too.
Also, the interdependency is not a problem; it is not a problem in
documentation, so I fail to see why it would be for web pages. See
below.


> So when you use gettext for web page text, you need to make a decision.
> You can either make each translatable string short (like one paragraph
> at the most, maybe less),

Yes, a short to medium-sized paragraph is ideal.


> and it'll be relatively manageable in the
> sense of what strings will be reported as changed. However, this will
> mean that the translator will lack a lot of context when doing
> translations,

No. The context from the language in a paragraph is usually more than
enough. In the rare cases where that's not true, it's easy to look up
(the PO format usually has references to where the original string
occurs). That's no different for software translation or documentation
translation, and I have a very hard time believing it would be
different for web pages.


> and given that translation to another language is never a
> 1:1 process, this can lead to clumsy prose and other problems (for a
> simple example, consider the problem of overusing a term or phrase in a
> span of text).

If the paragraphs are not single phrases, this is usually not a problem.
Translation is usually amazingly close to a 1:1 mapping if you look at
whole paragraphs. Single sentences may be reversed or rearranged
"internally" (and that happens quite often because of the writing rules
of different languages), but at the paragraph level the content is
hardly ever different (if it is, then the translation is a bad one).
The sentences almost always come in the original order, and in the
places where the original refers to "the banana" the translation does
too, and where it refers to "it" the translation does too. Overuse of
phrases is rarely a problem, unless it is a problem in the original.


> The other option is to use longer gettext translatable strings, maybe
> the whole body text of the page is one string.

Oh please, no. That's an insane amount of "context", usually hundreds
of times more than what is actually needed to translate correctly.


> This means that there
> will definitely be enough context for the translator to create a
> high-quality translation

Instead, it is hardly maintainable and increases the rate of errors
dramatically. It's far easier to spot a typo when working with a single
paragraph instead of a whole page at once. It's also far easier to spot
a mistranslation. Another common problem when messages are so big that
they span multiple paragraphs or a whole page is that it's easy to
accidentally "forget" whole paragraphs in the translation, and that's
actually much more difficult to spot than single typos.


> but now there's a different problem: The
> string that will be reported as changed is very long, so it can be hard
> to see exactly what changed (for instance if a typo was fixed or some
> other minor change was done), and also, there will be lots of markup in
> the translatable string. Stuff like paragraph breaks, table structures,
> etc., etc. And the translator will have to edit between this stuff
> without the benefit of having the HTML (or PHP) context of the whole
> document, so it'll be very easy to break stuff.

Yes, yes, yes. Marking a whole page or multiple paragraphs as a single
chunk for translation is so bad an idea that it's hard to list all the
reasons. If anyone on this list has had this idea, please forget it
immediately! ;)


> So there are definitely issues with using gettext. I don't understand
> why people have to be demonized for pointing them out.

What you don't seem to realize is that what you point out are issues
with the policy on what is to be marked for translation, and how; they
are not problems with the technology itself. You would have all the
problems you pointed out with any other method of translation, with the
difference that translators would have none of the features of the
gettext tools that they need.


Christian



