Re: PO-based Documentation Translation

From: Gudmund Areskoug <fta algonet se>
To: Tim Foster <Tim Foster Sun COM>, gnome-i18n gnome org
Subject: Re: PO-based Documentation Translation
Date: Fri, 03 Oct 2003 14:40:07 +0200

Hi,

Tim Foster wrote:

> Hi there,
> 
>>Yes, I see your point. It would really help in some cases, and there  
>>are major benefits. Still, translation of documentation requires (to  
>>me) a bit more freedom than just sentence-for-sentence (yes, I may  
>>choose to translate one sentence to two, or one to a 'blank').
> 
> Yes, this is where things get complex - you're talking about 1:m, n:m or
> m:1 matching, where one source sentence corresponds to many target
> sentences, many source sentences correspond to many target sentences or
> many source sentences correspond to 1 target sentence.

terminology hierarchy + context -> match priority.

> Again, I'd stress the advantages to sentence level segmentation from an
> ease-of-automation point of view : doing fuzzy searches on paragraphs is
> more computationally intensive, and is less likely to achieve high
> leverage unless you're really storing sentences behind the scenes and
> are jumping through hoops to convince translators that they're
> translating paragraphs !
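
To make the leverage point concrete, here is a minimal sketch of a sentence-level fuzzy lookup. It is not any particular TM tool's method: it uses Python's standard difflib as a stand-in similarity measure, and the function name `best_match`, the 0.75 threshold and the memory contents are all invented for illustration.

```python
from difflib import SequenceMatcher


def best_match(segment, memory, threshold=0.75):
    """Return (score, source, target) for the closest stored source
    segment at or above the threshold, or None if nothing qualifies."""
    best = None
    for source, target in memory.items():
        score = SequenceMatcher(None, segment, source).ratio()
        if score >= threshold and (best is None or score > best[0]):
            best = (score, source, target)
    return best


# Hypothetical sentence-level memory; the entries are made up.
sentence_tm = {
    "Click the Save button.": "Klicka på knappen Spara.",
    "Close the window.": "Stäng fönstret.",
}

# A slightly edited sentence still finds its stored neighbour.
hit = best_match("Click the Save button now.", sentence_tm)
if hit is not None:
    score, source, target = hit
    print(f"{score:.2f}  {source!r} -> {target!r}")
```

With whole paragraphs as keys, one changed sentence lowers the score for the entire paragraph and every comparison runs over a longer string, which is roughly the cost being described above.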
---snipping from various messages---
>> suppose we have our 4 sentence paragraph :
>> 
>> msgid "a,b,c,d"
>> 
>> It sounds like you're suggesting that we split these before doing a
>> lookup, and then merge them again - no problem, we separate out the
>> sentences and matches for a,b,c and d
>> 
>> giving us, perhaps the translated message file :
>> 
>> msgid "a,b,c,d"
>> msgstr "e,f,g"
>> 
>> - now, the trick is, in order to repopulate our database with the
>> translated messages, we need to find out which sentence in the source
>> matches which sentence in the target. Now, there's algorithms out there
>> [1] that can help with this, but most of the time they need to be
>> checked to make sure they produce the right output.
>> 
>> a,b --> e [two source sentences matching one translated sentence !]
>> c --> f
>> d --> g

Almost: let the translator decide whether to split and/or join
segments, as far as the format allows.

Then have all those segment pairs (a,b --> e, c --> f, d --> g and 
abcd --> efg) saved to TM as you go along, e.g. upon leaving one 
finished segment and going to the next and/or by keyboard shortcut, 
possibly saving first to a TM buffer, so that no unapproved segment 
pairs reach the main DB.

That way, the translator can choose what segment pairs are "good to
go", regardless of whether they were originally in one or many
segments, or of how many and what segments they will eventually end
up in.
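
A rough sketch of that buffer idea, assuming a plain in-memory store: the class `TranslationMemory` and its methods `stage`, `approve` and `discard` are hypothetical names, not part of gettext or any existing TM tool, and the letters are the placeholder sentences from the example above.

```python
class TranslationMemory:
    """Toy TM with a staging buffer in front of the main store."""

    def __init__(self):
        self.main = {}    # approved segment pairs only
        self.buffer = {}  # pairs awaiting the translator's approval

    def stage(self, sources, targets):
        """Stage one pair. Both sides are lists of sentences, so 1:1,
        n:1, 1:m and n:m pairs are all stored the same way."""
        key = " ".join(sources)
        self.buffer[key] = " ".join(targets)
        return key

    def approve(self, key):
        """Move a staged pair into the main store."""
        self.main[key] = self.buffer.pop(key)

    def discard(self, key):
        """Drop a staged pair without touching the main store."""
        self.buffer.pop(key, None)


tm = TranslationMemory()
ab = tm.stage(["a", "b"], ["e"])                      # two sentences joined
c = tm.stage(["c"], ["f"])
d = tm.stage(["d"], ["g"])
para = tm.stage(["a", "b", "c", "d"], ["e", "f", "g"])  # whole paragraph

tm.approve(ab)
tm.approve(para)
tm.discard(c)  # translator is not happy with this one yet

print(tm.main)    # approved pairs
print(tm.buffer)  # still pending: only the "d" pair
```

Nothing reaches `main` except through `approve`, so the translator's decision is the only path into the main DB.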

> Funny you should mention that - we're working on such a solution. (it's
> quite complex though)
> 
> One of the main problems with uptake of translation memory that we've
> encountered internally, and is a major thorn in the side of TM systems
> is where translators insist on rewriting documents. Small stylistic
> changes can result in the translation (t) not really being a good match
> for sentence (b) except in the context of other sentences (a) and (c).
> 
> This is really something to watch out for - that translations are really
> good quality translations, and not a rewriting of the original sentence,
> perhaps with additional explanation for the target language audience.

That brings up two QA issues that (IMveryHO ;) are rather well
handled in the Swedish team:
1. A very large degree of peer review - sadly very often absent in
pro translation (copyright, secrecy etc...). This is facilitated by
the simple plain text format of the po files, since they can be
reviewed straight in an e-mail.
2. If there's a problem with the localizability (does this word
exist?), a bug report or some (hopefully kind) notification is sent
to the author.

> This sort of brings us into another realm of translation technology that
> no one's mentioned before - translatability & controlled language : where
> the writers of the source document attempt to write to a set of rules to
> make translation easier, you could think of it as i18n guidelines for
> tech-writers. 
> 
> I have to say, I'm not an expert in this field, but it might be worth
> looking into if you haven't already...

I have :). Pity I'm not a programmer... (some claim the world's
probably lucky that way ;).

AFAIK, guidelines already exist here; the question is rather whether 
they're in a place where they get the attention of previously 
non-i18n-aware developers. I've met quite a few open source developers 
who have never heard of gettext or don't know it well enough to use 
it. I don't know how it is with documentation.

Next issue, controlled language. Most companies I've done work for 
who've implemented such things have ultimately dropped them, because 
they too often were imposed on the technical writers from above by 
some external consultant (non-real-world ;) ).

Where it has worked fairly well is where the system was very mild, 
adapting to the writer, or where it was completely dictatorial (write 
this way or get fired, more or less), or where it came from the 
writers themselves or from their wishes (e.g. don't stunt my 
creativity and freedom, don't mess with my "flow" feeling).

The last scenario is what could grow here, and is IMHO the best. 
Possibly, the focus should rather be on helping an author not to mix 
terms up, and aid in registering what new elements get introduced.

Don't forget someone should eventually want to read the stuff - with 
pleasure, if possible.

>>Certainly, I believe it would work great in most cases, but there is  
>>simply a small number of cases where it would not work at all. And  
>>that's what bothers me. Yes, benefits are huge, but what to do when you  
>>get stuck in one of those situations?
> 
> Well, so far, we haven't really found it to be a problem : at the end of
> the day, when the file has been converted back to its source format,
> you can always post-edit it : nothing's preventing you from doing that.
> My feeling is that this doesn't come up in technical documentation as
> much as it could in a more prosaic type of text, but I do understand the
> problem and we are working on it.

:)... there have been comparative surveys and analyses of 
technical texts and belletristic prose. Funny thing is, it often 
turned out that prose was more repetitive and uniform in its ways of 
expression than e.g. many manuals.

All this said, I've started wondering if this is the right forum for 
these discussions (perhaps it is?), or if there is a more 
appropriate one (that's perhaps still not tied to a specific tool)?

BR,
Gudmund