[Fwd: Re: Translation tools for documentation]

From: "Rodolfo M. Raya" <rmraya maxprograms com>
To: gnome-i18n gnome org
Subject: [Fwd: Re: Translation tools for documentation]
Date: Mon, 29 Mar 2004 19:11:53 -0300
Hi all,

I sent my original message to Danilo by mistake. It was supposed to go
to the list.

Here is Danilo's reply.

Regards,
Rodolfo

-----Forwarded Message-----
From: Danilo Segan <danilo@gnome.org>
To: Rodolfo M. Raya <rmraya@maxprograms.com>
Subject: Re: Translation tools for documentation
Date: Mon, 29 Mar 2004 23:46:52 +0200

Hi Rodolfo,

Today at 19:38, Rodolfo M. Raya wrote:

>> Yeah, I missed on that issue.  Definitely something to consider right
>> away.  I'll see if I can get at least a simple support for
>> sentence-splitting right now (it should be pretty easy, but would
>> probably fail for some not very common cases), though I'm not sure if
>> I'd make it the default.  I'm sure my result won't be on par with the
>> quality of Sun's tools, but it will surely help migration provided one
>> uses it.
>
> Be careful. Segmentation (that's the name of sentence splitting process)
> depends heavily on the language. There are some details that you must
> consider to avoid splitting sentences at the wrong place. Consider the
> following example in Spanish:

Since I'm talking about splitting original strings, I'm only
talking about splitting English sentences.  We haven't yet started
accepting anything else as "original" strings, and many other tools
would fail us even if we tried.

I'm not at all sure it would be easy to split sentences for Serbian
either, and that's my mother tongue: ordinal numbers are written with
a dot (to say 21st, we write "21."), and the same goes for years,
dates, etc.

Also, I was talking only in the sense of increasing the chance of
making translations reusable once we do switch from paragraph-oriented
to sentence-level translations.  Obviously, a complete algorithm that
would do the job flawlessly is pretty hard to achieve, but getting
*most* of the sentences done would be good enough for such a major
philosophy change, imho.
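
A minimal Python sketch of the kind of simple English splitter
discussed here.  The function name, the regular expression and the
abbreviation list are all assumptions made for illustration; this is
not taken from Sun's tools or any existing program, and it will fail
on the less common cases mentioned above.

```python
import re

# Hypothetical, deliberately short list of dotted abbreviations that
# do not end a sentence; a real tool would need a much longer one.
ABBREVIATIONS = {"e.g.", "i.e.", "etc.", "mr.", "mrs.", "dr.", "vs."}

# Candidate boundary: '.', '!' or '?' followed by whitespace and an
# uppercase ASCII letter.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def split_sentences(paragraph):
    """Split an English paragraph into sentences (heuristic only)."""
    text = paragraph.strip()
    if not text:
        return []
    sentences = []
    for piece in _BOUNDARY.split(text):
        if sentences and sentences[-1].split()[-1].lower() in ABBREVIATIONS:
            # The previous piece ended in a known abbreviation, so the
            # split was wrong: glue this piece back on.
            sentences[-1] += " " + piece
        else:
            sentences.append(piece)
    return sentences
```

Requiring an uppercase letter after the punctuation is a cheap way to
avoid splitting at things like "version 2.6 is out", at the cost of
missing sentences that start with a digit or a quote.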

>> At the same time, I'd create a simple program to merge these
>> sentence-based PO translations into paragraph-oriented ones (that
>> direction should be easy; if the other way around was easy, there
>> would be no need for any of this discussion :).  
>
> I already have a Java-based tool that converts PO to XLIFF and
> vice versa. It does not support plural forms (not needed in Spanish
> where I use it). I can donate the source to GNOME.

For documentation, there's no need for plural-forms (there're no
variables which are unknown at the time of translation), so that's
not a big issue.  I'm currently not that fond of Java, so making use
of that is probably the job for someone else ;)

Of course, I'm sure many would appreciate you donating it, myself
included. :)

[Note that I'm not stating an official Gnome position here; I'm just a
member of the Gnome Foundation, of which there are 300 other members,
so I cannot represent the opinion of the entire community]
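
For reference, the merge direction quoted earlier (sentence-based PO
translations back into paragraph-oriented ones) could look roughly
like the Python sketch below.  It assumes the paragraph has already
been split into sentences and that the sentence-level PO file has been
loaded into a plain dict of msgid to msgstr; both the function name
and that data layout are hypothetical.

```python
def merge_paragraph(sentences, sentence_translations):
    """Join per-sentence translations into one paragraph translation.

    `sentences` is the list of original sentences of one paragraph and
    `sentence_translations` maps each original sentence (msgid) to its
    translation (msgstr).  Returns None when any sentence is missing or
    untranslated, so the paragraph can be left for a human.
    """
    translated = []
    for sentence in sentences:
        target = sentence_translations.get(sentence)
        if not target:
            return None
        translated.append(target)
    return " ".join(translated)
```

Joining with a single space is itself an assumption; it ignores any
inline markup or whitespace conventions of the target language.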

>> Still, since I
>> cannot imagine how you are supporting sentence-reordering with
>> sentence-based translations, I suppose you've discovered that it's
>> very rarely needed, so I guess it would be simple to extract the
>> sentences from the paragraphs in the same way as you do now, even
>> from translations, and map them by order (i.e. 1st sentence from
>> original is mapped to 1st sentence in translation).
>
> My experience working with XLIFF documents tells me that sentence order
> is fundamental.

Indeed, and that's what I was pointing at.  With paragraph-based
translations, one can reorder sentences.  With granularity brought
down to sentences, you cannot reorder them.  Since Sun translators
didn't have problems with that, I guess that translations rarely need
to reorder them, and we can assume that sentences are in most cases
in the same order.  What I was actually trying to explain is that,
even when converting from paragraph-level to sentence-level
translations, it would be possible to write a program that handles
the most common cases, easing the job a lot for translators.
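
The order-based mapping described here can be sketched in Python as
follows.  The splitter is intentionally naive and is applied to the
translation too, which is an assumption that will not hold for every
target language; both function names are made up for this example.
When the sentence counts differ, the paragraph is simply handed back
to the translator.

```python
import re


def naive_split(paragraph):
    """Very naive splitter: break after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [part for part in parts if part]


def align_by_order(original, translation):
    """Map the 1st original sentence to the 1st translated one, etc.

    Returns a list of (original, translated) sentence pairs, or None
    when the counts differ and the paragraph needs a human check.
    """
    src = naive_split(original)
    dst = naive_split(translation)
    if len(src) != len(dst):
        return None
    return list(zip(src, dst))
```

For instance, align_by_order("Click OK. The dialog closes.",
"Haga clic en Aceptar. El diálogo se cierra.") pairs the two sentences
in order, while a translation that merged them into one sentence
yields None.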

> Translators need the ability to convert documents to XLIFF format using
> paragraph segmentation or sentence segmentation. Both options should be
> available. 

I was mainly talking about PO files here.  I do have interest in
XLIFF, but not an immediate interest.

> When converting from XML to XLIFF you can use element segmentation AND
> sentence segmentation.
>
> Formats like DocBook allow insertion of inline elements as tags in the
> paragraph that the translator can reorder at will.

I don't have a problem with that currently.  It seems to me that
Sun's current tools do have a problem with it, since something that
can be inlined in a sentence (e.g. an entire <itemizedlist>) will look
ugly.  Imagine this DocBook snippet (valid, I suppose; maybe listitems
need to be nested into <para>s): "<para>This doesn't
<itemizedlist><listitem>work,</listitem>
<listitem>stink</listitem></itemizedlist> at all, no matter how hard
you try.</para>"

The sentence here is "This doesn't <li>work,</li> <li>stink</li> at
all, no matter how hard you try".
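
To see why this is awkward for a sentence splitter, the snippet can be
parsed with Python's standard xml.etree.ElementTree, as in the small
sketch below.  The snippet is taken as written above, without checking
it against the DocBook DTD, and the variable names are only
illustrative.

```python
import xml.etree.ElementTree as ET

snippet = (
    "<para>This doesn't "
    "<itemizedlist><listitem>work,</listitem> "
    "<listitem>stink</listitem></itemizedlist>"
    " at all, no matter how hard you try.</para>"
)

para = ET.fromstring(snippet)

# The only direct child of <para> is the block-level list ...
child_tags = [child.tag for child in para]

# ... yet the running text forms a single sentence that passes
# straight through it.
flat_text = "".join(para.itertext())
```

Here child_tags contains just "itemizedlist", and flat_text is one
sentence, so a sentence-level tool has to carry the whole list inside
a single message.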

> We have the following situations here
>
>      1. Existing translations can't be reused if you switch from
>         paragraph messages to sentence messages.

But it's just as true that existing translations can't be reused if
one switches from sentence messages to paragraph messages.  In
different contexts, the same sentences would take different forms
(e.g. in Serbian they might depend on the gender used in a previous
sentence), so you'd need to check them by hand anyway.

We've got the same problem, and I'm not yet convinced that using
sentences is better all round (it is for TMs, no dispute there :).

>      2. Translation Memories based on short isolated sentences are
>         better than paragraph based ones. It is easier to get more
>         matches.

Agreed.

> Should the sentence based approach be used? I think so, but many people
> may object to this.

Yeah, I'm not sure about that myself: paragraph-oriented translation
tools have some advantages and some disadvantages.  I'm probably
not very good at weighing them up, but Gnome documentation
translation has been stuck at the start for a long time (not counting
Sun-contributed translations -- I'm mostly worried about translations
by volunteers from the community; or is Sun perhaps planning on
starting a Serbian translation? ;).

>> Any comments or suggestions here from anybody at all?  I'm looking
>> for pretty stable algorithm to split by sentence without engaging
>> something like NLTK (nltk.sf.net I think).
>
> Sun tools have the algorithm. Once they open source the code you can get
> it from there. Send me a private message if you want more info on the
> subject.

Well, I was thinking of something simple, not too hard-core.  As
said, I'm just looking for a migration step which would help in most
cases: I'm not trying to solve it completely at this time, just trying
to ensure that most of the translators' work is not wasted.

Besides, it seems that you sent this as a private mail; if you wanted
it to go to the list (gnome-i18n@gnome.org), you're free to repost my
answer as well.

Cheers,
Danilo