Re: Translation tools for documentation



Hi Tim, others,

Today at 15:58, Tim Foster wrote:

> In general, we use the structure of the input format when possible to
> define what are blocks of translatable text : things like paragraphs
> with inline <emphasis> tags would make up a contiguous block of text.
> The Docbook spec pretty much defines what's a block-level tag and what's
> inline (has it on page 33 of my copy of the 1st O'Reilly duck book).
>
> Then we run a sentence segmentation algorithm on that block, resulting
> in segments. That's over simplifying a little, but you get the general
> idea.

Yeah, I see.  You've probably covered all the possible corner cases as
well (for example, a language where "You can do:" has to be split so
that the list comes in the middle); or at least you know enough to be
sure that it doesn't happen in any of your languages (I don't, so I
try to handle even cases like that).

> Absolutely. Here's my main worry (and maybe it's not such a big deal,
> hard to tell) - if you start with translation tools that work at a
> paragraph level, you're kinda stuck there. That is, you've got .po files
> with one paragraph per msg from one release. If you change to sentence
> level segmentation in the next release, you're not going to get the same
> level of reuse from those old po files with paragraph level segments
> without doing some pretty hairy alignment-processing between them (that
> is, finding which sentences in each source paragraph correspond to which
> sentences in each translated paragraph, in order to build a useful TM.)

Yeah, I missed that issue.  Definitely something to consider right
away.  I'll see if I can get at least simple support for
sentence-splitting in right now (it should be pretty easy, but would
probably fail for some uncommon cases), though I'm not sure I'd make
it the default.  I'm sure the result won't be on par with the quality
of Sun's tools, but it will surely help migration provided one uses
it.

At the same time, I'd create a simple program to merge these
sentence-based PO translations into paragraph-oriented ones (that
direction should be easy; if the other way around were easy, there
would be no need for any of this discussion :).  Still, since I
cannot imagine how you support sentence reordering with
sentence-based translations, I suppose you've discovered that it's
very rarely needed.  So I guess it would be simple to extract the
sentences from the paragraphs in the same way as you do now, even
from translations, and map them by order (i.e. the 1st sentence of
the original is mapped to the 1st sentence of the translation).
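
A rough sketch of both directions mentioned above, assuming the
sentences have already been extracted by some splitter.  The function
names and the dictionary-based "TM" are made up for illustration, and
the sample translations are only placeholders:

```python
def align_by_order(source_sentences, translated_sentences):
    """Pair the 1st source sentence with the 1st translated one, etc.

    Returns None when the counts differ, since by-order mapping would
    then pair unrelated sentences (e.g. after a translator merged or
    split a sentence).
    """
    if len(source_sentences) != len(translated_sentences):
        return None
    return list(zip(source_sentences, translated_sentences))


def merge_into_paragraph(source_sentences, sentence_tm):
    """Build a paragraph translation from sentence-level translations.

    sentence_tm maps a source sentence to its translation.  Returns
    None if any sentence is missing, so that a half-translated
    paragraph is never emitted.
    """
    parts = []
    for sentence in source_sentences:
        translation = sentence_tm.get(sentence)
        if not translation:
            return None
        parts.append(translation)
    return " ".join(parts)


# Old paragraph-level translation -> sentence pairs for a TM.
pairs = align_by_order(
    ["Open the file.", "Then save it."],
    ["Otvorite datoteku.", "Zatim je sacuvajte."],
)

# Sentence-level translations -> paragraph-level message.
paragraph = merge_into_paragraph(
    ["Open the file.", "Then save it."], dict(pairs)
)
```

Joining with a single space is itself an assumption; it does not
preserve the original whitespace, and it cannot express any
reordering of sentences.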

I haven't decided on these yet, and I'd appreciate any further input.
For those into technicalities, I'm simply planning to split a message
already destined for the PO file.  So in the example from my previous
mail, I'd have the following messages that would pass through a
"sentence-splitter" [this is how xml2po currently works] before being
output into the PO file:

msgid "First thing to do"
msgid "Second thing to do"
msgid "Other thing to do"
msgid "You can do: <placeholder-1/> Any of these things will achieve something."

Obviously, any simple sentence-splitter I come up with wouldn't change
anything here, but in other, more common cases it would help.

Any comments or suggestions here from anybody at all?  I'm looking
for a pretty stable algorithm to split by sentence without pulling in
something like NLTK (nltk.sf.net I think).
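
To make the question concrete, here is the kind of naive splitter
meant above, as a sketch only.  It is not what xml2po or Sun's tools
do; the abbreviation list is a made-up English sample, and a real one
would have to be per-language.  It breaks after ".", "!" or "?"
followed by whitespace, unless the last word is a known abbreviation:

```python
import re

# Hypothetical sample list; real data would depend on the language.
ABBREVIATIONS = {"e.g.", "i.e.", "etc.", "Mr.", "Dr.", "vs."}

# Candidate boundary: sentence-ending punctuation, optionally followed
# by closing quotes or brackets, then whitespace.
BOUNDARY = re.compile(r'[.!?]["\')\]]*\s+')


def split_sentences(text):
    """Split text into sentences using punctuation plus whitespace."""
    sentences = []
    start = 0
    for match in BOUNDARY.finditer(text):
        candidate = text[start:match.end()].strip()
        # Do not break after a known abbreviation such as "e.g.".
        if candidate.split()[-1] in ABBREVIATIONS:
            continue
        sentences.append(candidate)
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


unchanged = split_sentences(
    "You can do: <placeholder-1/> Any of these things will achieve something."
)
two = split_sentences("Use the menu, e.g. File. Then save it.")
```

The message with the placeholder comes back as a single segment,
while the second example is split in two.  Known weak spots: a
sentence that really ends in an abbreviation, and ". " inside an XML
attribute value, which would be treated as a boundary.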

> Absolutely - I agree. For now, you're doing the right thing with po
> files I think : just need to try not to get locked into one way of doing
> things and keep in mind the best way.

Yeah, thanks for helping with that :)

On the topic of features, I'll comment with my own wishes for
gettext: I agree they're all great features, and I'm trying to see
what's implementable for PO files as well, so it might not be too
interesting for you.

> * format checking - if you're missing a tag in the translation that was
> in the source, it'll tell you.

I'd like to see that solved for gettext as well: an "xml-format" flag
for GNU gettext (like c-format), which would indicate that the string
needs to be correct XML.  Of course, since msgfmt is not an editor, it
would be up to whatever tools we use to run that check (msgfmt checks
c-format, but translators sometimes forget to run it; msgfmt is also
called from Emacs PO-mode if you press "V" [verify]).  A somewhat
orthogonal approach would be for the editing tools to verify the
content themselves, instead of relying on gettext-provided msgfmt.
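
As an illustration of what such a check could look like in a tool (no
"xml-format" flag exists in gettext; this is only a sketch of the
idea), the following assumes "correct" means that the translation is
well-formed and uses the same elements the same number of times as
the original.  The function names are made up:

```python
import xml.etree.ElementTree as ET
from collections import Counter


def tag_counts(fragment):
    """Count element names in an XML fragment.

    Returns None if the fragment is not well-formed.  The fragment is
    wrapped in a dummy root, since a message is rarely a full document.
    """
    try:
        root = ET.fromstring("<msg>" + fragment + "</msg>")
    except ET.ParseError:
        return None
    return Counter(el.tag for el in root.iter() if el is not root)


def check_xml_format(msgid, msgstr):
    """Return a list of problems; an empty list means the check passed."""
    source = tag_counts(msgid)
    translation = tag_counts(msgstr)
    if source is None:
        return ["original is not well-formed XML"]
    if translation is None:
        return ["translation is not well-formed XML"]
    problems = []
    for tag in sorted(set(source) | set(translation)):
        if source[tag] != translation[tag]:
            problems.append(
                "tag <%s>: %d in original, %d in translation"
                % (tag, source[tag], translation[tag])
            )
    return problems
```

One caveat: named entities that are only defined in the document's
DTD (anything beyond the five predefined XML ones) make the parser
fail here, so a real check would have to deal with those first.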

> * integrated "mini" TM - if you've already translated something, and
> come across the same or similar translation further down the file, it'll
> suggest that as a possible translation

Yeah, great feature, it would help tremendously, especially with
documentation (in my experience of translating two small documents ;).

This is the one feature which very much depends on the tool
used, so that's where Sun's software has the edge.  I believe
Gtranslator has something like that, but I was never able to make
that work for me.
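
For what it's worth, the core of such a "mini" TM is not much code.
The sketch below is not how Sun's tools or Gtranslator do it; it just
uses difflib from the Python standard library to look up similar,
already translated strings.  The function name, the 0.6 cutoff and
the sample translations are arbitrary:

```python
import difflib


def suggest(msgid, translated, cutoff=0.6):
    """Suggest translations of strings similar to msgid.

    translated maps already translated msgids to their msgstrs.
    Returns up to three (similar msgid, its msgstr) pairs, best match
    first.
    """
    matches = difflib.get_close_matches(
        msgid, list(translated), n=3, cutoff=cutoff
    )
    return [(match, translated[match]) for match in matches]


done = {
    "Open the file": "Otvori datoteku",
    "Close the window": "Zatvori prozor",
}
suggestions = suggest("Open the files", done)
```

The hard part is in the editor: showing the suggestion at the right
moment and letting the translator accept or fix it.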

> * uses aspell - has spel checqing too :-)

Well, with Emacs po-mode, it's trivial to add spell-checking to the
relevant hook (see, liek I do with my mial :) for it to check only the
translation (not the original string, of course).  I don't know about
other tools.

> * status of translation - can mark a translation as "needing review" or
> other status types

A standardized way to do that with PO files would also be welcome.
Some translators from the Serbian team have in the past used the
"fuzzy" marking for that, which only creates a bigger mess (you can't
tell which messages need review once msgmerge also marks newly found
"similar" translations as fuzzy).  So, clearly, there's a need for
that.

> * easy navigation - can easily jump to next untranslated, unreviewed,
> fuzzy match or 100% match string

That's provided by most translators' tools that use the PO format as
a base, at least those I've tried.  The exception is "unreviewed",
which is to be expected since there's no standardized way to mark
translations as needing review.

> * translator comments - can add a comment to any translation

The PO file format supports it, and po-mode does as well; I'm not
sure about others (I doubt many other tools support it).  I currently
use it to mark strings that might need review, and to note bugs in
the original string which I need to report to bugzilla.
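
A sketch of how comment-based review marking could be picked up by a
script.  The "REVIEW" keyword is only an example convention, not
something gettext or po-mode defines, and the parsing is deliberately
minimal (it reports only the first line of a multi-line msgid):

```python
def entries_needing_review(po_text, marker="REVIEW"):
    """List msgids whose translator comments contain the marker.

    Translator comments are the lines starting with "# " that precede
    an entry's msgid.
    """
    flagged = []
    pending = False
    for line in po_text.splitlines():
        if line.startswith("# ") and marker in line:
            pending = True
        elif line.startswith("msgid "):
            if pending:
                # Drop the keyword and the surrounding double quotes.
                flagged.append(line[len("msgid "):].strip()[1:-1])
            pending = False
        elif not line.strip():
            # A blank line ends the entry the comment belonged to.
            pending = False
    return flagged


sample = '''# REVIEW: check terminology
msgid "Open file"
msgstr "Otvori datoteku"

msgid "Close"
msgstr "Zatvori"
'''
flagged = entries_needing_review(sample)
```

Unlike the "fuzzy" flag, such a comment is not touched when msgmerge
marks other entries as fuzzy, so the two kinds of marks stay apart.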

So basically, the PO file format itself is not that bad, though it
lacks the tight coupling between tools and format that some other,
more professional tools and their corresponding formats have.

Cheers,
Danilo


