Re: Translation tools for documentation



Hey there,

On Mon, 2004-03-29 at 17:07, Danilo Segan wrote:
> Yeah, I see.  You've probably covered all the possible corner-cases as
> well (like what if in some language "You can do:" needs to be split so
> that list comes in the middle); or at least, you have enough of a
> knowledge that it doesn't happen in any of the languages (I don't have
> the knowledge, so I try to handle even the cases like that).

Nope, actually we haven't done that yet! That's a pretty hard thing to
get right all the time, though it'd be an interesting problem to work
on. Basically, you're talking about sub-flows:

"Here is < some list > an interesting list"

XLIFF does provide support for this:

http://www.oasis-open.org/committees/xliff/documents/xliff-specification.htm#sub

It's a little harder to write the segmentation behaviour, and work would
need to be done on the editor side to support this, but nothing's
impossible. So far, our translators haven't seemed to mind the missing
functionality, but it'd be fun to investigate.

> > Absolutely. Here's my main worry (and maybe it's not such a big deal,
> > hard to tell) - if you start with translation tools that work at a
> > paragraph level, you're kinda stuck there.

> Yeah, I missed on that issue.  Definitely something to consider right
> away.  I'll see if I can get at least a simple support for
> sentence-splitting right now (it should be pretty easy, but would
> probably fail for some not very common cases)

That would be excellent - even a fairly quick and dirty sentence
splitter would mean you'd still get good fuzzy matches if you later
moved to a more complex sentence splitter (whereas paragraph-level
segments would give you very few fuzzy matches against sentence-level
ones).
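
To make that a bit more concrete, here's a rough Python sketch (not our
tooling, just an illustration). The splitting rule, the sample text and
the names split_sentences and best_match are all made up for the
example, and it uses difflib from the standard library as a stand-in for
a real fuzzy matcher:

```python
import re
from difflib import SequenceMatcher

# Quick and dirty rule (an assumption for this sketch): a sentence ends at
# '.', '!' or '?' followed by whitespace and an uppercase letter. It will
# mis-split abbreviations such as "e.g. Foo".
_SENTENCE_END = re.compile(r'(?<=[.!?])\s+(?=[A-Z])')


def split_sentences(paragraph):
    """Split a paragraph into sentence segments using the naive rule."""
    return [s.strip() for s in _SENTENCE_END.split(paragraph) if s.strip()]


def best_match(segment, memory):
    """Return (score, stored_segment) for the closest entry in memory."""
    return max((SequenceMatcher(None, segment, old).ratio(), old)
               for old in memory)


# Hypothetical sample data: an already-translated paragraph and a revision.
old_paragraph = ("Click the Save button. The file is written to disk. "
                 "Close the window.")
new_paragraph = ("Click the Save button. The file is written to the server. "
                 "Then log out and restart your session.")

# Paragraph-level memory: one big segment, so at best one partial match.
para_score, _ = best_match(new_paragraph, [old_paragraph])

# Sentence-level memory: each new sentence is matched on its own.
memory = split_sentences(old_paragraph)
sentence_scores = [best_match(s, memory)[0]
                   for s in split_sentences(new_paragraph)]

print("paragraph score:", round(para_score, 2))
print("sentence scores:", [round(s, 2) for s in sentence_scores])
```

With sentence segments, the unchanged first sentence comes back as an
exact match and the second as a close fuzzy match. With paragraph
segments, all you get is one partial match for the whole block.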

> I'm sure my result won't be on par with the
> quality of Sun's tools, but it will surely help migration provided one
> uses it.

Our tools are far from perfect, but thanks for the vote of confidence
;-)


> Still, since I
> cannot imagine how are you supporting sentence-reordering with
> sentence-based translations, I suppose you've discovered that it's
> very rarely needed

Yep - translators don't seem to mind that it's not there.

> I'm not yet decided on these, and I'll appreciate any further input.
> For those into technicalities, I'm simply planning to split a message
> already destined for PO file

That looks fine.

> Any comments or suggestions here from anybody at all?  I'm looking
> for pretty stable algorithm to split by sentence without engaging
> something like NLTK (nltk.sf.net I think).

I'd go with NLTK or something similar if you can (since the hard work is
already done).

We were using an older TM system before implementing this new one, so to
get maximum leverage from our old translations I had to write a sentence
segmenter from scratch using JavaCC (that was !fun) which closely
matched the segmentation behaviour of the old system. It's not perfect,
but it seems to handle most cases okay. If I were starting over today,
I'm not sure where I'd look (probably Google, for starters).

> > * format checking - if you're missing a tag in the translation that was
> > in the source, it'll tell you.
> 
> I'd like to see that solved for gettext as well: "xml-format" tag for
> GNU gettext (like c-format), which would indicate that string needs to
> be correct XML.

Does it do printf-style format checking at the moment? I've got a
parser that recognises these, which I need to build into the editor at
some stage so we can nicely highlight things like "%%+05d".
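
For the sake of illustration, here's roughly what such a check could
look like in Python. This is only a sketch, not the parser I mentioned
and not what gettext does: the regular expression covers a simplified
subset of printf syntax (no positional "%1$s" arguments, no space flag),
and the names format_specs and missing_specs are invented for the
example.

```python
import re
from collections import Counter

# Simplified printf grammar (an assumption for this sketch):
#   %[flags][width][.precision][length]conversion, or a literal "%%".
# Only non-capturing groups are used, so findall() returns whole matches.
PRINTF_SPEC = re.compile(
    r'%(?:%|[-+#0]*(?:\d+|\*)?(?:\.(?:\d+|\*))?'
    r'(?:hh|h|ll|l|L|j|z|t)?[diouxXeEfFgGaAcspn])'
)


def format_specs(text):
    """Return every printf-style specifier found in text, in order."""
    return PRINTF_SPEC.findall(text)


def missing_specs(source, translation):
    """Return specifiers used in source more often than in translation."""
    src = Counter(s for s in format_specs(source) if s != '%%')
    dst = Counter(s for s in format_specs(translation) if s != '%%')
    return sorted((src - dst).elements())


print(format_specs("Balance: %+05d (%s), 100%% done"))
print(missing_specs("Copied %d of %d files to %s",
                    "%d Dateien nach %s kopiert"))
```

The same compare-the-counts idea would work for checking that the tags
in the source also appear in the translation.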

> Well, with Emacs po-mode

Aah Emacs, it's wonderful!

> So basically, PO file format is not that bad itself, though it lacks
> the tight coupling between tools and format, which some other, more
> professional tools and their corresponding formats have.

Yep, I agree - it's just lacking standardisation. I mentioned this a
while back, but I think people tried to get the gettext/po format into
POSIX, and an opposing party was looking to do the same thing with the
catgets/msg format, so discussions broke down. My hope is that XLIFF
will become the format that people use for their translations, and then
whichever i18n API is in use, we can just output to that format: be it
.po .msg .properties .java .xml (?) .dtd - whatever. (but I'll stop
marketeering now ;-)
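
As a rough idea of what "just output to that format" could mean, here's
a small Python sketch. It's hypothetical throughout: the units list
stands in for translation units already read from an XLIFF file, the ids
and strings are sample data, and the two writers only handle the basic
escaping of each format (no plurals, comments or headers).

```python
def po_escape(text):
    """Escape backslashes, double quotes and newlines for a PO string."""
    return (text.replace('\\', '\\\\')
                .replace('"', '\\"')
                .replace('\n', '\\n'))


def to_po(units):
    """Write (id, source, target) units as msgid/msgstr pairs."""
    entries = []
    for _unit_id, source, target in units:
        entries.append('msgid "%s"\nmsgstr "%s"\n'
                       % (po_escape(source), po_escape(target)))
    return '\n'.join(entries)


def props_escape(text):
    """Escape a value for a Java .properties file using \\uXXXX escapes."""
    out = []
    for ch in text:
        if ch == '\\':
            out.append('\\\\')
        elif ch == '\n':
            out.append('\\n')
        elif ord(ch) > 126:
            # Encode as UTF-16 code units so non-BMP characters become
            # a surrogate pair of \uXXXX escapes.
            raw = ch.encode('utf-16-be')
            for i in range(0, len(raw), 2):
                out.append('\\u%04x' % int.from_bytes(raw[i:i + 2], 'big'))
        else:
            out.append(ch)
    return ''.join(out)


def to_properties(units):
    """Write (id, source, target) units as key=value lines."""
    return ''.join('%s=%s\n' % (unit_id, props_escape(target))
                   for unit_id, _source, target in units)


# One writer per output format; the translation units stay the same.
WRITERS = {'po': to_po, 'properties': to_properties}

units = [('file.save', 'Save "%s"?', 'Sábháil "%s"?')]
for name, writer in WRITERS.items():
    print('--- %s ---' % name)
    print(writer(units))
```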

	cheers,
			tim


-- 
Tim Foster - Translation Technology Engineer, Software Globalization
http://sunweb.ireland/~timf
http://www.netsoc.ucd.ie/~timf



