On Mon, 2004-03-29 at 13:07, Danilo Segan wrote:
> Hi Tim, others,

Hi,

> Today at 15:58, Tim Foster wrote:
>
> > In general, we use the structure of the input format when possible to
> > define what are blocks of translatable text : things like paragraphs
> > with inline <emphasis> tags would make up a contiguous block of text.
> > The Docbook spec pretty much defines what's a block-level tag and what's
> > inline (has it on page 33 of my copy of the 1st O'Reilly duck book).
> >
> > Then we run a sentence segmentation algorithm on that block, resulting
> > in segments. That's over simplifying a little, but you get the general
> > idea.
>
> Yeah, I see. You've probably covered all the possible corner-cases as
> well (like what if in some language "You can do:" needs to be split so
> that list comes in the middle); or at least, you have enough of a
> knowledge that it doesn't happen in any of the languages (I don't have
> the knowledge, so I try to handle even the cases like that).
>
> > Absolutely. Here's my main worry (and maybe it's not such a big deal,
> > hard to tell) - if you start with translation tools that work at a
> > paragraph level, you're kinda stuck there. That is, you've got .po files
> > with one paragraph per msg from one release. If you change to sentence
> > level segmentation in the next release, you're not going to get the same
> > level of reuse from those old po files with paragraph level segments
> > without doing some pretty hairy alignment-procesing between them (that
> > is, finding which sentences in each source paragraph correspond to which
> > sentences in each translated paragraph, in order to build a useful TM.)

You are absolutely right here. Existing TM will not be good for reuse.

> Yeah, I missed on that issue. Definitely something to consider right
> away.
> I'll see if I can get at least a simple support for
> sentence-splitting right now (it should be pretty easy, but would
> probably fail for some not very common cases), though I'm not sure if
> I'd make it the default. I'm sure my result won't be on par with the
> quality of Sun's tools, but it will surely help migration provided one
> uses it.

Be careful. Segmentation (that's the name of the sentence-splitting
process) depends heavily on the language. There are some details that
you must consider to avoid splitting sentences in the wrong place.
Consider the following example in Spanish:

    El Sr. y La Sra. desean salir.

The above sentence contains two abbreviations, and it must not be split
after either of them. You need a dictionary of the abbreviations valid
in each language you wish to support.

> At the same time, I'd create a simple program to merge these
> sentence-based PO translations into paragraph-oriented ones (that
> direction should be easy; if the other way around was easy, there
> would be no need for any of this discussion :).

I already have a Java-based tool that converts PO to XLIFF and vice
versa. It does not support plural forms (not needed in Spanish, where I
use it). I can donate the source to GNOME.

> Still, since I
> cannot imagine how are you supporting sentence-reordering with
> sentence-based translations, I suppose you've discovered that it's
> very rarely needed, so I guess it would be simple to extract the
> sentences from the paragraphs in the same way as you do know, even
> from translations, and map them by order (i.e. 1st sentence from
> original is mapped to 1st sentence in translation).

My experience working with XLIFF documents tells me that sentence order
is fundamental. Translators need the ability to convert documents to
XLIFF format using either paragraph segmentation or sentence
segmentation. Both options should be available.

> I'm not yet decided on these, and I'll appreciate any further input.
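
To make the abbreviation problem concrete, here is a minimal sketch in
Python of a splitter that consults a per-language abbreviation list
before breaking. It is only an illustration, not the algorithm used by
Sun's tools or by xml2po: the names ABBREVIATIONS and split_sentences
are made up for this example, and the abbreviation lists are tiny
placeholders that a real tool would have to replace with a proper
dictionary for each language.

```python
import re

# Hypothetical, deliberately tiny abbreviation lists (an assumption for
# this sketch); a real tool needs a full dictionary per language.
ABBREVIATIONS = {
    "es": {"sr.", "sra.", "dr.", "ud."},
    "en": {"mr.", "mrs.", "dr.", "e.g.", "i.e."},
}

# Candidate break: whitespace that follows sentence-ending punctuation.
_BREAK = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text, lang):
    """Split text into sentences, never breaking after a known abbreviation."""
    abbreviations = ABBREVIATIONS.get(lang, set())
    sentences = []
    start = 0
    for match in _BREAK.finditer(text):
        candidate = text[start:match.start()].strip()
        words = candidate.split()
        last_word = words[-1].lower() if words else ""
        if last_word in abbreviations:
            # The period belongs to an abbreviation, so keep scanning.
            continue
        sentences.append(candidate)
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences
```

Called as split_sentences("El Sr. y La Sra. desean salir.", "es"), this
keeps the example above as a single segment. With a language code that
has no abbreviation list, the same text is wrongly cut after "Sr." and
"Sra.", which is exactly the failure to guard against.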
> For those into technicalities, I'm simply planning to split a message
> already destined for PO file, so in the example from my previous
> mail, I'd have following messages that would pass through
> "sentence-splitter" [this is how xml2po currently works] before being
> output into PO file:

When converting from XML to XLIFF you can use element segmentation AND
sentence segmentation. Formats like DocBook allow inline elements to be
inserted as tags in the paragraph, and the translator can reorder those
tags at will.

> msgid "First thing to do"
> msgid "Second thing to do"
> msgid "Other thing to do"
> msgid "You can do: <placeholder-1/> Any of these things will achieve something."
>
> Obviously, any simple sentence-splitter I'll come up wouldn't change
> anything here, but in other, more common cases it would help.

We have the following situation here:

1. Existing translations can't be reused if you switch from paragraph
   messages to sentence messages.
2. Translation memories based on short, isolated sentences are better
   than paragraph-based ones. It is easier to get more matches.

Should the sentence-based approach be used? I think so, but many people
may object to this.

> Any comments or suggestions here from anybody at all? I'm looking
> for pretty stable algorithm to split by sentence without engaging
> something like NLTK (nltk.sf.net I think).

Sun's tools have the algorithm. Once they open-source the code you can
get it from there.

Send me a private message if you want more info on the subject.

Regards,
Rodolfo
--
Rodolfo M. Raya <rmraya@maxprograms.com>
Maxprograms