Re: About translating documents (.xml/.sgml) in GNOME



Malcolm,

>Cool! More feedback. I seem to have created a monster here, though ...
>people with experience in this field appear to be raising valid issues.
>
>I really appreciate all the comments, though and have tried to clear up
>any places where there may be confusion.
>

Happy to give you feedback! :-)

>OK, here we (you and I) have a small deviation in design goals. My
>intention is _not_ to work with fully general SGML syntax. The main
>focus, initially, is to handle DocBook-XML and HTML, since they are the
>two formats that are needed for GNOME translations. You get all XML
>documents for free at that point since it makes no sense to just write
>DocBook-specific code when I have a full XML parser at my disposal.
>

True. I'm mainly interested in SGML DocBook.

>The HTML handling requires some special cases, due to optional closing
>tags, etc, but I am going to do exactly that -- special case them.
>Handling fully general SGML documents is maybe a goal for the future,
>but it's not something I am going to touch initially because it adds
>soooo much to the original problem. Right now, I am designing to solve a
>particular problem. However, I am not going to write code that makes it
>impossible to extend things in the future (to handle all SGML, etc).
>

That's fair enough.

>It does in the DocBook case. For other XML documents you will need to
>specify them explicitly. I was using "programlisting" as an example of
>how such a tag might exist. This is just a case where I can implement
>all XML documents "for free", so adding support for them in the runtime
>configuration is necessary and not too hard.
>
>I think we are in agreement here that the --docbook option should do
>everything expected on a normal DocBook document.
>

Yes.

>This is sort of the point that Sander Vesik raised as well, I think
>(w.r.t things like <guilabel>). Tags like <emphasis> are inline anyway,
>so they stay. They are important to the translation context. Sander's
>earlier mail and yours has highlighted the necessity of creating a
>category of tags that are always going to have their contents translated
>the same way.
>

Agreed.

>All good points. I had not thought of this. Will add them to the design.
>Again, it comes down to a matter of representing that there is a
>footnote in the plain text and if the translation reorders things, the
>footnote insertion point needs to be moved as well. That is not hard to
>do, though.
>

Agreed.

>Agreed. That was my intention.
>

Ok then! :-)

>I expect that, in practice, it will work the same as with comments in
>source code attached to strings: the comment is added after a few
>translators have sent email wondering what the string is meant to be
>saying. I will add the functionality so it can be used if necessary.
>

Good point.

>Once we are splitting things up along the lines of what I earlier called
>"block elements" (essentially one msgstr = one paragraph), the updates
>are pretty easy, except in one case. When a paragraph is split in two or
>two paragraphs are merged, things will probably go a bit pear-shaped.
>But I think I can do _something_ sensible here to at least indicate
>that this might be what happened.
>
>Your feeling is probably right, though -- merging at the po-file level
>is probably the easiest way.
>

Yes, I think that's easier.
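
To make "merging at the po-file level" concrete, here is a minimal sketch
in Python. It is not the tool under discussion; the function name, the
dict-based catalog and the 0.6 cutoff are all assumptions for
illustration. Unchanged paragraphs keep their translation, slightly
changed ones inherit the closest old translation and are flagged fuzzy,
and new ones start empty.

```python
import difflib

def merge_catalog(old, new_msgids, cutoff=0.6):
    """Merge old translations into a freshly extracted list of msgids.

    old        -- dict mapping msgid -> msgstr (hypothetical in-memory catalog)
    new_msgids -- paragraphs extracted from the updated source document
    Returns a list of (msgid, msgstr, fuzzy) tuples in source order.
    """
    merged = []
    for msgid in new_msgids:
        if msgid in old:
            # Paragraph is unchanged: reuse the translation as-is.
            merged.append((msgid, old[msgid], False))
            continue
        # Paragraph changed: borrow the closest old translation, if any,
        # and flag it so a translator reviews it.
        close = difflib.get_close_matches(msgid, list(old), n=1, cutoff=cutoff)
        if close:
            merged.append((msgid, old[close[0]], True))
        else:
            merged.append((msgid, "", False))
    return merged
```

A split or merged paragraph would usually show up here as one fuzzy
entry plus one empty entry, which at least hints at what happened.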

>That is more or less the way gettext does it as well (under the covers).
>It makes sense to me.
>

Good thing it does it that way! :-)

>My gut reaction is "no", but I can see what you are talking about. I
>want to keep the source document _completely_ unchanged, including the
>original language it was written in. This enables us to take functional
>web pages, third-party documents and so on and translate them
>"transparently". However, in reality, this is exactly how I am
>intending to create the translated versions -- although the version you
>describe above will probably only exist as a virtual document during the
>document creation phase for each locale.
>

That's fair enough. I was thinking about errors in the original text.
It would be easier to fix them in the PO files themselves, since then
you would not have to do a new update just because you fixed a typo.
That's nothing major, though.

>First point: all of this section is me "thinking out loud", so there is
>some repeating of my ideas. I must admit that I posted this spec fairly
>quickly after coming into work and seeing that the topic had arisen on
>the list. What you have been reading is a document I originally wrote
>just for myself to order my thoughts.
>
>Second point: the case where extra text will be difficult is when the
>extra text includes extra block-level tags (which you discuss below).
>In the other cases I can think of, it will not be a problem, though.
>

Agreed.

>When I wrote "block elements", I meant portions of text that were
>surrounded by tags like <para>, yes.
>

Ok. What about <entity>, etc.?
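
For reference, the "one block element = one translatable entry" idea can
be sketched roughly as follows. This is only an illustration under
assumptions: the set of block tags is made up for the example, the
document is well-formed XML (not SGML), and inline tags such as
<emphasis> are kept inside the extracted string.

```python
import xml.etree.ElementTree as ET

# Assumption: only these tags are treated as block elements in this sketch.
BLOCK_TAGS = {"para", "title"}

def inner_markup(elem):
    """Return an element's content with inline child tags kept as markup."""
    parts = [elem.text or ""]
    for child in elem:
        # ET.tostring() also serialises the text following the child (its tail).
        parts.append(ET.tostring(child, encoding="unicode"))
    # Collapse line breaks and runs of spaces into single spaces.
    return " ".join("".join(parts).split())

def extract_msgids(xml_text):
    """Return one msgid per block element, in document order."""
    root = ET.fromstring(xml_text)
    return [inner_markup(e) for e in root.iter() if e.tag in BLOCK_TAGS]
```

Nested block elements and entity references are deliberately not
handled here; they would need extra rules.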

>Here I disagree with you, because I think you have missed an important
>corner case. In many texts I have read that are translated from other
>languages, the translator has often added useful footnotes to explain
>some portion of the text that may not carry over well otherwise. For
>example, explaining some idiom or cultural reference that is common in
>the source language but may be unknown in the target locale. Sometimes
>this problem can be avoided by translating in a different way, but on
>occasion this will throw away the flavour of the document at the same
>time.
>
>Some Russian texts are full of really nice sayings and proverbs. These
>do not always make sense in English, but translating them away would
>ruin the text in some sense, so a footnote explaining the saying and its
>origin is a good substitute. Now, this is not going to be a really
>common problem in technical documentation, but it is not inconceivable.
>
>Another case where extra elements may be required is in a list of steps
>to follow. In English, there may be fewer steps than in, say, Chinese,
>when some explanation of setting up an alternate input method may also
>be required. So the translator will need to add extra <listitem> tags
>(and since listitems contain paragraphs, they end up being one or more
>separate translatable items in my scheme).
>
>Allowing these necessary additions while prohibiting other more
>arbitrary changes seems hard, so if you really see this as an obstacle,
>maybe you can set out how you see it working.
>

Of course, you are absolutely right!
But this is exactly what brings all the problems, and it is the reason
I really don't like the idea. If translation is done PO-style, you have
to create new entries, entries for which there is no original. How do
you want to handle this? Make up unique dummy strings to be used as
msgids? These have to be unique not only within the file but within the
whole database, since otherwise you cannot really use a PO database
without getting all the wrong placements. Alternatively, use the
translation not only as msgstr but also as msgid? What do you do when
the original is updated and you try a new merge? Place a unique
character sequence in front of the translation so you can determine
which entries to leave out? Like the opposite of not marking something
for translation? Well, it's doable but not very nice, as far as I can
see. Next, you will have to extend your tool (we're using kbabel),
since adding new entries is generally not supported. That requires more
engineering effort on a third-party tool.
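
Just to illustrate the "unique dummy string" variant, here is a minimal
Python sketch. The marker prefix and the helper names are hypothetical,
not part of gettext or kbabel. A random UUID keeps the msgid unique
across the whole database, and the prefix lets a merge step recognise
entries that have no original. It does not solve the question of where
the added text should be placed in the document.

```python
import uuid

# Hypothetical reserved marker for entries that have no original text.
ADDED_PREFIX = "@@translator-added:"

def new_added_msgid():
    """Create a dummy msgid that is unique across files and databases."""
    return ADDED_PREFIX + uuid.uuid4().hex

def is_translator_added(msgid):
    """True if the entry was created by a translator, not by extraction."""
    return msgid.startswith(ADDED_PREFIX)

def split_catalog(catalog):
    """Separate entries backed by the source from translator-added ones.

    catalog -- dict mapping msgid -> msgstr
    Only the first dict should be matched against a newly extracted
    source; the second is carried over untouched.
    """
    sourced = {k: v for k, v in catalog.items() if not is_translator_added(k)}
    added = {k: v for k, v in catalog.items() if is_translator_added(k)}
    return sourced, added
```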

IMO, if a document is expected to undergo quite a few such changes,
then I don't think PO files are a good option. I'd rather convert the
whole thing into something that gives me a little more WYSIWYG. I
believe PO files are a good option for translations where the structure
of the document stays the same throughout the translation process.

But maybe that's just me? ;-)

>I don't really care who does it -- I just want to implement a way it can
>be done. This particular issue (translating only portions of otherwise
>untranslatable blocks) is one I really do not have any idea about how
>to resolve. Right now, there is a basket labelled "too hard" and it
>contains this item alone. For everything else, I have at least a first
>approximation of a solution.
>  
>

Whenever you add a paragraph as a msgid, check for certain tags within
the paragraph. If any of these tags are present, look up the text
between each and every one of them in your PO database. Replace that
text with its translation, place the result (English text, translated
tag contents) as the msgstr and mark it fuzzy. That's what I'd do
(provided you don't find a larger portion of the text already in the
database).
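
A rough Python sketch of that lookup, under some assumptions: the PO
database is a plain dict from msgid to msgstr, the list of tags is just
an example, and the tags carry no attributes. None of these names come
from an existing tool.

```python
import re

# Assumption: tags whose contents are always translated the same way.
FIXED_TAGS = ("guilabel", "guimenu")
TAG_PATTERN = re.compile(r"<(%s)>(.*?)</\1>" % "|".join(FIXED_TAGS), re.S)

def prefill(msgid, database):
    """Return (msgstr, fuzzy) for a new paragraph.

    database -- dict mapping msgid -> msgstr (hypothetical PO database)
    """
    if msgid in database:
        # The whole paragraph is already known: use it, not fuzzy.
        return database[msgid], False

    replaced = False

    def swap(match):
        nonlocal replaced
        tag, text = match.group(1), match.group(2)
        if text in database:
            replaced = True
            return "<%s>%s</%s>" % (tag, database[text], tag)
        return match.group(0)  # unknown content stays untouched

    msgstr = TAG_PATTERN.sub(swap, msgid)
    # English text with translated tag contents is only a starting
    # point, so it is marked fuzzy for the translator.
    return (msgstr, True) if replaced else ("", False)
```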

>While I realise (or at least hope) you are partially tongue-in-cheek
>here, let me be explicit for everybody else: I want to extract the
>translatable portions from the original, unadulterated markup. Unlike
>marking code for internationalisation purposes, I do not want to require
>special markup in the sources. So all of the work about what tags to
>include in the strings to be translated and where to break the strings
>and what to translate is to be done by this mythical tool we are
>designing.
>

Completely agree. We cannot expect anyone to place any special tags in
there, nor can we expect the tags to be free of errors or to have all
been placed correctly (or placed at all).

>[By the way, are you on gnome-i18n-list or do you want to be CC'd on any
>future significant developments here? I will at least pull together
>another design spec based on the feedback so far and knock up some code
>to show none of this is impossible this weekend.]
>

If you mean gnome-i18n@gnome.org, then yes. I'm on the list.

And I'm waiting for the code!! :-)

Cheers,
Bernd

-- 
Dr. Bernd R. Groh <bgroh@redhat.com>
Red Hat Asia-Pacific
Disclaimer: http://apac.redhat.com/disclaimer

"Everything we know is an illusion,
 nothing we know is real,
 nothing real we can know,
 illusion is what we call reality."




