Re: About translating documents (.xml/.sgml) in GNOME



Malcolm,

Cool! More feedback. I seem to have created a monster here, though ...
people with experience in this field appear to be raising valid issues.

I really appreciate all the comments, though, and have tried to clear up
any places where there may be confusion.


Happy to give you feedback! :-)

OK, here we (you and I) have a small deviation in design goals. My
intention is _not_ to work with fully general SGML syntax. The main
focus, initially, is to handle DocBook-XML and HTML, since they are the
two formats that are needed for GNOME translations. You get all XML
documents for free at that point since it makes no sense to just write
DocBook-specific code when I have a full XML parser at my disposal.


True. I'm mainly interested in SGML DocBook.

The HTML handling requires some special cases, due to optional closing
tags, etc, but I am going to do exactly that -- special case them.
Handling fully general SGML documents is maybe a goal for the future,
but it's not something I am going to touch initially because it adds
soooo much to the original problem. Right now, I am designing to solve a
particular problem. However, I am not going to write code that makes it
impossible to extend things in the future (to handle all SGML, etc).


That's fair enough.

It does in the DocBook case. For other XML documents you will need to
specify them explicitly. I was using "programlisting" as an example of
how such a tag might exist. This is just a case where I can implement
all XML documents "for free", so adding support for them in the runtime
configuration is necessary and not too hard.

I think we are in agreement here that the --docbook option should do
everything expected on a normal DocBook document.


Yes.

This is sort of the point that Sander Vesik raised as well, I think
(w.r.t things like <guilabel>). Tags like <emphasis> are inline anyway,
so they stay. They are important to the translation context. Sander's
earlier mail and yours have highlighted the necessity of creating a
category of tags that are always going to have their contents translated
the same way.


Agreed.

All good points. I had not thought of this. Will add them to the design.
Again, it comes down to a matter of representing that there is a
footnote in the plain text and if the translation reorders things, the
footnote insertion point needs to be moved as well. That is not hard to
do, though.


Agreed.
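
As a rough illustration of the footnote handling discussed above (not the
actual tool), here is a minimal Python sketch that swaps each <footnote> for
a numbered marker in the translatable text and puts the footnotes back
wherever the translator moved the markers. The {fn1} marker syntax and both
function names are made up for this example, and it assumes footnotes carry
no attributes and are not nested.

```python
import re

# Assumption: footnotes look like <footnote>...</footnote> with no attributes.
FOOTNOTE_RE = re.compile(r"<footnote>(.*?)</footnote>", re.DOTALL)


def extract_footnotes(para):
    """Replace each footnote with a numbered marker; return (text, notes)."""
    notes = []

    def stash(match):
        notes.append(match.group(1))
        return "{fn%d}" % len(notes)

    return FOOTNOTE_RE.sub(stash, para), notes


def restore_footnotes(translated, translated_notes):
    """Re-insert footnotes at the positions of the (possibly moved) markers."""
    def unstash(match):
        body = translated_notes[int(match.group(1)) - 1]
        return "<footnote>%s</footnote>" % body

    return re.sub(r"\{fn(\d+)\}", unstash, translated)


text, notes = extract_footnotes(
    "Click Save<footnote><para>Or press Ctrl+S.</para></footnote> to finish."
)
# text is what the translator sees; the footnote body is translated separately.
result = restore_footnotes(
    "Zum Abschluss{fn1} auf Speichern klicken.",
    ["<para>Oder Strg+S.</para>"],
)
```

Because the marker travels with the sentence, reordering the translation
moves the footnote insertion point along with it.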

Agreed. That was my intention.


Ok then! :-)

I expect that, in practice, it will work the same as with comments in
source code attached to strings: the comment is added after a few
translators have sent email wondering what the string is meant to be
saying. I will add the functionality so it can be used if necessary.


Good point.

Once we are splitting things up along the lines of what I earlier called
"block elements" (essentially one msgstr = one paragraph), the updates
are pretty easy, except in one case. When a paragraph is split in two or
two paragraphs are merged, things will probably go a bit pear-shaped.
But I think I can do _something_ sensible here to at least indicate that
this might have been what happened.

Your feeling is probably right, though -- merging at the po-file level
is probably the easiest way.


Yes, I think that's easier.
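
A minimal sketch of how a split paragraph might be flagged when comparing an
old and a new version of a document, assuming paragraphs are available as
plain strings and using Python's difflib for the similarity test. The
function names and the 0.9 threshold are assumptions for illustration, not
part of any existing tool.

```python
from difflib import SequenceMatcher


def similar(a, b, threshold=0.9):
    """True if the two strings are nearly identical."""
    return SequenceMatcher(None, a, b).ratio() >= threshold


def find_splits(old_paras, new_paras, threshold=0.9):
    """Return (old_index, new_index) pairs where old_paras[old_index]
    looks like new_paras[new_index] joined with new_paras[new_index + 1]."""
    hits = []
    for i, old in enumerate(old_paras):
        if old in new_paras:
            continue  # unchanged paragraph, nothing to flag
        for j in range(len(new_paras) - 1):
            joined = new_paras[j] + " " + new_paras[j + 1]
            if similar(old, joined, threshold):
                hits.append((i, j))
    return hits


old = ["Open the file. Then save it.", "Quit."]
new = ["Open the file.", "Then save it.", "Quit."]
splits = find_splits(old, new)
# A merge is the same check with the roles swapped.
merges = find_splits(new, old)
```

A hit is only a hint that the old msgstr may be reusable across the two new
entries; the translator would still have to confirm it.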

That is more or less the way gettext does it as well (under the covers).
It makes sense to me.


Good thing it does it that way! :-)

My gut reaction is "no", but I can see what you are talking about. I
want to keep the source document _completely_ unchanged, including the
original language it was written in. This enables us to take functional
web pages, third-party documents and so on and translate them
"transparently". However, in reality, this is exactly how I am
intending to create the translated versions -- although the version you
describe above will probably only exist as a virtual document during the
document creation phase for each locale.


That's fair enough. I was already thinking about errors in the original text. It would be easier to fix them in the PO files themselves, since then you would not have to do a new update just because you fixed a typo. That's nothing major, though.

First point: all of this section is me "thinking out loud", so there is
some repeating of my ideas. I must admit that I posted this spec fairly
quickly after coming into work and seeing that the topic had arisen on
the list. What you have been reading is a document I originally wrote
just for myself to order my thoughts.

Second point: the case where extra text will be difficult is when the
extra text includes extra block-level tags (which you discuss below).
In the other cases I can think of, it will not be a problem, though.


Agreed.

When I wrote "block elements", I meant portions of text that were
surrounded by tags like <para>, yes.


Ok. What about <entity>, etc.?

Here I disagree with you, because I think you have missed an important
corner case. In many texts I have read that are translated from other
languages, the translator has often added useful footnotes to explain
some portion of the text that may not carry over well otherwise. For
example, explaining some idiom or cultural reference that is common in
the source language but may be unknown in the target locale. Sometimes
this problem can be avoided by translating in a different way, but on
occasion this will throw away the flavour of the document at the same
time.

Some Russian texts are full of really nice sayings and proverbs. These
do not always make sense in English, but translating them away would
ruin the text in some sense, so a footnote explaining the saying and its
origin is a good substitute. Now, this is not going to be a really
common problem in technical documentation, but it is not inconceivable.

Another case where extra elements may be required is in a list of steps
to follow. In English, there may be fewer steps than in, say, Chinese,
when some explanation of setting up an alternate input method may also
be required. So the translator will need to add extra <listitem> tags
(and since listitems contain paragraphs, they end up being one or more
separate translatable items in my scheme).

Allowing these necessary additions while prohibiting other more
arbitrary changes seems hard, so if you really see this as an obstacle,
maybe you can set out how you see it working.


Of course, you are nothing but right!
This, of course, brings all the problems with it (and that's the reason I really don't like this idea). If translation is done PO-style, you have to create new entries, entries for which there is no original. How do you want to handle this? Make up unique dummy strings to be used as msgids? These have to be unique not only within the file but across the whole database, since otherwise you cannot really use a PO database: you will get all the wrong placements. Alternatively, use the translation not only as msgstr but also as msgid? What do you do when you have an update to the original and try a new merge? Place a unique character sequence in front of the translation so you can determine which entries to leave out? Like the opposite of not marking it for translation? Well, it's doable, but not very nice, as far as I can see.

Next, you will have to extend your tool (we're using KBabel), since adding new entries is generally not supported. That requires more engineering effort on a third-party tool.

IMO, if a document is anticipated to undergo quite a few such changes, then I don't think PO files are a good option. I'd rather convert the whole thing into something a little more WYSIWYG. PO files are, I believe, a good option for translations where the structure of the document remains the same throughout the translation process.

But maybe that's just me? ;-)

I don't really care who does it -- I just want to implement a way it can
be done. This particular issue (translating only portions of otherwise
untranslatable blocks) is one I really do not have any idea about how
to resolve. Right now, there is a basket labelled "too hard" and it
contains this item alone. For everything else, I have at least a first
approximation of a solution.

Whenever you add a paragraph as msgid, check for certain tags within the paragraph. If any of these tags are present, look up the text of each and every one of them in your PO database. Replace the text in between these tags with its translation, place the result (English text, translated tag contents) as msgstr and mark it fuzzy. That's what I'd do (given you don't find a larger portion of the text already in the database).
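
The lookup described above can be sketched roughly as follows in Python. This
is only an illustration under stated assumptions: the PO database is modelled
as a plain dict from msgid to msgstr, the list of tags to look up is
hypothetical, and the function name is invented for this example.

```python
import re

# Hypothetical set of inline tags whose contents are looked up separately.
LOOKUP_TAGS = ("guilabel", "guimenu", "guibutton")
TAG_RE = re.compile(r"<(%s)>(.*?)</\1>" % "|".join(LOOKUP_TAGS))


def pretranslate(msgid, po_db):
    """Return (msgstr, fuzzy) for a new paragraph.

    po_db is assumed to be a dict mapping known msgids to msgstrs.
    """
    if msgid in po_db:
        return po_db[msgid], False  # whole paragraph already translated

    replaced = 0

    def swap(match):
        nonlocal replaced
        tag, text = match.group(1), match.group(2)
        if text in po_db:
            replaced += 1
            return "<%s>%s</%s>" % (tag, po_db[text], tag)
        return match.group(0)  # unknown label: leave untouched

    msgstr = TAG_RE.sub(swap, msgid)
    if replaced:
        return msgstr, True  # partial result, so mark it fuzzy
    return "", False  # nothing found: leave the entry untranslated


db = {"File": "Datei", "Save": "Speichern"}
msgstr, fuzzy = pretranslate(
    "Open the <guimenu>File</guimenu> menu and click "
    "<guibutton>Save</guibutton>.",
    db,
)
```

The fuzzy flag matters here: the surrounding sentence is still English, so
the entry must not be treated as a finished translation.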

While I realise (or at least hope) you are partially tongue-in-cheek
here, let me be explicit for everybody else: I want to extract the
translatable portions from the original, unadulterated markup. Unlike
marking code for internationalisation purposes, I do not want to require
special markup in the sources. So all of the work about what tags to
include in the strings to be translated and where to break the strings
and what to translate is to be done by this mythical tool we are
designing.


Completely agree. We cannot expect anyone to place any special tags in there, neither can we expect them to have no errors in tags, or to have placed all tags correctly (and really all of them).
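
To make the "one paragraph = one msgid" extraction from unmodified markup
concrete, here is a minimal Python sketch using the standard library's
xml.etree.ElementTree. It is an assumption-laden illustration, not the
proposed tool: the set of block tags is hypothetical, the input must be
well-formed XML (so it does not address documents with tag errors), and
nested block elements would be emitted both inside their parent and on
their own.

```python
import xml.etree.ElementTree as ET

# Hypothetical choice of block-level tags; a real list would be configurable.
BLOCK_TAGS = {"para", "title"}


def inner_xml(elem):
    """Serialise an element's content, keeping inline tags but dropping
    the element's own start and end tags, with whitespace normalised."""
    parts = [elem.text or ""]
    for child in elem:
        # tostring() includes the child's tail text by default.
        parts.append(ET.tostring(child, encoding="unicode"))
    return " ".join("".join(parts).split())


def extract_msgids(xml_text):
    """Return one msgid per block element, in document order."""
    root = ET.fromstring(xml_text)
    return [inner_xml(e) for e in root.iter() if e.tag in BLOCK_TAGS]


doc = (
    "<article><title>Saving</title>"
    "<para>Click <guibutton>Save</guibutton> to\n  finish.</para>"
    "</article>"
)
msgids = extract_msgids(doc)
```

The source document is only read, never rewritten, which matches the goal of
leaving the original markup unadulterated.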

[By the way, are you on gnome-i18n-list or do you want to be CC'd on any
future significant developments here? I will at least pull together
another design spec based on the feedback so far and knock up some code
to show none of this is impossible this weekend.]


If you mean gnome-i18n gnome org, then yes. I'm on the list.

And I'm waiting for the code!! :-)

Cheers,
Bernd

--
Dr. Bernd R. Groh <bgroh redhat com>
Red Hat Asia-Pacific
Disclaimer: http://apac.redhat.com/disclaimer

"Everything we know is an illusion,
nothing we know is real,
nothing real we can know,
illusion is what we call reality."




