Re: About translating documents (.xml/.sgml) in GNOME



Cool! More feedback. I seem to have created a monster here, though ...
people with experience in this field appear to be raising valid issues.

I really appreciate all the comments, though, and have tried to clear up
any places where there may be confusion.

On Fri, Jan 31, 2003 at 10:46:26AM +0100, Bernd Groh wrote:
> >A documentation translator -- design document
> >==============================================
[...]
> >DESIGN IDEAS:
> >--------------
> >There are two halves to this program. The first part is extracting all of the
> >translatable strings into the po files, ready for translation. The second
> >part is creating translated documents from the po files at build time.
> >
> >(1) Extracting the strings
> >
> >	When run, the program is given a list of tags which are
> >	considered "block elements" (by analogy with the concept in
> >	HTML). These tags are the ones which do not have significant
> >	bearing on the ability of their contents to be translated. So
> >	they are dropped in the conversion to po format. By way of
> >	example, in HTML we would consider the following tags to be
> >	amongst those which are block elements: p, h1, h2, br, hr,
> >	table, and so on.
> >
> 
> Hmmm, that's a start.
> 
> >	Determining the list of block elements is a separate task (and
> >	customisable at run time). However, for convenience, the program will
> >	already understand the block elements from HTML and DocBook 4.1, so
> >	invoking with the --html or --docbook options will suffice in those
> >	cases.
> >
> 
> It's not just a separate task, it's IMO THE task. Everything else is 
> simply a programming effort. And I don't think it's as simple as 
> compiling a list. You have to consider quite a number of things, 
> starting with nested tags, and nested tags within such nested tags, etc. 
> Then there are missing tags, errors in tags, etc. And do not forget 
> starting and ending comments, and especially ...<![%T1 [...]]><![%T2 
> [...]]><![%T3 [...]]>..., where ... here can be any valid SGML-DocBook, 
> to take that as an example. If somebody is able to correctly parse 
> SGML-DocBook and gets all the tokens, then we are basically finished. 
> Well, at least we are almost finished. :-)

OK, here we (you and I) have a small deviation in design goals. My
intention is _not_ to work with fully general SGML syntax. The main
focus, initially, is to handle DocBook-XML and HTML, since they are the
two formats that are needed for GNOME translations. You get all XML
documents for free at that point since it makes no sense to just write
DocBook-specific code when I have a full XML parser at my disposal.

The HTML handling requires some special cases, due to optional closing
tags, etc, but I am going to do exactly that -- special case them.
Handling fully general SGML documents is maybe a goal for the future,
but it's not something I am going to touch initially because it adds
soooo much to the original problem. Right now, I am designing to solve a
particular problem. However, I am not going to write code that makes it
impossible to extend things in the future (to handle all SGML, etc).

> >	The program can also be passed a list of "do not translate" elements.
> >	These would include tags like programlisting in DocBook, where it is
> >	almost always going to be correct to leave the source text
> >	untranslated.
> >
> 
> I think the --docbook option should know that by default. :-)

It does in the DocBook case. For other XML documents you will need to
specify them explicitly. I was using "programlisting" as an example of
how such a tag might exist. This is just a case where I can implement
all XML documents "for free", so adding support for them in the runtime
configuration is necessary and not too hard.

I think we are in agreement here that the --docbook option should do
everything expected on a normal DocBook document.
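
To make the extraction step concrete, here is a minimal sketch in
Python using the standard xml.etree.ElementTree parser. The tag lists
and the function names (inner_markup, extract_messages) are made up for
illustration and are far shorter than a real --docbook mode would need.
It also ignores text that sits directly inside a block element which
has other block children, so treat it as a sketch under those
assumptions rather than the actual design.

```python
import xml.etree.ElementTree as ET

# Hypothetical tag lists; a real --docbook mode would carry fuller ones.
BLOCK_ELEMENTS = {"article", "section", "title", "para"}
DO_NOT_TRANSLATE = {"programlisting"}


def inner_markup(elem):
    """Return the element's content, inline tags included, as one string."""
    parts = [elem.text or ""]
    for child in elem:
        # tostring() includes the child's tail text, so nothing is lost.
        parts.append(ET.tostring(child, encoding="unicode"))
    # Collapse runs of whitespace so source line wrapping does not matter.
    return " ".join("".join(parts).split())


def extract_messages(root):
    """Collect one message per innermost block element, in document order."""
    messages = []

    def walk(elem):
        if elem.tag in DO_NOT_TRANSLATE:
            return
        has_block_child = any(
            c.tag in BLOCK_ELEMENTS or c.tag in DO_NOT_TRANSLATE for c in elem
        )
        if elem.tag in BLOCK_ELEMENTS and not has_block_child:
            text = inner_markup(elem)
            if text:
                messages.append(text)
            return
        for child in elem:
            walk(child)

    walk(root)
    return messages
```

Inline tags such as <emphasis> and <guilabel> stay inside the message
string, while the <para> and <title> wrappers are dropped and
<programlisting> is skipped entirely.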

> >	Each message string in the po file will then be as large a chunk
> >	as is possible to extract from the document without including
> >	block elements or "do not translate" elements. Any additional
> >	markup between such elements will be included in the message
> >	string, since they provide information for the translators.
> 
> I believe <emphasis>, etc. tags should be left, so should <command>, 
> etc. tags. <guilabel>, etc. tags, should IMO be translated 
> automatically, simply being looked up in the po-database (if there, that 
> is).

This is sort of the point that Sander Vesik raised as well, I think
(w.r.t things like <guilabel>). Tags like <emphasis> are inline anyway,
so they stay. They are important to the translation context. Sander's
earlier mail and yours have highlighted the necessity of creating a
category of tags that are always going to have their contents translated
the same way.
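
One possible shape for that category, sketched in Python: look up the
contents of such tags in the program's own translation catalog before
the string reaches the translator. The tag list, the ui_catalog
dictionary and the function name are assumptions for illustration, and
a regular expression stands in for proper tree handling.

```python
import re

# Hypothetical list of tags whose contents come from the program's po file.
LOOKUP_TAGS = ("guilabel", "guimenu", "guibutton")
LOOKUP = re.compile(r"<(%s)>(.*?)</\1>" % "|".join(LOOKUP_TAGS))


def pretranslate_labels(message, ui_catalog):
    """Replace label text with the program's translation when one exists."""
    def swap(match):
        tag, label = match.group(1), match.group(2)
        # Unknown labels are left as they are for the translator to handle.
        return "<%s>%s</%s>" % (tag, ui_catalog.get(label, label), tag)

    return LOOKUP.sub(swap, message)
```

Labels missing from the catalog pass through unchanged, which matches
the "if there, that is" caveat in the quoted text.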

> >	[NOTE: Typically, a chunk for translation will be a paragraph. This
> >	seems like a sensible division, since it may lead to a better
> >	translation to reorganise the sentence structure, but keeping the
> >	paragraph structure the same should not be too much of a burden, from
> >	my limited experience of other languages.]
> >
> 
> Paragraph seems ok. Just get rid of any eventual <footnotes>, etc. I'd 
> change them into [1..n], since you have to keep the placement and then 
> have the footnote separate in the next entry. Of course there are other 
> parts to be considered, such as <entry>, etc. which should be separate 
> entries as well. Same goes for index terms and titles. If you keep the 
> order, then you should have enough context. Another easy solution would 
> be to use the closest index-term and a fixed url to provide (in the 
> comments to the entry) a link to the english HTML page. This should 
> provide enough context. :-)

All good points. I had not thought of this. Will add them to the design.
Again, it comes down to a matter of representing that there is a
footnote in the plain text and if the translation reorders things, the
footnote insertion point needs to be moved as well. That is not hard to
do, though.
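
A rough Python sketch of the footnote idea, assuming the [1..n] marker
style suggested above. The function names are hypothetical, nested
footnotes are not handled, and a literal "[1]" already present in the
text would collide with the markers, so a real version would work on
the parsed tree instead of using a regular expression.

```python
import re

FOOTNOTE = re.compile(r"<footnote>(.*?)</footnote>", re.DOTALL)


def split_footnotes(message):
    """Replace each footnote with [n]; return the text and the footnote bodies."""
    notes = []

    def mark(match):
        notes.append(match.group(1).strip())
        return "[%d]" % len(notes)

    return FOOTNOTE.sub(mark, message), notes


def restore_footnotes(translated, translated_notes):
    """Put footnotes back wherever the translator left the [n] markers."""
    def fill(match):
        index = int(match.group(1)) - 1
        return "<footnote>%s</footnote>" % translated_notes[index]

    return re.sub(r"\[(\d+)\]", fill, translated)
```

Because the marker travels with the translated text, a translator who
reorders the sentence moves the insertion point simply by moving "[1]".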

> >	In a normal program internationalisation effort, all strings
> >	from all files are put into a single po file for each language.
> >	However, when translating documentation, this approach does not
> >	seem efficient.  Firstly, it is not unreasonable to expect that
> >	only a fraction of the documentation in any package will be
> >	initially translated. Secondly, the po files will be much larger
> >	than for all but the largest programs, since user and developer
> >	documentation is often quite lengthy.  Typical use of this
> >	program will therefore place the po files for each document in
> >	their own directory (probably under the document's source
> >	directory, or its immediate parent).
> 
> I'd say one po-file per sgml-file. In this way you can keep the 
> file-structure as well, which also makes it easier to put the 
> translations back into sgml.

Agreed. That was my intention.

> >	[NOTE: It has also been floated that, since the source is
> >	usually under a C/ directory, alternative translations can go in
> >	directories labelled by their locale name -- so es/, no/, and so
> >	forth. The files in these directory would still be .po format
> >	files so that translators can use their current familiar
> >	techniques.]
> >
> >	In order to allow the documentation writer to provide
> >	translation hints and context descriptions to the translator,
> >	any comment block immediately prior to a message string is
> >	included in the po file as a comment (just as gettext does). So,
> >	for example, 
> >
> >		<!-- Insert an appropriate city for your locale -->
> >		<para>To predict the weather in Sydney, you would... </para>
> >
> >	will appear in the po file as
> >
> >		#. Insert an appropriate city for your locale
> >		#: weather-applet.xml:45,6
> >		msgid "To predict the weather in Sydney, you would... "
> >		msgstr ""
> 
> Nice idea! :-) If you get the writers to comment properly that is! ;-)

I expect that, in practice, it will work the same as with comments
attached to strings in source code: the comment is added after a few
translators have sent email wondering what the string is meant to be
saying. I will add the functionality so it can be used if necessary.
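
For illustration, a small Python sketch that writes an entry in the
format shown in the quoted example. The function names and argument
order are assumptions; the "file:line,column" reference simply follows
the example above.

```python
def po_escape(text):
    """Escape a string for use inside a double-quoted po string."""
    return text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def format_po_entry(msgid, filename, line, column, comment=None):
    """Build one po entry, with the preceding comment block if there was one."""
    lines = []
    if comment:
        # Each line of the source comment becomes an extracted "#." comment.
        for comment_line in comment.strip().splitlines():
            lines.append("#. " + comment_line.strip())
    lines.append("#: %s:%d,%d" % (filename, line, column))
    lines.append('msgid "%s"' % po_escape(msgid))
    lines.append('msgstr ""')
    return "\n".join(lines) + "\n"
```

Calling format_po_entry() with the Sydney paragraph, the file name
weather-applet.xml, position 45,6 and the comment text reproduces the
four-line entry from the design document.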

> >	As with gettext and intltool, this program will have an update
> >	mode for updating existing po files with new and changed strings
> >	and a report mode for giving statistics about the current
> >	translation status for each locale. Ideally, this will be run
> >	from the top of the source tree and will update all relevant .po
> >	files, without creating any new ones (new ones are created
> >	explicitly).
> 
> Where for each updated file, we should first create the entire po-file 
> again and then do the comparisons on the po-s, rather than trying to 
> figure what's been changed in the sgml-file. Which then again shows how 
> important it is to get that one right and comprehensive.

Once we are splitting things up along the lines of what I earlier called
"block elements" (essentially one msgstr = one paragraph), the updates
are pretty easy, except in one case. When a paragraph is split in two or
two paragraphs are merged, things will probably go a bit pear shaped.
But I think I can do _something_ sensible here to at least indicate that
this might have been what happened.

Your feeling is probably right, though -- merging at the po-file level
is probably the easiest way.

> I'd even use the new po-file, and merge from the previous one what I 
> could, rather than doing it the other way around, in this way, you'll 
> always be able to keep the structural information of the newest sgml, 
> which you'll need to convert the po back eventually.

That is more or less the way gettext does it as well (under the covers).
It makes sense to me.
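
A minimal Python sketch of merging at the po level: start from the
freshly extracted msgids and pull in whatever the old catalog can
supply. The function name, the tuple layout and the 0.6 cutoff are
assumptions, and difflib from the standard library stands in for
whatever similarity test the real tool ends up using. A near match is
carried over but flagged, which is one way to signal that a paragraph
may have been edited, split or merged.

```python
import difflib


def merge_catalogs(new_msgids, old_catalog, cutoff=0.6):
    """Start from the new msgids and carry over what the old catalog offers.

    Returns a list of (msgid, msgstr, fuzzy) tuples in new-document order.
    """
    merged = []
    old_ids = list(old_catalog)
    for msgid in new_msgids:
        if msgid in old_catalog:
            # Unchanged string: reuse the translation as it stands.
            merged.append((msgid, old_catalog[msgid], False))
            continue
        close = difflib.get_close_matches(msgid, old_ids, n=1, cutoff=cutoff)
        if close:
            # Similar but not identical: keep the old text, flag for review.
            merged.append((msgid, old_catalog[close[0]], True))
        else:
            # Nothing comparable in the old catalog: leave untranslated.
            merged.append((msgid, "", False))
    return merged
```

Old entries with no counterpart in the new document simply drop out,
so the merged catalog always mirrors the structure of the newest
source.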

> >(2) Building the translated documents
> >
> >[To be done. Fiddly, but not rocket science.]
> >
> 
> What about leaving the original document INTACT? But only replacing 
> every string you steal with a unique identifier, e.g.
> 
> [...]
> <title>#1</title>
> 
> <para>#2</para>
> <para>#3</para>
> [...]
> 
> This then becomes a "style-file" for every language. If you reverse with 
> the option --lang=pt, you get the Portuguese sgml, if you do a --lang=es 
> the Spanish one. And whatever the original language was (let's assume 
> en_US), --lang=en_US, then exactly gives you back the original.

My gut reaction is "no", but I can see what you are talking about. I
want to keep the source document _completely_ unchanged, including the
original language it was written in. This enables us to take functional
web pages, third-party documents and so on and translate them
"transparently". However, in reality, this is exactly how I am
intending to create the translated versions -- although the version you
describe above will probably only exist as a virtual document during the
document creation phase for each locale.
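
As a sketch of that virtual document, assuming the "#n" placeholder
style from the quoted example: the skeleton is built in memory from the
untouched source, then filled in once per locale. The function name and
arguments are hypothetical, and a real version would substitute into
the parsed tree, since a plain "#1" could also occur in ordinary text.

```python
import re

PLACEHOLDER = re.compile(r"#(\d+)")


def build_document(skeleton, msgids, catalog):
    """Fill a #n skeleton from a catalog, falling back to the source text."""
    def fill(match):
        msgid = msgids[int(match.group(1)) - 1]
        # A missing or empty translation falls back to the original string.
        return catalog.get(msgid) or msgid

    return PLACEHOLDER.sub(fill, skeleton)
```

With an empty catalog this hands back the original text, which is the
--lang=en_US round trip described above; a partly translated catalog
gives a document that mixes translated and source paragraphs.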

> Here we would have one less thing to worry about. The only thing we'd 
> need is <haha>simply</haha> a properly working and comprehensive parser 
> (that, in addition, can pick up errors in tags).
> 
> As said, I really think that this is the ONLY real issue. :-)

> >ISSUES TO BE RESOLVED
> >----------------------
> >
> >[snip]
> >
> >	- Will there be cases where a translator needs to include arbitrary
> >	  amounts of extra text to make the translation appropriate? This is
> >	  probably fairly hard to do in full generality, so if it's not
> >	  obviously required, it will be omitted.
> >
> Why is adding extra text a problem?

First point: all of this section is me "thinking out loud", so there is
some repeating of my ideas. I must admit that I posted this spec fairly
quickly after coming into work and seeing that the topic had arisen on
the list. What you have been reading is a document I originally wrote
just for myself to order my thoughts.

Second point: the case where extra text will be difficult is when the
extra text includes extra block-level tags (which you discuss below).
In the other cases I can think of, it will not be a problem, though.

> >	- Can the translator include extra block elements in their
> >	translations of strings? Yes, but the merge will simply insert
> >	them -- no sanity checking will be done.
> >
> 
> What is meant with block elements? Tags? Like <para> tags?

When I wrote "block elements", I meant portions of text that were
surrounded by tags like <para>, yes.

> If yes, then I'm utterly opposed to this idea and it should IMO not be
> permitted!  Why? Because I believe that created documents, independent
> of the languages used, should result in the same structure. Well,
> since different documents have different formats and I am
> predominantly (to be exact exclusively) thinking of manuals written in
> SGML DocBook we might be able to have an option to either enable, or,
> what I prefer, disable this.

Here I disagree with you, because I think you have missed an important
corner case. In many texts I have read that are translated from other
languages, the translator has often added useful footnotes to explain
some portion of the text that may not carry over well otherwise. For
example, explaining some idiom or cultural reference that is common in
the source language but may be unknown in the target locale. Sometimes
this problem can be avoided by translating in a different way, but on
occasion this will throw away the flavour of the document at the same
time.

Some Russian texts are full of really nice sayings and proverbs. These
do not always make sense in English, but translating them away would
ruin the text in some sense, so a footnote explaining the saying and its
origin is a good substitute. Now, this is not going to be a really
common problem in technical documentation, but it is not inconceivable.

Another case where extra elements may be required is in a list of steps
to follow. In English, there may be fewer steps than in, say, Chinese,
when some explanation of setting up an alternate input method may also
be required. So the translator will need to add extra <listitem> tags
(and since listitems contain paragraphs, they end up being one or more
separate translatable items in my scheme).

Allowing these necessary additions while prohibiting other more
arbitrary changes seems hard, so if you really see this as an obstacle,
maybe you can set out how you see it working.

> >	- Some "non-translatable" blocks will, unfortunately, contain
> >	  translatable portions. For example, a program listing in Python is
> >	  non-translatable. However, some of the strings it prints out _are_
> >	  translatable. I have no idea how to handle this case yet, except by
> >	  having the whole example substituted by a translation. Maybe such
> >	  blocks can be preceded by a comment saying that they contain
> >	  translatable strings?!
> 
> Who marks what is translatable and what not? Not the writers, or?

I don't really care who does it -- I just want to implement a way it can
be done. This particular issue (translating only portions of otherwise
untranslatable blocks) is one I really do not have any idea about how
to resolve. Right now, there is a basket labelled "too hard" and it
contains this item alone. For everything else, I have at least a first
approximation of a solution.

> At least I don't think you can expect it of them. If you could, we
> could simply introduce a new tag, such as <trans></trans> and simply
> collect everything within these tags. But now that would be too easy!
> ;-)

While I realise (or at least hope) you are partially tongue-in-cheek
here, let me be explicit for everybody else: I want to extract the
translatable portions from the original, unadulterated markup. Unlike
marking code for internationalisation purposes, I do not want to require
special markup in the sources. So all of the work about what tags to
include in the strings to be translated and where to break the strings
and what to translate is to be done by this mythical tool we are
designing.

[By the way, are you on gnome-i18n-list or do you want to be CC'd on any
future significant developments here? I will at least pull together
another design spec based on the feedback so far and knock up some code
to show none of this is impossible this weekend.]

Cheers,
Malcolm

-- 
On the other hand, you have different fingers.
