Re: About translating documents (.xml/.sgml) in GNOME



Malcolm,

>>I am interested in the translation of .xml/.sgml files in GNOME 2.2
>>and I would like to ask if:
>>	1. ...the following procedure using sgml2po is ok.
>>    
>>
>
>Some discussion on this below (but I have no clear opinion on the
>viability, since I am not a translator).
>

Thanks for bringing this up again!

>------------------------------------------------------------------------
>
>A documentation translator -- design document
>==============================================
>($Date$)
>
>GOAL:
>------
>Much GNOME documentation exists in the form of DocBook-SGML, DocBook-XML
>and (X)HTML documents. Translators are most comfortable working with GNU
>gettext-style po files. The aim of this program is to provide an efficient
>means of converting documentation source into po files and then merging the
>resulting translations back into a document for distribution.
>

That would be a good thing, given you can do it right.

>CONSIDERATIONS:
>----------------
>[snip]
>

Nothing to add yet.

>DESIGN IDEAS:
>--------------
>There are two halves to this program. The first part is extracting all of the
>translatable strings into the po files, ready for translation. The second part
>is creating translated documents from the po files at build time.
>
>(1) Extracting the strings
>
>	When run, the program is given a list of tags which are considered
>	"block elements" (by analogy with the concept in HTML). These tags are
>	the ones which do not have significant bearing on the ability of their
>	contents to be translated. So they are dropped in the conversion to po
>	format. By way of example, in HTML we would consider the following tags
>	to be amongst those which are block elements: p, h1, h2, br, hr, table,
>	and so on.
>

Hmmm, that's a start.

>	Determining the list of block elements is a separate task (and
>	customisable at run time). However, for convenience, the program will
>	already understand the block elements from HTML and DocBook 4.1, so
>	invoking with the --html or --docbook options will suffice in those
>	cases.
>

It's not just a separate task, it's IMO THE task. Everything else is 
simply a programming effort. And I don't think it's as simple as 
compiling a list. You have to consider quite a number of things, 
starting with nested tags, and nested tags within such nested tags, etc. 
Then there are missing tags, errors in tags, etc. And do not forget 
starting and ending comments, and especially ...<![%T1 [...]]><![%T2 
[...]]><![%T3 [...]]>..., where ... here can be any valid SGML-DocBook, 
to take that as an example. If somebody is able to correctly parse 
SGML-DocBook and gets all the tokens, then we are basically finished. 
Well, at least we are almost finished. :-)

>	The program can also be passed a list of "do not translate" elements.
>	These would include tags like programlisting in DocBook, where it is
>	almost always going to be correct to leave the source text
>	untranslated.
>

I think the --docbook option should know that by default. :-)

>	Each message string in the po file will then be as large a chunk as is
>	possible to extract from the document without including block elements
>	or "do not translate" elements. Any additional markup between such
>	elements will be included in the message string, since they provide
>	information for the translators.
>

I believe inline tags such as <emphasis> should be left in the message 
string, and so should tags such as <command>. The contents of <guilabel> 
and similar tags should IMO be translated automatically, simply by being 
looked up in the po-database (if they are there, that is).
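
A rough sketch of that <guilabel> lookup, assuming the UI translations 
are already loaded into a plain dict (the names ui_catalog and 
pretranslate_labels are made up for illustration) and that a regex is 
good enough for well-formed, non-nested tags:

```python
import re

# Assumption: <guilabel> elements are well formed and never nested.
GUILABEL = re.compile(r"<guilabel>(.*?)</guilabel>")


def pretranslate_labels(text, ui_catalog):
    """Replace each <guilabel> body with its translation from ui_catalog.

    ui_catalog is a hypothetical dict mapping msgid -> msgstr, e.g. loaded
    from the application's own po file. Unknown labels are left untouched.
    """
    def lookup(match):
        label = match.group(1)
        return "<guilabel>%s</guilabel>" % ui_catalog.get(label, label)

    return GUILABEL.sub(lookup, text)


if __name__ == "__main__":
    catalog = {"Open": "Öffnen"}
    print(pretranslate_labels("Click <guilabel>Open</guilabel>.", catalog))
```

Labels missing from the database simply stay in the original language, 
so the translator still sees them.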

>	[NOTE: Typically, a chunk for translation will be a paragraph. This
>	seems like a sensible division, since it may lead to a better
>	translation to reorganise the sentence structure, but keeping the
>	paragraph structure the same should not be too much of a burden, from
>	my limited experience of other languages.]
>

Paragraph seems ok. Just pull any <footnote> elements out of it. I'd 
change them into [1..n] markers, since you have to keep the placement, 
and then put each footnote in its own entry right after the paragraph. 
Of course there are other parts to be considered, such as <entry>, etc., 
which should be separate entries as well. The same goes for index terms 
and titles. If you keep the order, then you should have enough context. 
Another easy solution would be to use the closest index term and a fixed 
URL to provide (in the comments to the entry) a link to the English HTML 
page. This should provide enough context. :-)
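
The footnote handling could look roughly like this. It is only a sketch: 
split_footnotes is a made-up name, and the regex assumes well-formed, 
non-nested <footnote> elements, which a real parser would have to check.

```python
import re

# Assumption: footnotes are well formed and do not contain other footnotes.
FOOTNOTE = re.compile(r"<footnote>(.*?)</footnote>", re.DOTALL)


def split_footnotes(paragraph):
    """Return [paragraph_with_markers, footnote_1, footnote_2, ...].

    Each footnote is replaced in place by a "[n]" marker so its position
    is preserved, and its content becomes a separate entry that follows
    the paragraph.
    """
    notes = []

    def to_marker(match):
        notes.append(match.group(1).strip())
        return "[%d]" % len(notes)

    body = FOOTNOTE.sub(to_marker, paragraph)
    return [body] + notes


if __name__ == "__main__":
    text = "Click OK<footnote><para>Or press Enter.</para></footnote> to continue."
    for entry in split_footnotes(text):
        print(repr(entry))
```

The translator may move the [n] marker within the sentence, and the merge 
step puts the footnote back wherever the marker ends up.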

>	In a normal program internationalisation effort, all strings from all
>	files are put into a single po file for each language. However, when
>	translating documentation, this approach does not seem efficient.
>	Firstly, it is not unreasonable to expect that only a fraction of the
>	documentation in any package will be initially translated. Secondly,
>	the po files will be much larger than for all but the largest programs,
>	since user and developer documentation is often quite lengthy. Typical
>	use of this program will therefore place the po files for each document
>	in their own directory (probably under the document's source directory,
>	or its immediate parent).
>

I'd say one po-file per sgml-file. In this way you can keep the 
file-structure as well, which also makes it easier to put the 
translations back into sgml.

>	[NOTE: It has also been floated that, since the source is usually
>	under a C/ directory, alternative translations can go in directories
>	labelled by their locale name -- so es/, no/, and so forth. The files
>	in these directories would still be .po format files so that translators
>	can use their current familiar techniques.]
>
>	In order to allow the documentation writer to provide translation hints
>	and context descriptions to the translator, any comment block
>	immediately prior to a message string is included in the po file as a
>	comment (just as gettext does). So, for example, 
>
>		<!-- Insert an appropriate city for your locale -->
>		<para>To predict the weather in Sydney, you would... </para>
>
>	will appear in the po file as
>
>		#. Insert an appropriate city for your locale
>		#: weather-applet.xml:45,6
>		msgid "To predict the weather in Sydney, you would... "
>		msgstr ""
>

Nice idea! :-) If you get the writers to comment properly that is! ;-)
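
For reference, writing such an entry is the easy part. A minimal sketch, 
assuming the comment, location and text have already been pulled out of 
the document (po_entry is a hypothetical helper, not part of gettext):

```python
def po_entry(msgid, comment=None, location=None):
    """Format one untranslated po entry.

    comment  -> "#. ..." line (extracted comment for the translator)
    location -> "#: ..." line (where the string came from)
    """
    lines = []
    if comment:
        lines.append("#. " + comment)
    if location:
        lines.append("#: " + location)
    # Escape backslashes first, then quotes and newlines, as po strings require.
    escaped = (
        msgid.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    )
    lines.append('msgid "%s"' % escaped)
    lines.append('msgstr ""')
    return "\n".join(lines)


if __name__ == "__main__":
    print(po_entry(
        "To predict the weather in Sydney, you would... ",
        comment="Insert an appropriate city for your locale",
        location="weather-applet.xml:45,6",
    ))
```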

>	As with gettext and intltool, this program will have an update mode for
>	updating existing po files with new and changed strings and a report
>	mode for giving statistics about the current translation status for
>	each locale. Ideally, this will be run from the top of the source tree
>	and will update all relevant .po files, without creating any new ones
>	(new ones are created explicitly).
>

For each updated file, we should first create the entire po-file again 
and then do the comparisons on the po-s, rather than trying to figure 
out what's been changed in the sgml-file. Which then again shows how 
important it is to get the extraction right and comprehensive.
I'd even use the new po-file and merge from the previous one what I 
could, rather than doing it the other way around. In this way, you'll 
always be able to keep the structural information of the newest sgml, 
which you'll need to convert the po back eventually.
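
A minimal sketch of that merge direction, assuming both po-files have 
already been read into simple Python structures (merge_catalogs is a 
made-up name, and only exact msgid matches are carried over, whereas GNU 
msgmerge also does fuzzy matching):

```python
def merge_catalogs(new_msgids, old_translations):
    """Carry old translations over into a freshly extracted catalogue.

    new_msgids:       msgids in document order, from the newest sgml.
    old_translations: dict msgid -> msgstr from the previous po-file.

    Returns (merged, obsolete): merged keeps the order and structure of
    the new extraction; obsolete holds old entries that no longer occur.
    """
    merged = [(msgid, old_translations.get(msgid, "")) for msgid in new_msgids]
    current = set(new_msgids)
    obsolete = {
        msgid: msgstr
        for msgid, msgstr in old_translations.items()
        if msgid not in current
    }
    return merged, obsolete


if __name__ == "__main__":
    merged, obsolete = merge_catalogs(
        ["Title", "New para"],
        {"Title": "Titel", "Old para": "Alter Absatz"},
    )
    print(merged)
    print(obsolete)
```

Because the new extraction drives the loop, the result always follows the 
newest document; the old file only contributes strings.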

>(2) Building the translated documents
>
>[To be done. Fiddly, but not rocket science.]
>

What about leaving the original document INTACT, and only replacing 
every string you extract with a unique identifier? E.g.

[...]
<title>#1</title>

<para>#2</para>
<para>#3</para>
[...]

This then becomes a "style-file" for every language. If you reverse with 
the option --lang=pt, you get the Portuguese sgml; with --lang=es, the 
Spanish one. And whatever the original language was (let's assume 
en_US), --lang=en_US then gives you back exactly the original.
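
To make the idea concrete, here is a rough sketch of the reverse step. 
The skeleton text, the CATALOGS dict and the fill function are all 
invented for illustration, and it assumes that "#<number>" never occurs 
in the parts of the document that were not extracted.

```python
import re

# Hypothetical skeleton: the original document with every extracted
# string replaced by a numbered placeholder.
SKELETON = "<title>#1</title>\n\n<para>#2</para>\n<para>#3</para>\n"

# Hypothetical per-language catalogues, keyed by placeholder number.
CATALOGS = {
    "en_US": {1: "Weather", 2: "First paragraph.", 3: "Second paragraph."},
    "pt": {1: "Tempo", 2: "Primeiro parágrafo."},
}

PLACEHOLDER = re.compile(r"#(\d+)")


def fill(skeleton, lang, original="en_US"):
    """Rebuild the document for lang from the shared skeleton.

    Strings that are missing or empty in the requested language fall
    back to the original-language text.
    """
    strings = CATALOGS[lang]
    originals = CATALOGS[original]

    def lookup(match):
        key = int(match.group(1))
        return strings.get(key) or originals[key]

    return PLACEHOLDER.sub(lookup, skeleton)


if __name__ == "__main__":
    print(fill(SKELETON, "pt"))
```

With lang set to the original language, fill returns the original 
document unchanged, which makes for an easy round-trip test of the 
extraction.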

Here we would have one less thing to worry about. The only thing we'd 
need is <haha>simply</haha> a properly working and comprehensive parser 
(that, in addition, can pick up errors in tags).

As said, I really think that this is the ONLY real issue. :-)

>ISSUES TO BE RESOLVED
>----------------------
>
>[snip]
>
>	- Will there be cases where a translator needs to include arbitrary
>	  amounts of extra text to make the translation appropriate? This is
>	  probably fairly hard to do in full generality, so if it's not
>	  obviously required, it will be omitted.
>

Why is adding extra text a problem?

>	- Can the translator include extra block elements in their translations
>	  of strings? Yes, but the merge will simply insert them -- no sanity
>	  checking will be done.
>

What is meant by block elements? Tags? Like <para> tags? If yes, then 
I'm utterly opposed to this idea and it should IMO not be permitted! 
Why? Because I believe that the created documents, independent of the 
language used, should all have the same structure. Well, since 
different documents have different formats and I am predominantly (to be 
exact, exclusively) thinking of manuals written in SGML DocBook, we might 
have an option to either enable or, what I prefer, disable this.

>	- Some "non-translatable" blocks will, unfortunately, contain
>	  translatable portions. For example, a program listing in Python is
>	  non-translatable. However, some of the strings it prints out _are_
>	  translatable. I have no idea how to handle this case yet, except by
>	  having the whole example substituted by a translation. Maybe such
>	  blocks can be preceded by a comment saying that they contain
>	  translatable strings?!
>  
>

Who marks what is translatable and what is not? Surely not the writers? 
At least I don't think you can expect it of them. If you could, we could 
simply introduce a new tag, such as <trans></trans>, and collect 
everything within these tags. But now that would be too easy! ;-)

Cheers,
Bernd

-- 
Dr. Bernd R. Groh <bgroh@redhat.com>
Red Hat Asia-Pacific
Disclaimer: http://apac.redhat.com/disclaimer

"Everything we know is an illusion,
 nothing we know is real,
 nothing real we can know,
 illusion is what we call reality."




