Re: About translating documents (.xml/.sgml) in GNOME



Malcolm,

I am interested in the translation of .xml/.sgml files in GNOME 2.2
and I would like to ask if:
	1. ...the following procedure using sgml2po is ok.

Some discussion on this below (but I have no clear opinion on the
viability, since I am not a translator).


Thanks for bringing this up again!

------------------------------------------------------------------------

A documentation translator -- design document
==============================================
($Date$)

GOAL:
------
Much GNOME documentation exists in the form of DocBook-SGML, DocBook-XML
and (X)HTML documents. Translators are most comfortable working with GNU
gettext-style po files. The aim of this program is to provide an efficient
means of converting documentation source into po files and then merging the
resulting translations back into a document for distribution.


That would be a good thing, provided you can do it right.

CONSIDERATIONS:
----------------
[snip]


Nothing to add yet.

DESIGN IDEAS:
--------------
There are two halves to this program. The first part is extracting all of the
translatable strings into the po files, ready for translation. The second part
is creating translated documents from the po files at build time.

(1) Extracting the strings

	When run, the program is given a list of tags which are considered
	"block elements" (by analogy with the concept in HTML). These tags are
	the ones which do not have significant bearing on the ability of their
	contents to be translated. So they are dropped in the conversion to po
	format. By way of example, in HTML we would consider the following tags
	to be amongst those which are block elements: p, h1, h2, br, hr, table,
	and so on.


Hmmm, that's a start.
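
Just to make sure we mean the same thing, here is roughly how I picture the
extraction for the (much easier) well-formed XML case. This is only a Python
sketch, with made-up tag lists rather than the real --html/--docbook defaults,
and with none of the nesting or error handling I worry about below:

import xml.etree.ElementTree as ET

BLOCK_TAGS = {"para", "title", "entry"}      # illustrative only
SKIP_TAGS  = {"programlisting", "screen"}    # "do not translate" elements

def inner_xml(elem):
    # Serialise the element's content, keeping inline markup such as <emphasis>.
    parts = [elem.text or ""]
    for child in elem:
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts)

def chunks(elem, out):
    # Walk the tree; every block element becomes one translatable chunk.
    # Nested block elements and broken markup are not handled here at all.
    if elem.tag in SKIP_TAGS:
        return
    if elem.tag in BLOCK_TAGS:
        text = " ".join(inner_xml(elem).split())
        if text:
            out.append(text)
        return
    for child in elem:
        chunks(child, out)

sample = """<article>
  <title>Weather Applet</title>
  <para>To predict the weather in <emphasis>Sydney</emphasis>, you
  would...</para>
  <programlisting>print("do not translate me")</programlisting>
</article>"""

out = []
chunks(ET.fromstring(sample), out)
for msgid in out:
    print('msgid "%s"\nmsgstr ""\n' % msgid.replace('"', '\\"'))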

	Determining the list of block elements is a separate task (and
	customisable at run time). However, for convenience, the program will
	already understand the block elements from HTML and DocBook 4.1, so
	invoking with the --html or --docbook options will suffice in those
	cases.


It's not just a separate task, it's IMO THE task. Everything else is simply a programming effort. And I don't think it's as simple as compiling a list. You have to consider quite a number of things, starting with nested tags, nested tags within those nested tags, and so on. Then there are missing tags, errors in tags, etc. And do not forget comment start and end markers, and especially marked sections like ...<![%T1 [...]]><![%T2 [...]]><![%T3 [...]]>..., where ... can be any valid SGML-DocBook, to take that as an example. If somebody is able to correctly parse SGML-DocBook and get all the tokens, then we are basically finished. Well, at least we are almost finished. :-)

	The program can also be passed a list of "do not translate" elements.
	These would include tags like programlisting in DocBook, where it is
	almost always going to be correct to leave the source text
	untranslated.


I think the --docbook option should know that by default. :-)

	Each message string in the po file will then be as large a chunk as is
	possible to extract from the document without including block elements
	or "do not translate" elements. Any additional markup between such
	elements will be included in the message string, since it provides
	information for the translators.


I believe <emphasis>, etc. tags should be left in, as should <command>, etc. tags. <guilabel>, etc. tags should IMO be translated automatically, simply by being looked up in the po-database (if they are there, that is).
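
For the <guilabel> case, I am thinking of something as simple as this -- the
catalogue and the German entry are made up; in reality they would come from
the application's own po file:

import re

gui_catalogue = {"Preferences": "Einstellungen"}   # hypothetical entries

def translate_guilabels(chunk):
    def repl(match):
        label = match.group(1)
        # fall back to the original label if it is not in the catalogue
        return "<guilabel>%s</guilabel>" % gui_catalogue.get(label, label)
    return re.sub(r"<guilabel>(.*?)</guilabel>", repl, chunk)

print(translate_guilabels("Open the <guilabel>Preferences</guilabel> dialog."))
# -> Open the <guilabel>Einstellungen</guilabel> dialog.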

	[NOTE: Typically, a chunk for translation will be a paragraph. This
	seems like a sensible division, since it may lead to a better
	translation to reorganise the sentence structure, but keeping the
	paragraph structure the same should not be too much of a burden, from
	my limited experience of other languages.]


Paragraph seems ok. Just get rid of any <footnote>s, etc. I'd change them into [1..n], since you have to keep the placement, and then have the footnote as a separate entry right after. Of course there are other parts to be considered, such as <entry>, etc., which should be separate entries as well. The same goes for index terms and titles. If you keep the order, you should have enough context. Another easy solution would be to use the closest index term and a fixed URL to provide (in the comments to the entry) a link to the English HTML page. This should provide enough context. :-)
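
To illustrate what I mean by the [1..n] placeholders, a toy example (nested
footnotes and the like are ignored):

import re

def split_footnotes(chunk):
    notes = []
    def repl(match):
        # remember the footnote body and leave a numbered marker behind
        notes.append(match.group(1).strip())
        return "[%d]" % len(notes)
    body = re.sub(r"<footnote>(.*?)</footnote>", repl, chunk, flags=re.S)
    return body, notes

body, notes = split_footnotes(
    "The data is fetched hourly<footnote><para>Unless the network is "
    "down.</para></footnote> from the server.")
print(body)    # The data is fetched hourly[1] from the server.
print(notes)   # ['<para>Unless the network is down.</para>']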

	In a normal program internationalisation effort, all strings from all
	files are put into a single po file for each language. However, when
	translating documentation, this approach does not seem efficient.
	Firstly, it is not unreasonable to expect that only a fraction of the
	documentation in any package will be initially translated. Secondly,
	the po files will be much larger than for all but the largest programs,
	since user and developer documentation is often quite lengthy. Typical
	use of this program will therefore place the po files for each document
	in their own directory (probably under the document's source directory,
	or its immediate parent).


I'd say one po-file per sgml-file. In this way you can keep the file-structure as well, which also makes it easier to put the translations back into sgml.

	[NOTE: It has also been floated that, since the source is usually
	under a C/ directory, alternative translations can go in directories
	labelled by their locale name -- so es/, no/, and so forth. The files
	in these directories would still be .po format files so that translators
	can use their current familiar techniques.]
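
If that per-locale layout is used, the mapping from source file to po file
becomes completely mechanical. A tiny sketch, assuming a help/C/ style layout
(the paths are made up):

import os

def po_path(source, locale):
    # e.g. help/C/weather-applet.xml -> help/de/weather-applet.xml.po
    head, name = os.path.split(source)
    return os.path.join(os.path.dirname(head), locale, name + ".po")

print(po_path("help/C/weather-applet.xml", "de"))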

	In order to allow the documentation writer to provide translation hints
	and context descriptions to the translator, any comment block
	immediately prior to a message string is included in the po file as a
	comment (just as gettext does). So, for example,
		<!-- Insert an appropriate city for your locale -->
		<para>To predict the weather in Sydney, you would... </para>

	will appear in the po file as

		#. Insert an appropriate city for your locale
		#: weather-applet.xml:45,6
		msgid "To predict the weather in Sydney, you would... "
		msgstr ""


Nice idea! :-) If you get the writers to comment properly, that is! ;-)
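
For what it's worth, even a crude pairing of comment and paragraph already
gives you that. A regex toy only -- the real tool would of course take this
from the parse tree and fill in the line numbers itself:

import re

PATTERN = re.compile(r"<!--(.*?)-->\s*<para>(.*?)</para>", re.S)

def entries(text, filename):
    for match in PATTERN.finditer(text):
        hint = match.group(1).strip()
        msgid = " ".join(match.group(2).split())
        yield '#. %s\n#: %s\nmsgid "%s"\nmsgstr ""\n' % (hint, filename, msgid)

sample = """<!-- Insert an appropriate city for your locale -->
<para>To predict the weather in Sydney, you would... </para>"""
for entry in entries(sample, "weather-applet.xml"):
    print(entry)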

	As with gettext and intltool, this program will have an update mode for
	updating existing po files with new and changed strings and a report
	mode for giving statistics about the current translation status for
	each locale. Ideally, this will be run from the top of the source tree
	and will update all relevant .po files, without creating any new ones
	(new ones are created explicitly).


For each updated file, we should first create the entire po-file again and then do the comparison on the po files, rather than trying to figure out what has changed in the sgml-file. Which again shows how important it is to get that part right and comprehensive. I'd even use the new po-file and merge into it what I could from the previous one, rather than doing it the other way around; this way you'll always keep the structural information of the newest sgml, which you'll need to convert the po back eventually.
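
In other words, something along these lines (the strings are just
placeholders, and a real tool would probably let msgmerge do the clever
fuzzy-matching part):

def merge(new_entries, old_entries):
    # Start from the freshly extracted entries (newest structure) and copy
    # over any old translation whose msgid is unchanged.  Changed strings
    # simply come back empty and show up as untranslated.
    old = dict(old_entries)                     # msgid -> msgstr
    return [(msgid, old.get(msgid, "")) for msgid, _ in new_entries]

new = [("To predict the weather in Sydney, you would... ", "")]
old = [("To predict the weather in Sydney, you would... ", "Pour la meteo...")]
print(merge(new, old))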

(2) Building the translated documents

[To be done. Fiddly, but not rocket science.]


What about leaving the original document INTACT, but only replacing every string you steal with a unique identifier? E.g.

[...]
<title>#1</title>

<para>#2</para>
<para>#3</para>
[...]

This then becomes a "style-file" for every language. If you reverse with the option --lang=pt, you get the Portuguese sgml; with --lang=es, the Spanish one. And whatever the original language was (let's assume en_US), --lang=en_US gives you back exactly the original.

Here we would have one less thing to worry about. The only thing we'd need is <haha>simply</haha> a properly working and comprehensive parser (that, in addition, can pick up errors in tags).

As said, I really think that this is the ONLY real issue. :-)
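
A trivial sketch of that reversal, with made-up catalogues -- the skeleton
keeps the structure, and --lang only decides which strings fill the
placeholders:

import re

skeleton = "<title>#1</title>\n<para>#2</para>"
catalogues = {
    "en_US": {1: "Weather Applet", 2: "To predict the weather..."},
    "pt":    {1: "Applet de Meteorologia", 2: "Para prever o tempo..."},
}

def build(skeleton, lang):
    strings = catalogues[lang]
    # replace every #n placeholder with the string for that language
    return re.sub(r"#(\d+)", lambda m: strings[int(m.group(1))], skeleton)

print(build(skeleton, "pt"))
print(build(skeleton, "en_US"))   # gives back the original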

ISSUES TO BE RESOLVED
----------------------

[snip]

	- Will there be cases where a translator needs to include arbitrary
	  amounts of extra text to make the translation appropriate? This is
	  probably fairly hard to do in full generality, so if it's not
	  obviously required, it will be omitted.


Why is adding extra text a problem?

	- Can the translator include extra block elements in their translations
	  of strings? Yes, but the merge will simply insert them -- no sanity
	  checking will be done.


What is meant by block elements? Tags? Like <para> tags? If yes, then I'm utterly opposed to this idea, and it should IMO not be permitted! Why? Because I believe that the generated documents, independent of the language used, should have the same structure. Well, since different documents have different formats, and I am predominantly (to be exact, exclusively) thinking of manuals written in SGML DocBook, we might be able to have an option to either enable or, as I prefer, disable this.
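
If it is enabled at all, I'd at least want a cheap sanity check, e.g.
comparing the block tags of the original chunk and the translated one
(again with an illustrative tag list):

import re
from collections import Counter

BLOCK_TAGS = {"para", "title", "entry"}        # illustrative only

def block_profile(chunk):
    # count opening and closing block tags in the chunk
    return Counter(t for t in re.findall(r"</?(\w+)", chunk) if t in BLOCK_TAGS)

original   = "<para>One.</para><para>Two.</para>"
translated = "<para>Eins.</para><para>Zwei.</para><para>Eine mehr!</para>"
print(block_profile(original) == block_profile(translated))   # False -> complain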

	- Some "non-translatable" blocks will, unfortunately, contain
	  translatable portions. For example, a program listing in Python is
	  non-translatable. However, some of the strings it prints out _are_
	  translatable. I have no idea how to handle this case yet, except by
	  having the whole example substituted by a translation. Maybe such
	  blocks can be preceded by a comment saying that they contain
	  translatable strings?!

Who marks what is translatable and what is not? Not the writers, surely? At least I don't think you can expect it of them. If we could, we could simply introduce a new tag, such as <trans></trans>, and simply collect everything within these tags. But that would be too easy! ;-)
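
Harvesting them would then be trivial (with <trans> being entirely
hypothetical, of course):

import re

listing = 'print("<trans>The weather is fine.</trans>")'
print(re.findall(r"<trans>(.*?)</trans>", listing, flags=re.S))
# -> ['The weather is fine.']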

Cheers,
Bernd

--
Dr. Bernd R. Groh <bgroh redhat com>
Red Hat Asia-Pacific
Disclaimer: http://apac.redhat.com/disclaimer

"Everything we know is an illusion,
nothing we know is real,
nothing real we can know,
illusion is what we call reality."




