Re: About translating documents (.xml/.sgml) in GNOME



[Including gnome-i18n again, since I would like comments from
translators. Also including Jeff, since he and I were discussing this
proposal just over a week ago and he hasn't seen the written spec.]

On Wed, Jan 29, 2003 at 07:54:27PM +0000, Simos Xenitellis wrote:
> I am interested in the translation of .xml/.sgml files in GNOME 2.2
> and I would like to ask if:
> 	1. ...the following procedure using sgml2po is ok.

Some discussion on this below (but I have no clear opinion on the
viability, since I am not a translator).

> 	2. ...is there a list of docs to translate first?

Short answer: no. However, for maximum usefulness, my guess would be to
start with the user documentation for popular applications (where you
get to choose what is "popular").

This might be a good opportunity to float another approach to this
problem. I looked at DV and Jonathan's work last year and I didn't think
it entirely met the requirements we might have. So I sat down and wrote
out a specification of what might be in such a tool (a documentation
version of intltool, essentially).

I have attached a fairly different approach to the problem from Simos'
which I would appreciate some feedback on. My main problem is that I am
not a translator, so while I have tried to incorporate what I see as the
main issues, there may be things I don't realise just because I don't
speak Norwegian or something like that.

As an aside, one of the motivations of this approach was to also make it
easy to translate websites without requiring special markup in the HTML.

I have code for some of the following (mostly extracting the strings for
translation and constructing something that is pretty close to a valid
.po file). If people cannot shoot too many holes in my method I will
continue down this path, although that does not help Simos with his
immediate problem of translating _now_.

Probably I should point out that the main problem I see in Simos'
approach is that you will get a lot of unnecessary stuff in the .po
file, which looks like it will interfere with smooth translations. Also,
I could not see how some of the "issues to be resolved" were addressed
by that code (to be fair, the authors mentioned at the time that it was
a prototype of an idea).

Malcolm

-- 
He who laughs last thinks slowest.

A documentation translator -- design document
==============================================
($Date$)

GOAL:
------
Much GNOME documentation exists in the form of DocBook-SGML, DocBook-XML
and (X)HTML documents. Translators are most comfortable working with GNU
gettext-style po files. The aim of this program is to provide an efficient
means of converting documentation source into po files and then merging the
resulting translations back into a document for distribution.

CONSIDERATIONS:
----------------
	- The source documentation should not require special markup. It should
	  be possible to feed in an arbitrary HTML or DocBook-XML page and get
	  out a file of strings to be translated.

	- The format used for translation should be identical to that of the
	  po files generated by tools such as gettext and intltool.

	- Efforts should be made to remove as much markup as possible from the
	  message strings in the po files to ease the burden on translators.
	  However, care also needs to be taken not to remove too much markup
	  and to keep contextually related chunks together.

	- There should be a way for document authors to pass comments into the
	  po files, similarly to the way gettext extracts comments preceding
	  translatable strings currently.

	- The package build system should, ideally, automatically create
	  the documents for each language from the various po files. It should
	  also be possible to only build the docs for one language to decrease
	  the package size (e.g. make DOC_LANG=en docs).

DESIGN IDEAS:
--------------
There are two halves to this program. The first part is extracting all of the
translatable strings into the po files, ready for translation. The second part
is creating translated documents from the po files at build time.

(1) Extracting the strings

	When run, the program is given a list of tags which are considered
	"block elements" (by analogy with the concept in HTML). These tags are
	the ones which do not have significant bearing on the ability of their
	contents to be translated. So they are dropped in the conversion to po
	format. By way of example, in HTML we would consider the following tags
	to be amongst those which are block elements: p, h1, h2, br, hr, table,
	and so on.

	Determining the list of block elements is a separate task (and
	customisable at run time). However, for convenience, the program will
	already understand the block elements from HTML and DocBook 4.1, so
	invoking with the --html or --docbook options will suffice in those
	cases.

	The program can also be passed a list of "do not translate" elements.
	These would include tags like programlisting in DocBook, where it is
	almost always going to be correct to leave the source text
	untranslated.

	Each message string in the po file will then be as large a chunk as is
	possible to extract from the document without including block elements
	or "do not translate" elements. Any additional markup between such
	elements will be included in the message string, since it provides
	information for the translators.

	[NOTE: Typically, a chunk for translation will be a paragraph. This
	seems like a sensible division, since it may lead to a better
	translation to reorganise the sentence structure, but keeping the
	paragraph structure the same should not be too much of a burden, from
	my limited experience of other languages.]
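
	As a rough illustration of the chunking rule above (not the actual
	code), here is a minimal Python sketch. It assumes well-formed XML
	input and uses the standard xml.etree.ElementTree module. The tag
	lists BLOCK and SKIP and the function name extract_chunks are made up
	for this example, and it does not handle block elements nested inside
	inline ones.

```python
import xml.etree.ElementTree as ET

# Hypothetical tag lists for illustration; a real tool would ship fuller ones.
BLOCK = {"article", "section", "para", "title", "itemizedlist", "listitem"}
SKIP = {"programlisting"}  # "do not translate" elements


def extract_chunks(elem, block=BLOCK, skip=SKIP):
    """Return the translatable chunks under elem, in document order.

    A chunk is a maximal run of text and inline markup that contains no
    block element and no "do not translate" element.
    """
    chunks = []
    current = [elem.text or ""]

    def flush():
        # Collapse runs of whitespace and drop empty chunks.
        text = " ".join("".join(current).split())
        if text:
            chunks.append(text)
        current.clear()

    for child in elem:
        if child.tag in skip:
            flush()  # leave the element's contents out entirely
        elif child.tag in block:
            flush()
            chunks.extend(extract_chunks(child, block, skip))
        else:
            # Inline element: keep its markup inside the message string.
            # tostring() would also emit the tail text, so hide it briefly.
            tail, child.tail = child.tail, None
            current.append(ET.tostring(child, encoding="unicode"))
            child.tail = tail
        current.append(child.tail or "")
    flush()
    return chunks
```

	Fed a section holding a title, a para with an inline guilabel, a
	programlisting and a second para, this yields three chunks: the title
	text, the first paragraph with its guilabel markup intact, and the
	second paragraph. The programlisting is skipped.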

	In a normal program internationalisation effort, all strings from all
	files are put into a single po file for each language. However, when
	translating documentation, this approach does not seem efficient.
	Firstly, it is not unreasonable to expect that only a fraction of the
	documentation in any package will be initially translated. Secondly,
	the po files will be much larger than for all but the largest programs,
	since user and developer documentation is often quite lengthy. Typical
	use of this program will therefore place the po files for each document
	in their own directory (probably under the document's source directory,
	or its immediate parent).
	
	[NOTE: It has also been floated that, since the source is usually
	under a C/ directory, alternative translations can go in directories
	labelled by their locale name -- so es/, no/, and so forth. The files
	in these directories would still be .po format files so that translators
	can use their current familiar techniques.]

	In order to allow the documentation writer to provide translation hints
	and context descriptions to the translator, any comment block
	immediately prior to a message string is included in the po file as a
	comment (just as gettext does). So, for example, 

		<!-- Insert an appropriate city for your locale -->
		<para>To predict the weather in Sydney, you would... </para>

	will appear in the po file as

		#. Insert an appropriate city for your locale
		#: weather-applet.xml:45,6
		msgid "To predict the weather in Sydney, you would... "
		msgstr ""
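
	A minimal Python sketch of how such an entry might be written out
	follows. The function name po_entry and its parameters are invented
	for this example, and reading "45,6" as line and column is my
	assumption about the reference format shown above.

```python
def po_entry(msgid, source, line, col, comment=None):
    """Format one po entry with an empty msgstr.

    Backslashes, double quotes and newlines in msgid are escaped as the
    po format requires.
    """
    escaped = (
        msgid.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )
    lines = []
    if comment:
        # One "#." line per line of the author's comment.
        for part in comment.strip().splitlines():
            lines.append("#. " + part.strip())
    lines.append("#: %s:%d,%d" % (source, line, col))
    lines.append('msgid "%s"' % escaped)
    lines.append('msgstr ""')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(po_entry(
        "To predict the weather in Sydney, you would... ",
        "weather-applet.xml", 45, 6,
        comment=" Insert an appropriate city for your locale ",
    ))
```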

	As with gettext and intltool, this program will have an update mode for
	updating existing po files with new and changed strings and a report
	mode for giving statistics about the current translation status for
	each locale. Ideally, this will be run from the top of the source tree
	and will update all relevant .po files, without creating any new ones
	(new ones are created explicitly).
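
	To make the update and report modes a little more concrete, here is
	a small Python sketch under simplifying assumptions: a catalogue is
	just a dict from msgid to msgstr, an empty msgstr means untranslated,
	and there is no fuzzy matching of changed strings (which gettext's
	msgmerge does). The name update_catalog is hypothetical.

```python
def update_catalog(old, new_msgids):
    """Merge existing translations with a freshly extracted msgid list.

    old        -- dict mapping msgid to msgstr from the existing po file
    new_msgids -- msgids extracted from the current document, in order

    Returns (catalog, stats): the updated dict and a small status report.
    """
    # Keep translations for strings that still exist; new strings start empty.
    catalog = {msgid: old.get(msgid, "") for msgid in new_msgids}

    translated = sum(1 for msgstr in catalog.values() if msgstr)
    stats = {
        "translated": translated,
        "untranslated": len(catalog) - translated,
        # Strings that were in the old file but are gone from the document.
        "obsolete": len(set(old) - set(new_msgids)),
    }
    return catalog, stats


if __name__ == "__main__":
    old = {"Hello": "Hei", "Gone": "Borte"}
    catalog, stats = update_catalog(old, ["Hello", "New string"])
    print(catalog)
    print(stats)
```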

(2) Building the translated documents

[To be done. Fiddly, but not rocket science.]


ISSUES TO BE RESOLVED
----------------------

	- Some elements contain attribute values that must be localised. For
	  example, images often need localisation and their location is stored
	  in the src attribute of the img element (in HTML). Maybe the program
	  should accept a list of elements + attributes from which to extract
	  information like this. [Aside: for images, sometimes it is sufficient
	  to identify the images requiring localisation and replace the images
	  themselves at build time. But sometimes, the link is external (e.g. a
	  local mirror, or an image of an appropriate capital city), so the
	  attribute value itself needs to be changed.]

	- Complete localisation may involve adding attributes to certain
	  elements. For example, the list-style attribute of the ul and ol tags
	  in HTML should be set if appropriate (to provide Chinese numbers in
	  Chinese language locales, for example). Probably the style should be
	  set in the stylesheet (see next point) so that the markup in the
	  source document is locale-independent. Need to investigate what other
	  problems like list numberings could arise in this area.

	- The build process should permit each locale to optionally specify a
	  different stylesheet. How?

	- The build process should permit each locale to specify an encoding
	  for the final document. By default the encoding will be UTF-8, but
	  anything supported by iconv is acceptable. How? [Aside: This
	  requirement is important, since East Asian languages, in particular,
	  benefit from using a more traditional encoding than UTF-8 -- the
	  document can shrink to two thirds the size of the equivalent UTF-8
	  version when using Big-5 for Chinese, for example.]

	- Will there be cases where a translator needs to include arbitrary
	  amounts of extra text to make the translation appropriate? This is
	  probably fairly hard to do in full generality, so if it's not
	  obviously required, it will be omitted.

	- Can translators specify extra (locale-specific) files to be included
	  in their localised version? Not too difficult to do.

	- Can the translator include extra block elements in their translations
	  of strings? Yes, but the merge will simply insert them -- no sanity
	  checking will be done.

	- How should translator credits be inserted into the translated version
	  of the document, without adding an unnecessary empty string in the
	  source version? Some special markup required here (maybe a comment
	  with a specific format).

	- Some "non-translatable" blocks will, unfortunately, contain
	  translatable portions. For example, a program listing in Python is
	  non-translatable. However, some of the strings it prints out _are_
	  translatable. I have no idea how to handle this case yet, except by
	  having the whole example substituted by a translation. Maybe such
	  blocks can be preceded by a comment saying that they contain
	  translatable strings?!
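
	On the encoding point above, the size difference is easy to check.
	The Python sketch below compares the UTF-8 and Big-5 byte lengths of
	a short Traditional Chinese sample ("weather forecast", chosen by me
	for illustration). The function name encoded_sizes is hypothetical.
	Common Chinese characters take three bytes in UTF-8 and two in Big-5;
	ASCII markup is one byte in both, so a real document with markup
	would shrink by somewhat less than the pure-text case.

```python
def encoded_sizes(text, legacy="big5"):
    """Return (utf8_length, legacy_length) in bytes for the given text."""
    utf8_bytes = text.encode("utf-8")
    # Re-encode from UTF-8 to the legacy encoding, as iconv would.
    legacy_bytes = utf8_bytes.decode("utf-8").encode(legacy)
    return len(utf8_bytes), len(legacy_bytes)


if __name__ == "__main__":
    utf8_len, big5_len = encoded_sizes("天氣預報")
    print("UTF-8: %d bytes, Big-5: %d bytes" % (utf8_len, big5_len))
```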

