Re: About translating documents (.xml/.sgml) in GNOME
- From: Bernd Groh <bgroh redhat com>
- To: Malcolm Tredinnick <malcolm commsecure com au>
- Cc: Simos Xenitellis <simos74 gmx net>, GNOME Documentation List <gnome-doc-list gnome org>, gnome-i18n gnome org, jdub perkypants org
- Subject: Re: About translating documents (.xml/.sgml) in GNOME
- Date: Fri, 31 Jan 2003 10:46:26 +0100
Malcolm,
I am interested in the translation of .xml/.sgml files in GNOME 2.2
and I would like to ask if:
1. ...the following procedure using sgml2po is ok.
Some discussion on this below (but I have no clear opinion on the
viability, since I am not a translator).
Thanks for bringing this up again!
------------------------------------------------------------------------
A documentation translator -- design document
==============================================
($Date$)
GOAL:
------
Much GNOME documentation exists in the form of DocBook-SGML, DocBook-XML
and (X)HTML documents. Translators are most comfortable working with GNU
gettext-style po files. The aim of this program is to provide an efficient
means of converting documentation source into po files and then merging the
resulting translations back into a document for distribution.
That would be a good thing, provided you can do it right.
CONSIDERATIONS:
----------------
[snip]
Nothing to add yet.
DESIGN IDEAS:
--------------
There are two halves to this program. The first part is extracting all of the
translatable strings into the po files, ready for translation. The second part
is creating translated documents from the po files at build time.
(1) Extracting the strings
When run, the program is given a list of tags which are considered
"block elements" (by analogy with the concept in HTML). These tags are
the ones which do not have significant bearing on the ability of their
contents to be translated. So they are dropped in the conversion to po
format. By way of example, in HTML we would consider the following tags
to be amongst those which are block elements: p, h1, h2, br, hr, table,
and so on.
Hmmm, that's a start.
Determining the list of block elements is a separate task (and
customisable at run time). However, for convenience, the program will
already understand the block elements from HTML and DocBook 4.1, so
invoking with the --html or --docbook options will suffice in those
cases.
It's not just a separate task, it's IMO THE task. Everything else is
simply a programming effort. And I don't think it's as simple as
compiling a list. You have to consider quite a number of things,
starting with nested tags, and nested tags within such nested tags, etc.
Then there are missing tags, errors in tags, etc. And do not forget
starting and ending comments, and especially ...<![%T1 [...]]><![%T2
[...]]><![%T3 [...]]>..., where ... here can be any valid SGML-DocBook,
to take that as an example. If somebody is able to correctly parse
SGML-DocBook and get all the tokens, then we are basically finished.
Well, at least we are almost finished. :-)
The program can also be passed a list of "do not translate" elements.
These would include tags like programlisting in DocBook, where it is
almost always going to be correct to leave the source text
untranslated.
I think the --docbook option should know that by default. :-)
Each message string in the po file will then be as large a chunk as is
possible to extract from the document without including block elements
or "do not translate" elements. Any additional markup between such
elements will be included in the message string, since they provide
information for the translators.
I believe <emphasis>, etc. tags should be left in, as should <command>,
etc. tags. <guilabel>, etc. tags should IMO be translated
automatically, simply being looked up in the po database (if it's
there, that is).
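The extraction rules discussed so far -- innermost block elements become msgids, inline tags such as <emphasis> stay in the string, "do not translate" elements are skipped -- could be sketched roughly as below. This is illustration only: it uses Python's stdlib XML parser as a stand-in (there is no SGML-DocBook parser in the stdlib, which is exactly the hard part noted above), and the tag lists are assumptions.

```python
# Hypothetical sketch of the extraction pass; XML stands in for SGML.
import xml.etree.ElementTree as ET

BLOCK = {"article", "sect1", "title", "para"}   # assumed block elements
NO_TRANSLATE = {"programlisting"}               # assumed untranslated tags

def inner_markup(elem):
    """Serialise an element's content, keeping inline tags like <emphasis>."""
    parts = [elem.text or ""]
    for child in elem:
        # tostring() includes the child's tail text, so inline markup
        # stays embedded in the surrounding sentence.
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts).strip()

def extract(elem, out):
    """Collect one msgid per innermost block element."""
    if elem.tag in NO_TRANSLATE:
        return
    has_block_child = any(c.tag in BLOCK for c in elem.iter() if c is not elem)
    if elem.tag in BLOCK and not has_block_child:
        text = inner_markup(elem)
        if text:
            out.append(text)
        return
    for child in elem:
        extract(child, out)

doc = ET.fromstring(
    "<article><title>Weather</title>"
    "<para>To predict the weather in <emphasis>Sydney</emphasis>...</para>"
    "<programlisting>print('not translated')</programlisting></article>"
)
msgids = []
extract(doc, msgids)
# The <programlisting> is skipped; <title> and <para> each become one msgid.
```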
[NOTE: Typically, a chunk for translation will be a paragraph. This
seems like a sensible division, since it may lead to a better
translation to reorganise the sentence structure, but keeping the
paragraph structure the same should not be too much of a burden, from
my limited experience of other languages.]
Paragraph seems ok. Just get rid of any <footnote>s, etc. I'd
change them into [1..n], since you have to keep the placement and then
have the footnote separate in the next entry. Of course there are other
parts to be considered, such as <entry>, etc., which should be separate
entries as well. Same goes for index terms and titles. If you keep the
order, then you should have enough context. Another easy solution would
be to use the closest index term and a fixed URL to provide (in the
comments to the entry) a link to the English HTML page. This should
provide enough context. :-)
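The footnote idea above could work roughly like this -- naive regex splitting, for illustration only; a real implementation would do this via the parser:

```python
# Sketch: replace each <footnote> with a numbered [n] marker and emit
# the footnote body as its own, separate entry.
import re

def pull_footnotes(paragraph):
    notes = []
    def repl(match):
        notes.append(match.group(1).strip())
        return "[%d]" % len(notes)
    body = re.sub(r"<footnote>(.*?)</footnote>", repl, paragraph, flags=re.S)
    return body, notes

body, notes = pull_footnotes(
    "<para>GNOME ships docs<footnote><para>Mostly DocBook.</para></footnote>"
    " in several formats.</para>"
)
# body keeps the placement as [1]; notes holds the footnote text separately.
```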
In a normal program internationalisation effort, all strings from all
files are put into a single po file for each language. However, when
translating documentation, this approach does not seem efficient.
Firstly, it is not unreasonable to expect that only a fraction of the
documentation in any package will be initially translated. Secondly,
the po files will be much larger than for all but the largest programs,
since user and developer documentation is often quite lengthy. Typical
use of this program will therefore place the po files for each document
in their own directory (probably under the document's source directory,
or its immediate parent).
I'd say one po-file per sgml-file. In this way you can keep the
file-structure as well, which also makes it easier to put the
translations back into sgml.
[NOTE: It has also been floated that, since the source is usually
under a C/ directory, alternative translations can go in directories
labelled by their locale name -- so es/, no/, and so forth. The files
in these directories would still be .po format files so that translators
can use their current familiar techniques.]
In order to allow the documentation writer to provide translation hints
and context descriptions to the translator, any comment block
immediately prior to a message string is included in the po file as a
comment (just as gettext does). So, for example,
<!-- Insert an appropriate city for your locale -->
<para>To predict the weather in Sydney, you would... </para>
will appear in the po file as
#. Insert an appropriate city for your locale
#: weather-applet.xml:45,6
msgid "To predict the weather in Sydney, you would... "
msgstr ""
Nice idea! :-) If you get the writers to comment properly that is! ;-)
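Emitting the entry shown above is then mechanical; a minimal sketch, with the function name made up:

```python
# Hypothetical helper that formats one po entry from a chunk, its
# source location, and any comment found immediately before it.
def po_entry(comment, filename, line, msgid):
    out = []
    if comment:
        out.append("#. %s" % comment)            # translator hint
    out.append("#: %s:%s" % (filename, line))    # source reference
    out.append('msgid "%s"' % msgid.replace('"', '\\"'))
    out.append('msgstr ""')                      # empty, to be translated
    return "\n".join(out)

entry = po_entry("Insert an appropriate city for your locale",
                 "weather-applet.xml", "45,6",
                 "To predict the weather in Sydney, you would... ")
```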
As with gettext and intltool, this program will have an update mode for
updating existing po files with new and changed strings and a report
mode for giving statistics about the current translation status for
each locale. Ideally, this will be run from the top of the source tree
and will update all relevant .po files, without creating any new ones
(new ones are created explicitly).
For each updated file, we should first create the entire po-file
again and then do the comparisons on the po files, rather than trying to
figure out what's been changed in the sgml-file. Which then again shows how
important it is to get that one right and comprehensive.
I'd even use the new po-file, and merge from the previous one what I
could, rather than doing it the other way around. In this way, you'll
always be able to keep the structural information of the newest sgml,
which you'll eventually need to convert the po back.
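That merge direction can be sketched as: regenerate the po from the newest sgml, then copy over old translations whose msgid still matches. (Real tools like msgmerge additionally do fuzzy matching on near-identical strings; this sketch does exact matches only.)

```python
# Sketch: keep the structure of the freshly generated po, reuse what
# we can from the previous translation.
def merge(new_msgids, old_po):
    """old_po maps msgid -> msgstr from the previous translation."""
    return {msgid: old_po.get(msgid, "") for msgid in new_msgids}

old = {"Hello world.": "Hallo Welt.",
       "Removed text.": "Entfernter Text."}
new = ["Hello world.", "A brand new paragraph."]
merged = merge(new, old)
# Unchanged strings keep their translation; new ones come out empty;
# strings dropped from the sgml disappear from the po.
```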
(2) Building the translated documents
[To be done. Fiddly, but not rocket science.]
What about leaving the original document INTACT? But only replacing
every string you steal with a unique identifier, e.g.
[...]
<title>#1</title>
<para>#2</para>
<para>#3</para>
[...]
This then becomes a "style-file" for every language. If you reverse with
the option --lang=pt, you get the Portuguese sgml; if you do a --lang=es,
the Spanish one. And whatever the original language was (let's assume
en_US), --lang=en_US then gives you back exactly the original.
Here we would have one less thing to worry about. The only thing we'd
need is <haha>simply</haha> a properly working and comprehensive parser
(that, in addition, can pick up errors in tags).
As said, I really think that this is the ONLY real issue. :-)
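The style-file idea might look like this in miniature -- the skeleton keeps the original structure with #N placeholders, and each language fills them back in from its own po entries. The skeleton and string tables here are invented for illustration:

```python
# Sketch of reversing a "style-file" skeleton for a given --lang.
import re

skeleton = "<title>#1</title>\n<para>#2</para>"
strings = {
    "en_US": {1: "Weather", 2: "It rains."},   # the original language
    "pt":    {1: "Tempo",   2: "Chove."},      # from the pt po-file
}

def fill(skeleton, lang):
    """Substitute every #N placeholder with that language's string N."""
    return re.sub(r"#(\d+)",
                  lambda m: strings[lang][int(m.group(1))],
                  skeleton)

# fill(skeleton, "en_US") reproduces the original document exactly;
# fill(skeleton, "pt") produces the Portuguese one with identical structure.
```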
ISSUES TO BE RESOLVED
----------------------
[snip]
- Will there be cases where a translator needs to include arbitrary
amounts of extra text to make the translation appropriate? This is
probably fairly hard to do in full generality, so if it's not
obviously required, it will be omitted.
Why is adding extra text a problem?
- Can the translator include extra block elements in their translations
of strings? Yes, but the merge will simply insert them -- no sanity
checking will be done.
What is meant by block elements? Tags? Like <para> tags? If yes, then
I'm utterly opposed to this idea and it should IMO not be permitted!
Why? Because I believe that created documents, independent of the
language used, should result in the same structure. Well, since
different documents have different formats, and I am predominantly (to be
exact, exclusively) thinking of manuals written in SGML DocBook, we might
be able to have an option to either enable or, what I prefer, disable this.
- Some "non-translatable" blocks will, unfortunately, contain
translatable portions. For example, a program listing in Python is
non-translatable. However, some of the strings it prints out _are_
translatable. I have no idea how to handle this case yet, except by
having the whole example substituted by a translation. Maybe such
blocks can be preceded by a comment saying that they contain
translatable strings?!
Who marks what is translatable and what not? Not the writers, surely? At
least I don't think you can expect it of them. If you could, we could
simply introduce a new tag, such as <trans></trans>, and simply collect
everything within these tags. But now that would be too easy! ;-)
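For what it's worth, collecting such a hypothetical <trans> markup from an otherwise untranslated block really would be the easy part; a sketch:

```python
# Sketch: inside a non-translatable listing, only spans wrapped in the
# (hypothetical) <trans> tag are collected for translation.
import re

listing = (
    'print("<trans>Today it will rain.</trans>")\n'
    'log("<trans>Forecast ready.</trans>")\n'
)
translatable = re.findall(r"<trans>(.*?)</trans>", listing)
# Everything else in the listing is left exactly as written.
```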
Cheers,
Bernd
--
Dr. Bernd R. Groh <bgroh redhat com>
Red Hat Asia-Pacific
Disclaimer: http://apac.redhat.com/disclaimer
"Everything we know is an illusion,
nothing we know is real,
nothing real we can know,
illusion is what we call reality."