Re: About translating documents (.xml/.sgml) in GNOME
- From: Malcolm Tredinnick <malcolm commsecure com au>
- To: Simos Xenitellis <simos74 gmx net>
- Cc: GNOME Documentation List <gnome-doc-list gnome org>, gnome-i18n gnome org, jdub perkypants org
- Subject: Re: About translating documents (.xml/.sgml) in GNOME
- Date: Thu, 30 Jan 2003 11:07:59 +1100
[Including gnome-i18n again, since I would like comments from
translators. Also including Jeff, since he and I were discussing this
proposal just over a week ago and he hasn't seen the written spec.]
On Wed, Jan 29, 2003 at 07:54:27PM +0000, Simos Xenitellis wrote:
> I am interested in the translation of .xml/.sgml files in GNOME 2.2
> and I would like to ask if:
> 1. ...the following procedure using sgml2po is ok.
Some discussion on this below (but I have no clear opinion on the
viability, since I am not a translator).
> 2. ...is there a list of docs to translate first?
Short answer: no. However, for maximum usefulness, my guess would be to
start with the user documentation for popular applications (where you
get to choose what is "popular").
This might be a good opportunity to float another approach to this
problem. I looked at DV and Jonathan's work last year and I didn't think
it entirely met the requirements we might have. So I sat down and wrote
out a specification of what might be in such a tool (a documentation
version of intltool, essentially).
I have attached a fairly different approach to the problem from Simos'
which I would appreciate some feedback on. My main problem is that I am
not a translator, so while I have tried to incorporate what I see as the
main issues, there may be things I don't realise just because I don't
speak Norwegian or something like that.
As an aside, one of the motivations of this approach was to also make it
easy to translate websites without requiring special markup in the HTML.
I have code for some of the following (mostly extracting the strings for
translation and constructing something that is pretty close to a valid
.po file). If people cannot shoot too many holes in my method I will
continue down this path, although that does not help Simos with his
immediate problem of translating _now_.
Probably I should point out that the main problem I see in Simos'
approach is that you will get a lot of unnecessary stuff in the .po
file, which looks like it will interfere with smooth translations. Also,
I could not see how some of the "issues to be resolved" were addressed
by that code (to be fair, the authors mentioned at the time that it was
a prototype of an idea).
Malcolm
--
He who laughs last thinks slowest.
A documentation translator -- design document
==============================================
($Date$)
GOAL:
------
Much GNOME documentation exists in the form of DocBook-SGML, DocBook-XML
and (X)HTML documents. Translators are most comfortable working with GNU
gettext-style po files. The aim of this program is to provide an efficient
means of converting documentation source into po files and then merging the
resulting translations back into a document for distribution.
CONSIDERATIONS:
----------------
- The source documentation should not require special markup. It should
be possible to feed in an arbitrary HTML or DocBook-XML page and get
out a file of strings to be translated.
- The format used for translation should be identical to that of the
po files generated by tools such as gettext and intltool.
- Efforts should be made to remove as much markup as possible from the
message strings in the po files to ease the burden on translators.
However, care also needs to be taken not to remove too much markup
and to keep contextually related chunks together.
- There should be a way for document authors to pass comments into the
po files, similarly to the way gettext extracts comments preceding
translatable strings currently.
- The package build system should, ideally, automatically create
the documents for each language from the various po files. It should
also be possible to only build the docs for one language to decrease
the package size (e.g. make DOC_LANG=en docs).
DESIGN IDEAS:
--------------
There are two halves to this program. The first part is extracting all of the
translatable strings into the po files, ready for translation. The second part
is creating translated documents from the po files at build time.
(1) Extracting the strings
When run, the program is given a list of tags which are considered
"block elements" (by analogy with the concept in HTML). These tags are
the ones which do not have significant bearing on the ability of their
contents to be translated. So they are dropped in the conversion to po
format. By way of example, in HTML we would consider the following tags
to be amongst those which are block elements: p, h1, h2, br, hr, table,
and so on.
Determining the list of block elements is a separate task (and
customisable at run time). However, for convenience, the program will
already understand the block elements from HTML and DocBook 4.1, so
invoking with the --html or --docbook options will suffice in those
cases.
The program can also be passed a list of "do not translate" elements.
These would include tags like programlisting in DocBook, where it is
almost always going to be correct to leave the source text
untranslated.
Each message string in the po file will then be as large a chunk as is
possible to extract from the document without including block elements
or "do not translate" elements. Any additional markup between such
elements will be included in the message string, since they provide
information for the translators.
[NOTE: Typically, a chunk for translation will be a paragraph. This
seems like a sensible division: reorganising the sentence structure
within a paragraph may lead to a better translation, while keeping the
paragraph structure the same should not be too much of a burden, from
my limited experience of other languages.]
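As a rough illustration of the chunking rule above, here is a minimal
Python sketch. It is not the actual tool: the function name
extract_chunks, the tag sets, and the choice of xml.etree are all
assumptions made for the example, and it assumes well-formed XML in
which inline elements never contain block elements.

```python
import xml.etree.ElementTree as ET

# Hypothetical tag lists for the example; the real lists would come
# from --html / --docbook or from the command line.
BLOCK = {"article", "section", "title", "para"}
SKIP = {"programlisting"}  # "do not translate" elements


def extract_chunks(elem, block=BLOCK, skip=SKIP):
    """Return the message strings found under elem, in document order."""
    chunks = []
    run = [elem.text or ""]  # pieces of the chunk currently being built

    def flush():
        # Collapse whitespace and emit the current run if it is not empty.
        text = " ".join("".join(run).split())
        if text:
            chunks.append(text)
        run.clear()

    for child in elem:
        if child.tag in block or child.tag in skip:
            # A block or skipped element ends the current chunk.
            flush()
            if child.tag in block:
                chunks.extend(extract_chunks(child, block, skip))
            run.append(child.tail or "")
        else:
            # Inline element: keep its markup in the message string.
            # ET.tostring() also serialises the text that follows it.
            run.append(ET.tostring(child, encoding="unicode"))
    flush()
    return chunks


doc = ET.fromstring(
    "<article><title>Weather</title>"
    "<para>Click <guibutton>OK</guibutton> to continue.</para>"
    "<programlisting>print 'x'</programlisting>"
    "<para>Done.</para></article>"
)
for chunk in extract_chunks(doc):
    print(chunk)
```

The block tags themselves never reach the output, the programlisting is
skipped entirely, and the inline guibutton markup stays inside its
paragraph's message string.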
In a normal program internationalisation effort, all strings from all
files are put into a single po file for each language. However, when
translating documentation, this approach does not seem efficient.
Firstly, it is not unreasonable to expect that only a fraction of the
documentation in any package will be initially translated. Secondly,
the po files will be much larger than for all but the largest programs,
since user and developer documentation is often quite lengthy. Typical
use of this program will therefore place the po files for each document
in their own directory (probably under the document's source directory,
or its immediate parent).
[NOTE: It has also been floated that, since the source is usually
under a C/ directory, alternative translations can go in directories
labelled by their locale name -- so es/, no/, and so forth. The files
in these directories would still be .po format files so that translators
can use their current familiar techniques.]
In order to allow the documentation writer to provide translation hints
and context descriptions to the translator, any comment block
immediately prior to a message string is included in the po file as a
comment (just as gettext does). So, for example,
<!-- Insert an appropriate city for your locale -->
<para>To predict the weather in Sydney, you would... </para>
will appear in the po file as
#. Insert an appropriate city for your locale
#: weather-applet.xml:45,6
msgid "To predict the weather in Sydney, you would... "
msgstr ""
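A small sketch of how such an entry might be written out, assuming the
comment text and the source position have already been collected during
extraction. The helper names po_escape and format_po_entry are
hypothetical, and the "line,column" location form simply mirrors the
example above.

```python
def po_escape(text):
    """Escape backslashes, double quotes and newlines for a po string."""
    return (text.replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n"))


def format_po_entry(msgid, source, line, column, comment=None):
    """Build one untranslated po entry as a string."""
    lines = []
    if comment:
        # Each line of the author's comment becomes a "#." comment.
        for part in comment.strip().splitlines():
            lines.append("#. " + part.strip())
    lines.append(f"#: {source}:{line},{column}")
    lines.append(f'msgid "{po_escape(msgid)}"')
    lines.append('msgstr ""')
    return "\n".join(lines) + "\n"


print(format_po_entry(
    "To predict the weather in Sydney, you would... ",
    "weather-applet.xml", 45, 6,
    comment="Insert an appropriate city for your locale",
))
```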
As with gettext and intltool, this program will have an update mode for
updating existing po files with new and changed strings and a report
mode for giving statistics about the current translation status for
each locale. Ideally, this will be run from the top of the source tree
and will update all relevant .po files, without creating any new ones
(new ones are created explicitly).
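For the report mode, something as simple as the following sketch may be
enough to start with. The function po_stats is hypothetical; it assumes
entries without plural forms, skips the header entry (the one with an
empty msgid), and counts fuzzy entries as translated, which a real tool
would probably want to report separately.

```python
def po_stats(po_text):
    """Return (translated, untranslated) counts for the text of a po file."""
    translated = untranslated = 0
    msgid = msgstr = None
    target = None  # which string a continuation line belongs to

    def finish():
        nonlocal translated, untranslated
        if msgid:  # skips "no entry yet" and the empty-msgid header
            if msgstr:
                translated += 1
            else:
                untranslated += 1

    for raw in po_text.splitlines():
        line = raw.strip()
        if line.startswith("msgid "):
            finish()  # the previous entry is now complete
            msgid, msgstr, target = line[6:].strip()[1:-1], "", "id"
        elif line.startswith("msgstr "):
            msgstr, target = line[7:].strip()[1:-1], "str"
        elif line.startswith('"') and target:
            # Continuation line of a multi-line string.
            if target == "id":
                msgid += line[1:-1]
            else:
                msgstr += line[1:-1]
    finish()
    return translated, untranslated


sample = (
    'msgid ""\n'
    'msgstr ""\n'
    '"Content-Type: text/plain; charset=UTF-8\\n"\n'
    '\n'
    'msgid "Hello"\n'
    'msgstr "Hei"\n'
    '\n'
    'msgid "Goodbye"\n'
    'msgstr ""\n'
)
done, todo = po_stats(sample)
print(f"{done} translated, {todo} untranslated")
```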
(2) Building the translated documents
[To be done. Fiddly, but not rocket science.]
ISSUES TO BE RESOLVED
----------------------
- Some elements contain attribute values that must be localised. For
example, images often need localisation and their location is stored
in the src attribute of the img element (in HTML). Maybe the program
should accept a list of elements + attributes from which to extract
information like this. [Aside: for images, sometimes it is sufficient
to identify the images requiring localisation and replace the images
themselves at build time. But sometimes, the link is external (e.g. a
local mirror, or an image of an appropriate capital city), so the
attribute value itself needs to be changed.]
- Complete localisation may involve adding attributes to certain
elements. For example, the list-style attribute of the ul and ol tags
in HTML should be set if appropriate (to provide Chinese numbers in
Chinese language locales, for example). Probably the style should be
set in the stylesheet (see next point) so that the markup in the
source document is locale-independent. Need to investigate what other
problems like list numberings could arise in this area.
- The build process should permit each locale to optionally specify a
different stylesheet. How?
- The build process should permit each locale to specify an encoding
for the final document. By default the encoding will be UTF-8, but
anything supported by iconv is acceptable. How? [Aside: This
requirement is important, since East Asian languages, in particular,
benefit from using a more traditional encoding than UTF-8 -- the
document can shrink to two thirds the size of the equivalent UTF-8
version when using Big-5 for Chinese, for example.]
- Will there be cases where a translator needs to include arbitrary
amounts of extra text to make the translation appropriate? This is
probably fairly hard to do in full generality, so if it's not
obviously required, it will be omitted.
- Can translators specify extra (locale-specific) files to be included
in their localised version? Not too difficult to do.
- Can the translator include extra block elements in their translations
of strings? Yes, but the merge will simply insert them -- no sanity
checking will be done.
- How should translator credits be inserted into the translated version
of the document, without adding an unnecessary empty string in the
source version? Some special markup required here (maybe a comment
with a specific format).
- Some "non-translatable" blocks will, unfortunately, contain
translatable portions. For example, a program listing in Python is
non-translatable. However, some of the strings it prints out _are_
translatable. I have no idea how to handle this case yet, except by
having the whole example substituted by a translation. Maybe such
blocks can be preceded by a comment saying that they contain
translatable strings?!
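On the encoding point in the list above, the size difference is easy to
check for pure Chinese text. The sketch below is only an illustration:
the helper encoded_sizes and the sample string are made up for the
example, and a real document also contains ASCII markup, which takes
one byte per character in both encodings, so the overall saving will be
smaller than for the text alone.

```python
def encoded_sizes(text, encodings=("utf-8", "big5")):
    """Return the encoded length in bytes of text for each encoding."""
    return {name: len(text.encode(name)) for name in encodings}


# Seven Traditional Chinese characters ("translation of Chinese documents").
sample = "中文文件的翻譯"
sizes = encoded_sizes(sample)
print(sizes)
print(sizes["big5"] / sizes["utf-8"])
```

Each of these characters takes three bytes in UTF-8 and two in Big-5,
which is where the "two thirds" figure comes from.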