Re: Use of American/British English



Hi Christian,

- You are right about one thing: adding more entries to a database increases database size.

- You are wrong about one thing: text searches in a database are not performed as you described.

When dealing with huge numbers of entries in a TM database, good translation tools don't review each and every entry in the database when searching for a translation. There are special techniques for generating fuzzy indexes, and those indexes are used to find exact/fuzzy matches.

Suppose you have a database with 2,000,000 entries and you are looking for a short sentence that contains 5 words. According to your description, a translation tool should start comparing the source sentence against all entries and either stop when a perfect match is found or give up after 2,000,000 comparisons. That's slow and inefficient.
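
In code, the brute-force approach you describe boils down to something like this rough sketch (the names are made up for illustration, not taken from any particular tool):

    # Naive lookup: walk every (source, target) pair in the TM until an
    # exact match turns up, or the end of the database is reached.
    def naive_lookup(query, tm_entries):
        for source, target in tm_entries:   # up to 2,000,000 iterations
            if source == query:             # stop at the first perfect match
                return target
        return None                         # no match: every entry was compared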

In the real world, translation tools calculate a special index for each of those 2,000,000 entries and store all the values in the database. When you want to search for something, the index for your short sentence is calculated and a special algorithm is used to check whether there are perfect matches or matches above a given similarity level (usually 70%). After the index lookup is done, only the relevant records are fetched and compared.
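
Here is a minimal sketch of the idea, assuming a plain word-based inverted index and a crude overlap score; real tools use far more sophisticated fuzzy indexes, and every name below is invented for the example:

    from collections import defaultdict

    def words(text):
        """The index key used in this sketch: the lower-cased words of a sentence."""
        return set(text.lower().split())

    def build_index(tm_entries):
        """Map each word to the TM entries containing it. Built once, stored with the TM."""
        index = defaultdict(set)
        for i, (source, _target) in enumerate(tm_entries):
            for w in words(source):
                index[w].add(i)
        return index

    def fuzzy_lookup(query, tm_entries, index, threshold=0.70):
        """Fetch only entries sharing words with the query, then score just those."""
        query_words = words(query)
        candidates = set()
        for w in query_words:
            candidates |= index.get(w, set())
        matches = []
        for i in candidates:
            source, target = tm_entries[i]
            entry_words = words(source)
            # crude word-overlap similarity, standing in for a real fuzzy score
            score = len(query_words & entry_words) / len(query_words | entry_words)
            if score >= threshold:
                matches.append((score, source, target))
        return sorted(matches, reverse=True)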

To put the numbers differently: if you have 2,000,000 entries in a database and there are only 10 matches for a given sentence, a decent translation tool may perform just 15-20 comparisons to retrieve the desired matches.
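
A toy run of the sketch above (with an absurdly small TM, of course) shows the point: every entry is indexed, but only the candidates sharing words with the query are scored, and only those above the threshold are returned.

    tm = [
        ("Choose a custom color for the folder",  "translation 1"),
        ("Choose a custom colour for the folder", "translation 2"),
        ("Open the selected file",                "translation 3"),
    ]
    idx = build_index(tm)
    # All three entries share at least one word with the query, so all three
    # are scored, but only the exact match (1.0) and the near match (0.75)
    # survive the 70% threshold.
    print(fuzzy_lookup("Choose a custom colour for the folder", tm, idx))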

Regards,
Rodolfo

On Tue, 2004-02-24 at 22:02, Christian Rose wrote:
Considering that every message entry in existing translation memories
that happens to contain any en_US-typical word or spelling would need to
be duplicated with a corresponding message entry with the same word or
sentence and context in the en_GB spelling, it would not be a matter
of a "few differing words". We're talking about a significant increase
in message database size here.

And then we haven't even considered message entries that would happen to
have *two* en_US-typical words, like, say, the message "Customize
Colours". Then we would need 4 new complete database entries, one for
each combination, to be able to compensate for both en_US and en_GB
occurring in the original source messages. But hopefully we can assume
that developers wouldn't mix en_US and en_GB spellings in the same
message. Or can we? I can say for sure that I mix both spellings all the
time when I write English, simply because I cannot always distinguish
between them and know which spelling is which. And experience shows that
many developers who aren't native English speakers have the same
problem.

Nevertheless, I think we have established that we would need a drastic
increase in translation memory sizes for them to be as useful as before,
and a corresponding increase in computer horsepower to be able to cope
with it, if we want to allow both en_US and en_GB. Translation memories
aren't the smallest databases, and if you take a translation like
Nautilus with 1,492 messages that need translation and compare it with
the huge number of entries in a translation memory, you can get an
estimate of the number of database comparisons needed and the amount of
processing involved (yes, 1,492 × sizeof(translation_memory)). Divide it
by two, since we stop searching once we've found an exact match and
assume that on average half of the entries have to be searched.

So what do you say? Even if the translation memory only has a few million
entries, or even a few hundred thousand, say 200,000, a fast computer
does those 149,200,000 database comparisons in no time.
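
(Spelled out as a quick calculation, purely to restate the estimate above:)

    messages   = 1492     # Nautilus messages needing translation
    tm_entries = 200000   # entries in the example translation memory
    # exact-match search, stopping halfway through the TM on average
    print(messages * tm_entries // 2)   # 149,200,000 comparisons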

But then we've only considered exact matches. Translation memory systems
usually do a lot more, also picking the most similar matches from the
database if no exact match is available, which is also the most common
case -- on average you get perhaps 8% exact matches when creating a new
translation from scratch using a big TM. A similar match is then flagged
(fuzzy in the po case) so that translators know they need to verify/fix
it manually. But if you don't find an exact match you need to go through
all of the database.

So, assuming that 92% of the Nautilus messages need the full 200,000
string comparisons, that makes 274,528,000 string comparisons. 8% will
find an exact match after on average 11,936,000 database entry
comparisons. So that's an awful lot of comparisons for producing the
start of a Nautilus translation using our translation memory. Some
string comparisons are trivial, some not.
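
(For reference, those figures fall out of this little calculation, using the 200,000-entry TM and the 8% exact-match rate assumed above:)

    messages    = 1492
    tm_entries  = 200000
    fuzzy_share = 0.92   # messages with no exact match: the whole TM is searched
    exact_share = 0.08   # messages with an exact match: half the TM on average
    print(fuzzy_share * messages * tm_entries)        # ≈ 274,528,000
    print(exact_share * messages * (tm_entries // 2)) # ≈ 11,936,000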


Let's go back to the numbers. Assuming that something like 5% of all messages
contain at least one word in an en_US-typical spelling, or a word that
is entirely en_US-specific, this would make 10,000 messages out of our
200,000 message example translation memory. The 5% number is of course a
rough estimate, but probably not that far off given how much the en_CA
and en_GB teams often have to translate. After all, words like "color"
are common in user interfaces, and GNOME is no exception.

Anyway, assume that the vast majority of these 10,000 messages, say 80%,
contain only a single occurrence of an en_US-typical word. Those 8,000
entries we need to double to be able to cope with en_GB variants of
these messages as well. Perhaps 15%, 1,500 messages, contain two
en_US-typical words. Those need quadrupling. And say that 4%, 400
messages, contain three such words. Those need to be multiplied by
eight. The rest, 1% or 100 messages, contain four or more en_US words.
For simplicity's sake we'll assume they use four, and they need to be
replaced by 16 entries each.

So our translation memory database gets 16,000+6,000+3,200+1,600=26,800
entries bigger, and hence we have to check each message against that
many more entries. In the Nautilus case, that would mean 18,393,376
more database comparisons and 3,198,848 more string comparisons, just
for allowing en_GB spellings as well as en_US ones.
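
(Again just restating the arithmetic above as a calculation, with the expansion factors from the previous paragraph and the same 92%/8% split as before:)

    # messages containing 1, 2, 3 or 4 en_US-typical words, and the number
    # of spelling combinations each one expands into
    groups = [(8000, 2), (1500, 4), (400, 8), (100, 16)]
    expanded = sum(count * variants for count, variants in groups)
    print(expanded)                        # 26,800 extra entries to search
    print(0.92 * 1492 * (expanded // 2))   # ≈ 18,393,376 more comparisons
    print(0.08 * 1492 * expanded)          # ≈ 3,198,848 more comparisons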


What we've ignored so far, though, is that messages aren't usually added
to TMs this way -- TMs are databases of previously translated messages,
so we would also need a probability analysis of which combinations would
actually end up in the TM after a while and hence could be used later
on. If that exact combination isn't there, our only hope is that a
string comparison would return a similar combination. But similar
matches require manual work by a translator to correct them, perhaps 10
times the amount of time needed for just verifying that an exact match
was correct. So there is a significant decrease in the likelihood that a
match will be exact in practice, and hence a substantial increase in the
amount of translator time needed. So counting extra database lookups
isn't really what matters in practice; what matters is the probability
analysis, the reduced probability of finding exact matches, and the
drastic increase in translator time needed as a direct consequence.


All this increase in database sizes and computing needs and translator
work because some people couldn't accept that we need to standardize on
a single language in our desktop source. Given that we can and do
standardize on things like the amount of spacing needed in dialogs and
the positioning of widgets, it's quite absurd.


So Bastien, I'm glad that you aren't arguing it. But let it be a lesson
for those ignorant people who still claim this isn't a major problem for
translation.


Christian

--
Rodolfo M. Raya <rmraya maxprograms com>
Maxprograms

