Considering that every message entry in existing translation memories that happens to contain any en_US-typical word or spelling would need to be duplicated with a corresponding entry that has the same sentence and context but the en_GB spelling, it would not be a matter of a "few differing words". We're talking about a significant increase in message database size here. And then we haven't even considered message entries that happen to contain *two* words whose spelling differs between en_US and en_GB, like, say, the message "Customize Colours". Then we would need four complete database entries, one for each spelling combination, to be able to cope with both en_US and en_GB occurring in the original source messages.

But hopefully we can assume that developers wouldn't mix en_US and en_GB spellings in the same message. Or can we? I can say for sure that I mix both spellings all the time when I write English, simply because I cannot always tell which spelling is which. And experience shows that many developers who aren't native English speakers have the same problem.

Nevertheless, I think we have established that we would need a drastic increase in translation memory sizes for them to be as useful as before, and a corresponding increase in computer horsepower to cope with it, if we want to allow both en_US and en_GB. Translation memories aren't the smallest of databases, and if you run one against a translation like Nautilus, with 1,492 messages that need translation, and compare each message with the huge number of entries in the translation memory, you get an estimate of the number of database comparisons needed and the amount of processing involved (yes, 1,492 × sizeof(translation_memory)). Divide that by two, since we stop searching once we've found an exact match, and on average we can assume we need to search half of the entries before finding it.

So what, you say? Even if the translation memory has a few million entries, or just a few hundred thousand, say 200,000, a fast computer does those 149,200,000 database comparisons in no time. But then we've only considered exact matches. Translation memory systems usually do a lot more, also picking the most similar matches from the database when no exact match is available, which is the most common case -- on average you get perhaps 8% exact matches when creating a new translation from scratch using a big TM. A similar match is then flagged (fuzzy in the po case) so that translators know they need to verify/fix it manually. But if you don't find an exact match, you need to go through the entire database. So, assuming that 92% of the Nautilus messages need the full 200,000 string comparisons, that makes 274,528,000 string comparisons. The 8% that do find an exact match will need on average 100,000 database entry comparisons each, or 11,936,000 in total. That's an awful lot of comparisons for producing the start of a Nautilus translation using our translation memory. Some string comparisons are trivial, some are not.

Let's get back to the numbers. Assuming that something like 5% of all messages contain at least one word with an en_US-typical spelling, or a word that is entirely en_US-specific, that would make 10,000 messages out of our 200,000-entry example translation memory. The 5% figure is of course a rough estimate, but probably not that far off, given how much the en_CA and en_GB teams often have to translate. After all, words like "color" are common in user interfaces, and GNOME is no exception.
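Just to make the arithmetic above concrete, here is a small sketch of how I'm counting (Python; the 200,000 entries, 1,492 messages and 8% exact match rate are of course just my assumptions from above, not measured values):

    # Rough model of the comparison counts above. All numbers are
    # assumptions/estimates from this mail, not measurements.
    TM_ENTRIES = 200_000   # entries in the example translation memory
    MESSAGES = 1_492       # messages needing translation in Nautilus
    EXACT_RATE = 0.08      # assumed share of messages with an exact TM match

    # Exact matches: we can stop as soon as we hit the matching entry,
    # so on average we search half of the TM.
    exact_lookups = EXACT_RATE * MESSAGES * TM_ENTRIES / 2

    # No exact match: every TM entry has to be string-compared for similarity.
    fuzzy_comparisons = (1 - EXACT_RATE) * MESSAGES * TM_ENTRIES

    print(f"{exact_lookups:,.0f} database entry comparisons")   # 11,936,000
    print(f"{fuzzy_comparisons:,.0f} string comparisons")       # 274,528,000

Nothing fancy, but it shows where the big numbers above come from.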
Anyway, assume that the vast majority of these 10,000 messages, say 80%, contain only a single occurrence of an en_US-typical word. Those 8,000 entries need to be doubled to cope with the en_GB variants of these messages as well. Perhaps 15%, or 1,500 messages, contain two en_US-typical words. Those need quadrupling. Say that 4%, or 400 messages, contain three such words. Those need to be multiplied by eight. The rest, 1% or 100 messages, contain four or more en_US-typical words; for simplicity's sake we'll assume they contain exactly four, so they need to be replaced by 16 entries each. So those 10,000 entries grow to 16,000 + 6,000 + 3,200 + 1,600 = 26,800 entries, which means the translation memory gets 16,800 entries bigger, and hence we have to check each message against that many more entries (there's a small script in a P.S. at the end of this mail that redoes this bookkeeping, for anyone who wants to check the numbers). In the Nautilus case, with the same 92%/8% split as before, that would mean 23,060,352 more string comparisons and 1,002,624 more database comparisons, just for allowing en_GB spellings as well as en_US ones.

What we've ignored so far, though, is that messages aren't usually added to TMs this way -- TMs are databases of previously translated messages, so we would also need a probability analysis of which spelling combinations would actually end up in the TM after a while and hence could be reused later on. If the exact combination isn't there, our only hope is that a string comparison will return a similar combination. But similar matches require manual work by a translator to correct them, perhaps 10 times the time needed for just verifying that an exact match is correct. So in practice there is a significant decrease in the likelihood that a match will be exact, and hence a substantial increase in the amount of translator time needed. The counting of extra database lookups isn't really what matters in practice; what matters is the probability analysis, the reduced probability of finding exact matches, and the drastic increase in translator time needed as a direct consequence.

All this increase in database size, computing needs and translator work because some people couldn't accept that we need to standardize on a single variant of English in our desktop source. Given that we can and do standardize on things like the amount of spacing in dialogs and the positioning of widgets, it's quite absurd. So Bastien, I'm glad that you aren't arguing for it. But let it be a lesson for those ignorant people who still claim this isn't a major problem for translation.

Christian
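P.S. For anyone who wants to check the bookkeeping above, here is the same arithmetic as a small Python script. The 5% share and the 80/15/4/1 split per number of en_US-typical words are just my rough guesses, so treat the output as an illustration rather than a measurement:

    # Estimated growth of the TM if every en_US-typical spelling also has
    # to be covered in its en_GB form. All shares are guesses from this mail.
    AFFECTED = 10_000                               # 5% of the 200,000 TM entries
    SPLIT = {1: 0.80, 2: 0.15, 3: 0.04, 4: 0.01}    # en_US-typical words per entry

    extra_entries = 0
    for words, share in SPLIT.items():
        count = AFFECTED * share
        # Each entry needs 2^n spelling combinations; one of them already exists.
        extra_entries += count * (2 ** words - 1)

    MESSAGES = 1_492      # Nautilus messages, as above
    extra_fuzzy = 0.92 * MESSAGES * extra_entries       # full scan for the 92% without an exact match
    extra_exact = 0.08 * MESSAGES * extra_entries / 2   # half scan on average for the 8% with one

    print(f"{extra_entries:,.0f} extra TM entries")            # 16,800
    print(f"{extra_fuzzy:,.0f} extra string comparisons")      # 23,060,352
    print(f"{extra_exact:,.0f} extra database comparisons")    # 1,002,624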