> [: Simos Xenitellis :]
> Such a thing would be useful if, for example, you want to make an
> announcement of the localisation of GNOME 3.0 to your language and you
> want to show how much work each translator did.

On a philosophical note, I'm not sure whether such statistics would
serve much purpose. They could even induce the wrong motivations. I
don't see such statistics for programmers either (except anonymously in
surveys), and especially not in any milestone-type announcements.

It's hard to define "how much a translator did". For example, reviewers
may be doing a lot of *removing* of superfluous verbosity, which is
typical of inexperienced translations from English into more inflected
languages.

In practice I've found that the Last-Translator header field is not very
accurate for contribution assignment. Frequently this field is not
properly updated (when someone doesn't work strictly with a dedicated PO
editor), or a quick fix by the reviewer is committed before the
translator's original content (so Last-Translator is set to the
reviewer), and even name spellings or ordering change.

I also thought about providing these statistics for my team in KDE --
due to workflow particularities it would be easy to implement, quick to
compute and precise down to the word -- but decided against it for the
above reasons. (Even leaving aside the accuracy of contribution
assignment, which our workflow makes implicitly correct.)

> The current algorithm is
> 1. Obtain the before and after versions of a PO file.
> 2. Use 'pocount' to count the translated strings in both, note down the
> different
> 3. Use 'podiff' to count the 'changed' messages (message updates).
>
> Disadvantages
> a. 'pocount' shows sometimes [...]

You could use poediff from Pology.
It produces an "embedded diff" (ediff for short) between PO files,
described here:

  http://techbase.kde.org/Localization/Tools/Pology/PO_Embedded_Diffing

If you already have two versions of the catalog exported from the
repository, you would execute:

  poediff -b LANG-revA.po LANG-revB.po > ediff.po

(-b ignores obsolete messages). Or, if you want to operate just with
revisions:

  poediff -c git -b path/in/repo/LANG.po -r revA:revB > ediff.po

The resulting ediff.po is a valid PO file in itself, containing all the
messages modified between the revisions. Every modified message can have
{-...-} and {+...+} ediff segments in the msgid field (if the message
was fuzzied and then updated, they show what changed in the original)
and in the msgstr field (what changed in the translation). Diffing is
done on the word level, so these segments contain full words.

Then, as the contribution between the two revisions, I would compute the
number of "equivalent translated words" for each message as follows:

* If the message was added or its msgid was modified, count the words
  inside {+...+} segments in msgid.

* If only the msgctxt was added or modified, check whether msgstr is
  modified: if yes, count the full msgid again; if not (the context was
  not significant), count 1 bonus word for the unfuzzying effort.

* If there is no change in msgid or msgctxt but there is in msgstr, this
  was a review update of the translation. Separately count the words in
  {-...-} and in {+...+} segments in msgstr, and add the larger count to
  the total. Or possibly average the two.

* If the above results in a zero word count (e.g. if the change in the
  original was only in punctuation or markup), add one or more
  equivalent words to indicate that some effort was still spent. For
  example, the larger of the number of removed and added segments in
  msgid.
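
To make the segment notation concrete, here is a rough sketch in plain Python, independent of Pology, of the third rule (translation-only change). It assumes that segments look exactly like {-...-} and {+...+} as described above, ignores any escaping that the real ediff format may apply, and treats a "word" as a run of \w characters. The function names are made up for this illustration.

```python
import re

# Assumed segment syntax, taken from the description above:
# {-removed text-} and {+added text+}. Escaping is not handled.
_REMOVED = re.compile(r"\{-(.*?)-\}", re.DOTALL)
_ADDED = re.compile(r"\{\+(.*?)\+\}", re.DOTALL)
_WORD = re.compile(r"\w+")  # simplistic notion of a word


def segment_word_counts(ediff_text):
    """Return (removed, added) word counts inside the ediff segments."""
    removed = sum(len(_WORD.findall(s)) for s in _REMOVED.findall(ediff_text))
    added = sum(len(_WORD.findall(s)) for s in _ADDED.findall(ediff_text))
    return removed, added


def review_effort(msgstr_ediff):
    """Equivalent words for a translation-only change.

    Takes the larger of the removed and added word counts, and at
    least 1 so that punctuation-only fixes still register.
    """
    removed, added = segment_word_counts(msgstr_ediff)
    return max(removed, added, 1)
```

For instance, review_effort("Open {-the selected-}{+a+} file") gives 2, because two words were removed and one was added.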

Here is some code using Pology that would do this counting without
resorting to raw string operations (not thoroughly tested):

def count_equiv_trans_words(ediffpo):

    from pology.file.catalog import Catalog
    from pology.misc.diff import word_ediff_to_old, word_ediff_to_new
    from pology.misc.diff import word_ediff_to_rem, word_ediff_to_add
    from pology.misc.split import proper_words

    cat = Catalog(ediffpo)
    hctxt = cat.header.get_field_value("X-Ediff-Header-Context")

    num_eqwords_total = 0
    for msg in cat:
        if msg.msgctxt == hctxt: # inserted header-diff message
            continue

        msgid_old = word_ediff_to_old(msg.msgid)
        msgid_new = word_ediff_to_new(msg.msgid)
        if msgid_new is None: # removed message
            continue
        msgctxt_old = word_ediff_to_old(msg.msgctxt)
        msgctxt_new = word_ediff_to_new(msg.msgctxt)
        msgstr_old = word_ediff_to_old(msg.msgstr[0])
        msgstr_new = word_ediff_to_new(msg.msgstr[0])
        # ...for plural messages only singulars are taken into account.

        num_eqwords = 0

        # Added message or modified original text.
        if msgid_old != msgid_new:
            num_eqwords = len(proper_words(word_ediff_to_add(msg.msgid)))

        # Added or updated context.
        elif msgctxt_old != msgctxt_new:
            if msgstr_old != msgstr_new:
                num_eqwords = len(proper_words(msgid_new))
            else:
                num_eqwords = 1

        # Modified translation only.
        elif msgstr_old != msgstr_new:
            words_add = proper_words(word_ediff_to_add(msg.msgstr[0]))
            words_rem = proper_words(word_ediff_to_rem(msg.msgstr[0]))
            num_eqwords = max(len(words_add), len(words_rem))

        # No proper words counted so far, but some effort was still spent.
        if num_eqwords == 0:
            if msgid_old != msgid_new:
                segs_add = word_ediff_to_add(msg.msgid, sep=None)
                segs_rem = word_ediff_to_rem(msg.msgid, sep=None)
                num_eqwords = max(len(segs_add), len(segs_rem))
            else:
                # Some other change (could be in translator comments).
                num_eqwords = 1

        num_eqwords_total += num_eqwords

    return num_eqwords_total

Unfortunately, Pology too is not yet a released piece of software, and
poediff is a bit on the slow side.

-- 
Chusslove Illich (Часлав Илић)