Re: Statistics for each GNOME translator's work



> [: Simos Xenitellis :]
> Such a thing would be useful if, for example, you want to make an
> announcement of the localisation of GNOME 3.0 to your language and you
> want to show how much work each translator did.

On a philosophical note, I'm not sure such statistics would serve much
purpose. They could even induce the wrong motivations. I don't see such
statistics for programmers either (except anonymously in surveys), and
especially not in any milestone-type announcements.

It's hard to define "how much a translator did". For example, reviewers may
be doing a lot of *removing* of superfluous verbosity, which is typical of
inexperienced translations from English into more heavily inflected
languages.

In practice I've found that the Last-Translator header field is not very
accurate for assigning contributions. Frequently the field is not properly
updated (when someone doesn't work strictly with a dedicated PO editor), or
the reviewer makes a quick fix and commits the result before the
translator's original content was ever committed (so Last-Translator is set
to the reviewer), and even name spellings or name ordering change.

I also thought about providing these statistics for my team in KDE -- due to
workflow particularities it would be easy to implement, quick to compute and
precise down to the word -- but decided against it for the above reasons.
(That is, even setting aside the accuracy of contribution assignment, which
our workflow gets right implicitly.)

> The current algorithm is
> 1. Obtain the before and after versions of a PO file.
> 2. Use 'pocount' to count the translated strings in both, note down the
> different
> 3. Use 'podiff' to count the 'changed' messages (message updates).
>
> Disadvantages
> a. 'pocount' shows sometimes [...]

You could use poediff from Pology. It produces an "embedded diff" (ediff for
short) between PO files, described here:

http://techbase.kde.org/Localization/Tools/Pology/PO_Embedded_Diffing

If you have already exported two versions of the catalog from the
repository, you would execute:

  poediff -b LANG-revA.po LANG-revB.po > ediff.po

(-b ignores obsolete messages.) Or, if you want to work directly from
repository revisions:

  poediff -c git -b path/in/repo/LANG.po -r revA:revB > ediff.po

The resulting ediff.po is itself a valid PO file, containing all the
messages modified between the revisions. Every modified message can have
{-...-} and {+...+} ediff segments in the msgid field (if the message was
fuzzied and then updated, showing what changed in the original) and in the
msgstr field (showing what changed in the translation). Diffing is done on
the word level, so these segments contain full words.
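
To make the segment notation concrete, here is a minimal sketch in plain
Python (no Pology involved) that pulls the removed and added segments out of
an ediff string and reconstructs the old and new text. The function names
and the sample string are made up for illustration, and the patterns are
simplified: they assume the text itself never contains literal "{-", "-}",
"{+" or "+}" sequences, which a real ediff has to handle somehow.

```python
import re

# Simplified patterns for the two ediff segment types (assumption: no
# literal segment delimiters occur in the underlying text).
REMOVED = re.compile(r"\{-(.*?)-\}", re.DOTALL)
ADDED = re.compile(r"\{\+(.*?)\+\}", re.DOTALL)


def ediff_segments(text):
    """Return (removed, added) lists of segment contents."""
    return REMOVED.findall(text), ADDED.findall(text)


def ediff_to_old(text):
    """Old version: keep removed segments, drop added ones."""
    return REMOVED.sub(r"\1", ADDED.sub("", text))


def ediff_to_new(text):
    """New version: keep added segments, drop removed ones."""
    return ADDED.sub(r"\1", REMOVED.sub("", text))


def count_words(segments):
    """Count whitespace-separated words over a list of segments."""
    return sum(len(seg.split()) for seg in segments)


# Hypothetical ediff of a single msgstr.
sample = "Open {-the selected-}{+a+} file"
removed, added = ediff_segments(sample)
print(ediff_to_old(sample))
print(ediff_to_new(sample))
print(count_words(removed), count_words(added))
```

Counting on whitespace is cruder than a proper word splitter, which would
also discard markup and punctuation, so treat the numbers as rough.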

Then, as a contribution between the two revisions, I would compute the
number of "equivalent translated words" for each message as follows:

* If the message was added or its msgid was modified, count the words inside
  {+...+} segments in msgid.

* If only the msgctxt was added or modified, check if msgstr is modified: if
  yes, count full msgid again; if not (context was not significant), count 1
  bonus word for the unfuzzying effort.

* If there is no change in msgid or msgctxt but there is in msgstr, this was
  a review update of translation. Separately count the words in {-...-} and
  in {+...+} segments in msgstr, and add the larger one to the total. Or
  possibly average it.

* If the above results in a zero word count (e.g. if the change in the
  original was only in punctuation or markup), add one or more equivalent
  words to indicate that some effort was still spent. For example, the
  larger of the number of removed and added segments in msgid.

Here's some code using Pology that would do this counting without resorting
to raw string operations (not thoroughly tested):

  def count_equiv_trans_words (ediffpo):

      from pology.file.catalog import Catalog
      from pology.misc.diff import word_ediff_to_old, word_ediff_to_new
      from pology.misc.diff import word_ediff_to_rem, word_ediff_to_add
      from pology.misc.split import proper_words

      cat = Catalog(ediffpo)
      hctxt = cat.header.get_field_value("X-Ediff-Header-Context")
      num_eqwords_total = 0
      for msg in cat:
          if msg.msgctxt == hctxt: # inserted header-diff message
              continue
          msgid_old = word_ediff_to_old(msg.msgid)
          msgid_new = word_ediff_to_new(msg.msgid)
          if msgid_new is None: # removed message
              continue
          msgctxt_old = word_ediff_to_old(msg.msgctxt)
          msgctxt_new = word_ediff_to_new(msg.msgctxt)
          msgstr_old = word_ediff_to_old(msg.msgstr[0])
          msgstr_new = word_ediff_to_new(msg.msgstr[0])
          # ...for plural messages only singulars are taken into account.

          num_eqwords = 0
          # Added message or modified original text.
          if msgid_old != msgid_new:
              num_eqwords = len(proper_words(word_ediff_to_add(msg.msgid)))
          # Added or updated context.
          elif msgctxt_old != msgctxt_new:
              if msgstr_old != msgstr_new:
                  num_eqwords = len(proper_words(msgid_new))
              else:
                  num_eqwords = 1
          # Modified translation only.
          elif msgstr_old != msgstr_new:
              words_add = proper_words(word_ediff_to_add(msg.msgstr[0]))
              words_rem = proper_words(word_ediff_to_rem(msg.msgstr[0]))
              num_eqwords = max(len(words_add), len(words_rem))

          # No changed proper words in translation.
          if num_eqwords == 0:
              if msgid_old != msgid_new:
                  segs_add = word_ediff_to_add(msg.msgid, sep=None)
                  segs_rem = word_ediff_to_rem(msg.msgid, sep=None)
                  num_eqwords = max(len(segs_add), len(segs_rem))
              else:
                  # Some other change (could be in translator comments).
                  num_eqwords = 1

          num_eqwords_total += num_eqwords

      return num_eqwords_total

Unfortunately, Pology too is not yet a released piece of software, and
poediff is a bit on the slow side.

-- 
Chusslove Illich (Часлав Илић)
