Re: Translation status pages



Today at 5:20, Owen Taylor wrote:

> Something that takes 4 hours of CPU time on window (a day?) probably
> isn't a huge deal ... window isn't terribly CPU-bound currently... and
> the process could be niced down.

Stats are usually done multiple times a day.  I am not quite sure of
the current schedule, but I think Carlos is doing them at least 3 times
a day (7am, 3pm, 11pm).  If you ask translators, they'd prefer to have
them updated as often as possible.

Though, with the current code, that's not realistic.  Carlos' new
stuff should provide that with much lower CPU usage, by watching CVS
directly.  Now, if only someone could find the time to finish the code
in case Carlos doesn't make it (it's available in his svn repo
somewhere on carlos.pemas.net, I think :).

> But if it's doing significant disk work - so ejecting stuff out of
> cache, then it's going to impact all bugzilla users, all anoncvs users,
> all people accessing www.gnome.org etc. window isn't really a good place
> to run intensive jobs, because so much is going on there.

It does do significant disk work: it basically checks out the entire
GNOME CVS, runs "intltool-update -p" and then "msgmerge" on every
single PO file in the GNOME CVS repository (sometimes for multiple
branches), and creates hundreds of static .html files containing the
statistics.
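
For illustration, here's roughly what one pass looks like as a Python
sketch (paths, module list and POT file name are all made up; the
real scripts surely differ):

    import glob
    import os
    import subprocess

    WORKDIR = "/var/tmp/status-pages"   # hypothetical scratch area
    MODULES = ["evolution", "gtk+"]     # in reality: all of GNOME CVS

    for module in MODULES:
        # 1. full checkout of the module (the disk-heavy part)
        subprocess.run(["cvs", "co", module], cwd=WORKDIR, check=True)
        podir = os.path.join(WORKDIR, module, "po")
        # 2. regenerate the POT template from the sources
        subprocess.run(["intltool-update", "-p"], cwd=podir, check=True)
        # 3. merge every translation against the fresh POT (CPU-heavy)
        pot = os.path.join(podir, module + ".pot")  # name is a guess
        for po in glob.glob(os.path.join(podir, "*.po")):
            subprocess.run(["msgmerge", "--update", po, pot], check=True)
        # 4. count translated/fuzzy/untranslated, write .html pages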

> Things to do:
>
>  - Try running it on window, see if it really takes 4 hours, or 2 hours.
>    (30-45 minutes might be an OK time for an intensive task to churn.)
>
>  - Get someone to look at optimizing it. You can do an incredible amount
>    of work in 4 hours these days ... if this task is taking 4 hours,
>    it's being done inefficiently. (Not volunteering)

It is somewhat inefficient, and Carlos acknowledges that: he has new
status pages in the works which would provide more features and
should be better suited to running on window. :)

The big CPU-bound task is running "msgmerge" with fuzzy matching: it
uses a slow string-distance algorithm to find the "most similar"
existing strings and reuse their translations; now imagine that
running 50 times over a set of 5000 strings (e.g. Evolution with its
50 translations).  My estimate is that most of the time is spent
there.  As a check on that assumption, I just did a test run without
fuzzy matching, and the runtime dropped from 4.5 hours to 1.8 hours
on i18n-status.gnome.org (a different machine, hosted by Keld).
Since fuzzy matching is very important for translators, we can't
simply drop it, so basically, most of the process is CPU bound (part
of those 1.8 hours is CPU bound as well, and the 60% time difference
is definitely CPU- and memory-related).
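
To make that cost concrete, here is a minimal Python stand-in for the
fuzzy matching, using difflib in place of gettext's own
string-distance code (the real algorithm differs, but the shape of
the work is the same):

    import difflib

    def fuzzy_merge(new_msgids, old_translations):
        """Map each new msgid to the translation of its most
        similar old msgid, if one is close enough; this scan is
        the O(new * old) hot loop."""
        old_msgids = list(old_translations)
        merged = {}
        for msgid in new_msgids:
            close = difflib.get_close_matches(msgid, old_msgids,
                                              n=1, cutoff=0.6)
            if close:
                merged[msgid] = old_translations[close[0]]  # fuzzy
        return merged

Doing that once per language multiplies the cost by the number of PO
files: roughly 5000 msgids times 50 languages for Evolution.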


I have some ideas for optimising the msgmerge step that should do a
significantly better job: I'd first concatenate all the PO files to
get a list of *all* English strings, then run a msgmerge-style step
once against that list and the POT file, creating a table of
similarity matches.  The problem is that this requires a bit more
memory, but it should speed the process up O(n) times, where "n" is
the number of PO files/languages, provided we don't run into
excessive memory page faults and swapping :).
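
In sketch form (function names hypothetical, difflib standing in for
the msgmerge-style distance search again), the expensive matching
runs once and each language becomes a cheap lookup:

    import difflib

    def build_match_table(pot_msgids, all_old_msgids):
        """Expensive step, done ONCE: map each POT msgid to its
        closest msgid from the union of all languages' PO files."""
        table = {}
        for msgid in pot_msgids:
            close = difflib.get_close_matches(msgid, all_old_msgids,
                                              n=1, cutoff=0.6)
            if close:
                table[msgid] = close[0]
        return table

    def merge_one_language(pot_msgids, po_translations, table):
        """Cheap step, done once per language: dict lookups only.
        (If a language lacks the union's best match we just skip
        it; a real version could keep the top few matches.)"""
        merged = {}
        for msgid in pot_msgids:
            match = table.get(msgid)
            if match in po_translations:
                merged[msgid] = po_translations[match]  # fuzzy
        return merged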


As for disk optimisations, storing the statistics in a database
(which is what Carlos' new code does) is probably far better than
generating hundreds of .html files.
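
For instance, a trivial sqlite sketch (the schema and sample row are
made up, and certainly differ from what Carlos actually does):

    import sqlite3

    db = sqlite3.connect("status.db")
    db.execute("""CREATE TABLE IF NOT EXISTS stats (
                      module       TEXT,
                      branch       TEXT,
                      language     TEXT,
                      translated   INTEGER,
                      fuzzy        INTEGER,
                      untranslated INTEGER,
                      PRIMARY KEY (module, branch, language))""")
    db.execute("INSERT OR REPLACE INTO stats VALUES (?, ?, ?, ?, ?, ?)",
               ("evolution", "HEAD", "sr", 4200, 500, 300))  # sample
    db.commit()
    # Pages can then be rendered on demand from one small database
    # instead of rewriting hundreds of .html files on every run.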

Another disk optimisation concerns the handling of CVS checkouts.
For various reasons, all CVS checkouts are usually done from scratch,
i.e. the old checkout is first removed and only then is the module
checked out again with "cvs co".  If there were no hand-tuning of CVS
repositories in GNOME, maybe "cvs up -Pd" would be sufficient?  I
don't know enough details of CVS hacking to answer that, but the
basic requirement is that we ensure a pristine CVS tree before
running "intltool-update -p" and msgmerge.
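
Something along these lines, perhaps (paths hypothetical); it falls
back to a fresh checkout only when the incremental update fails.
Note that "cvs up" alone does not undo local modifications, which is
exactly the pristine-tree worry:

    import os
    import shutil
    import subprocess

    def update_checkout(workdir, module):
        checkout = os.path.join(workdir, module)
        if os.path.isdir(checkout):
            # -P prunes empty directories, -d picks up new ones
            result = subprocess.run(["cvs", "up", "-Pd"], cwd=checkout)
            if result.returncode == 0:
                return checkout
            shutil.rmtree(checkout)  # update failed: start over
        subprocess.run(["cvs", "co", module], cwd=workdir, check=True)
        return checkout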

>  - If it really is that intensive, it's not optimizable, we need it on a
>    gnome.org server, than container is probably the most appropriate
>    home:
>
>     window: 2 gig ram, 72gig (raid 1) disk, load avg ~1 
>     container: 6 gig ram, 500gig (raid 5) disk, load avg ~0.2

Any machine with sufficient CPU power and low load will do.  But some
amount of disk-bound work is still necessary either way, since we are
talking about parsing the full CVS tree to find extractable strings,
and then working through each PO file in turn.

Cheers,
Danilo

