Re: [BuildStream] Nuking the pickle jobber.
- From: Tristan Van Berkom <tristan vanberkom codethink co uk>
- To: Sander Striker <s striker striker nl>
- Cc: BuildStream <buildstream-list gnome org>
- Subject: Re: [BuildStream] Nuking the pickle jobber.
- Date: Tue, 16 Jun 2020 15:44:18 +0900
Hi,
Thanks for bringing it to the list.
On Mon, 2020-06-15 at 22:52 +0200, Sander Striker wrote:
Hi Tristan,
Angelos responded on !1965 (
https://gitlab.com/BuildStream/buildstream/-/merge_requests/1965#note_361444144
). I'll quote it below for completeness.
Cheers,
Sander
I think at this point, having this feature in its present state
serves mostly as an obstacle to any progress and refactoring,
although it would be great if someone wanted to own this feature,
fix it, and eventually complete #1021.
Unfortunately Mac, Windows, and "plugin replay" support didn't draw
the huge amount of interest I'd hoped for, so perhaps this is for the
best. It's nice to see code become less complicated at least.
Here are some other points in favour of keeping these changes, in
case one of them is persuasive :)
First, it may not be obvious that this is also a step back for
support on Mac. As of Python 3.8, the spawn method of starting child
processes is the default, as fork is considered unsafe; more details
here. That means that state must be pickled over to the child jobs.
The recklessness with which Python developers break API across minor
point versions never ceases to amaze me :)
That said, we can certainly force the method to use "fork", which is
what I expect many other python programs which were previously using
multiprocessing will be doing if they need Mac support.
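For reference, pinning the start method is straightforward with the standard multiprocessing API. A minimal sketch (the child function and its message are illustrative; note that the "fork" context is unavailable on native Windows):

```python
import multiprocessing


def child(queue):
    # Runs in the forked child; it inherits the parent's memory, so no
    # pickling of parent state is required to get work over there
    queue.put("hello from forked child")


if __name__ == "__main__":
    # Explicitly request "fork" instead of relying on the platform
    # default (which became "spawn" on macOS as of Python 3.8)
    ctx = multiprocessing.get_context("fork")
    queue = ctx.Queue()
    process = ctx.Process(target=child, args=(queue,))
    process.start()
    print(queue.get())  # → hello from forked child
    process.join()
```

Using `get_context()` rather than the global `set_start_method()` keeps the choice local, so it doesn't interfere with any other multiprocessing use in the same interpreter.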
[...]
Unfortunately errors during pickling are notoriously tricky to debug,
which certainly adds drag.
This cannot possibly be overstated.
As I described in the MR, the whole process of figuring out what was
going awry and fixing one of the easier instances of the problem ate a
full day of my time (for one of two problems, I am still blocked by the
fact that instantiating a plugin requires executing Loader code in
order to search for the junction from whence a plugin should be
loaded).
For someone less motivated, this kind of wall costs us a would-be
contributor, who I would expect to give up on our project entirely
when faced with such a problem.
For someone equally motivated but without prior knowledge of how object
references work, it could have easily cost a few days or even a week to
figure it out on their own.
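To illustrate the class of failure (a contrived sketch, not actual BuildStream code): pickle reports the offending *type*, but gives no hint of the attribute chain that led to it, so the developer is left to bisect the object graph by hand:

```python
import pickle
import threading


class Loader:
    def __init__(self):
        self._lock = threading.Lock()  # unpicklable, buried one level down


class Element:
    def __init__(self, name):
        self.name = name
        self._loader = Loader()  # indirect reference to the lock


try:
    pickle.dumps(Element("base.bst"))
except TypeError as e:
    # The message names '_thread.lock', but not Element._loader._lock,
    # which is what you actually need to know in order to fix it
    print(e)  # → cannot pickle '_thread.lock' object
```

In a codebase the size of BuildStream, the reference can be many levels deep and only reachable in certain scheduler states, which is exactly what makes these failures so expensive to track down.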
With the current motion to adopt PEP 484[0], I predict that the
necessary shuffling of object relationships to eliminate implied
cyclic dependencies will cause this brand of issue to resurface
multiple times.
I think that whatever happens, the code in its present form is
unacceptable in master. This may be controversial to say, but IMO
code introduced into master should not only work, but should at the
very least be designed to minimize the future cost of ownership as
much as is conceivable at the time we add it.
From a full reading of your message, I would propose the following
changes in order to make this burden acceptable:
* The act of clearing a member which should not be pickled
(regardless of whether it is in the regular case or in the
"Job specific" case), should be implemented as a decorator.
E.g.:
      def __init__(self, name):

          # Clear this member upon serialization
          #
          @serialization.clear
          self.name = name
Avoiding any duplication of knowledge in __getstate__,
__setstate__, or get_state_for_child_job_pickling().
* The same should be true of custom serialization of attributes,
e.g.:
      def __init__(self, name):

          # Use a custom serialization routine for this attribute
          #
          @serialization.custom(_serialize_name, _deserialize_name)
          self.name = name
* jobpickler.py should not have intimate knowledge of the classes
which are being pickled for the sake of job pickling.
Instead we should have a base interface which does the
get_state_for_child_job_pickling() implementation automatically
for attributes specifically marked with decorators. This could
be called `JobSerializable` and implemented/inherited by classes
which need to buy into this.
jobpickler.py then would simply override the dispatch table
once for this custom thing:
      pickler.dispatch_table[JobSerializable] = _reduce_object
Having these decorators and this class(interface) achieves two
things:
* Added safety, no need to declare all of this variable handling
in multiple places, or add assert statements as is discussed
here[1].
* Provides an actual interface to document, where we explain
to developers "how to do it", instead of requiring developers
actually read and deeply understand a complex conversation
going on in jobpickler.py.
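One possible shape for such a base interface, as a sketch under assumptions: the names `JobSerializable`, `_job_cleared` and `_rebuild` are invented here, and attributes are marked with a class-level set rather than the decorator syntax above, since Python decorators cannot annotate plain assignments:

```python
import pickle


def _rebuild(cls, state):
    # Recreate the instance in the child job without running __init__,
    # then restore the filtered state
    obj = cls.__new__(cls)
    obj.__dict__.update(state)
    return obj


class JobSerializable:
    """Base interface for objects that buy into job pickling."""

    # Attribute names to clear when serializing for a child job
    _job_cleared = frozenset()

    def get_state_for_child_job_pickling(self):
        state = self.__dict__.copy()
        for attr in self._job_cleared & state.keys():
            state[attr] = None
        return state

    def __reduce__(self):
        return (_rebuild,
                (type(self), self.get_state_for_child_job_pickling()))


class Element(JobSerializable):
    _job_cleared = frozenset({"_scheduler"})

    def __init__(self, name, scheduler):
        self.name = name
        self._scheduler = scheduler  # cleared on serialization


element = Element("base.bst", object())
restored = pickle.loads(pickle.dumps(element))
print(restored.name)        # → base.bst
print(restored._scheduler)  # → None
```

One caveat for the dispatch table idea: pickle's per-pickler `dispatch_table` is keyed by exact type, so registering `JobSerializable` there would not match its subclasses; this sketch therefore puts `__reduce__` on the base class instead, which subclasses inherit.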
* Similar to the above, we should have a `JobUnserializable`
interface which raises a *useful* error when __getstate__ is called
on that object.
This exception should have in its message a list of as much context
as possible; at the very least it should indicate:
* Which objects or functions have a direct reference on the
object at the time of serialization
* By what means they are referred to by this direct reference
(i.e. it should not be too hard to list which object attribute
is referencing the unserializable object).
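A sketch of the error-raising counterpart (again, the names are invented; `gc.get_referrers()` gives only best-effort context, and internal containers and frames show up among the referrers, but even the referrer types are far better than a bare pickling failure):

```python
import gc
import pickle


class JobUnserializable:
    """Objects that must never be pickled into a child job."""

    def __reduce__(self):
        # Best-effort context: name the types of the objects still
        # holding a direct reference to us at serialization time
        referrers = sorted({type(r).__name__ for r in gc.get_referrers(self)})
        raise TypeError(
            "{} cannot be serialized for a child job; currently "
            "referenced from objects of type: {}".format(
                type(self).__name__, ", ".join(referrers)))


class Scheduler(JobUnserializable):
    pass


class Element:
    def __init__(self):
        self._scheduler = Scheduler()  # the offending reference


try:
    pickle.dumps(Element())
except TypeError as e:
    print(e)
```

Raising from `__reduce__` means the error fires for every pickle protocol, and it fires at serialization time rather than surfacing as a confusing failure in the child process.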
What to do now?
~~~~~~~~~~~~~~~
I am personally a bit torn here, I want other developers to weigh in on
this.
The facts on the ground are:
* This is not a problem for Mac, we can always forcefully set
multiprocessing to use the "fork" method.
* This is a blocker for Win32 native support, and that is huge.
  I don't want to lose track of this; bringing it back after
  ditching it will also be difficult.
* I don't personally believe this should be a blocker for the
activity of subprocessing the scheduler. We should be able to
get away with a single (possibly expensive on WSL) fork() early
on and be able to do regular IPC over Queues to meet our needs.
This was never expected to be "easy" anyway.
And our options are:
1) Get rid of this serialization stuff
2) Refactor/reimplement the serialization stuff to reduce the cost
of ownership, and reduce the burden of knowledge required to
generally write BuildStream code.
This also requires solving the issue of instantiating plugins
in subprocesses via the "spawn" method (which does not work for
plugins loaded through the new "junction" origin).
If we decide to take this route, I would probably still try to
resolve the plugin instantiation issue first in order to land
long standing !1901 ASAP.
3) Solve the issue of instantiating plugins in subprocesses via the
"spawn" method without any refactoring of object serialization.
As I've pretty much stated, I feel strongly opposed to this
approach; I don't think the current code is viable.
Keeping in mind that I've been working on !1901[2] for a long time now
(on and off while we finally agree on a design) and that this very high
priority branch is blocking on a decision, where should I be spending
my time?
Best Regards,
-Tristan
[0]: https://www.python.org/dev/peps/pep-0484/
[1]: https://gitlab.com/BuildStream/buildstream/-/merge_requests/1463#note_194278834
[2]: https://gitlab.com/BuildStream/buildstream/-/merge_requests/1901/