Re: [BuildStream] Proposal: A small number of subprocesses handling jobs



Hi,

On Mar 4, 2019, at 5:03 PM, Jürg Billeter <j bitron ch> wrote:

[...]
The Python core libraries themselves release the GIL responsibly when
calling into C libraries; I suspect that with BuildBox integration we
should be doing the same.
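To illustrate the point about the GIL: standard-library modules implemented in C, such as zlib, release the GIL while doing CPU-heavy work, so plain threads can genuinely run in parallel for such calls. This is only a generic sketch of that behaviour, not BuildStream code:

```python
# Sketch: zlib's C implementation drops the GIL while compressing,
# so CPU-bound compression calls can overlap across threads.
import zlib
from concurrent.futures import ThreadPoolExecutor

data = b"buildstream" * 200_000  # ~2 MB of compressible input

def compress_once(_):
    # Runs largely outside the GIL; other threads proceed meanwhile.
    return len(zlib.compress(data, 9))

with ThreadPoolExecutor(max_workers=4) as pool:
    sizes = list(pool.map(compress_once, range(4)))

print(sizes)  # four identical compressed sizes
```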

Virtual staging, i.e., combining the files/trees of the build
dependencies, is not planned to be moved to BuildBox.  BuildBox will
require an already merged tree as input, just like remote execution. 
With the recent (+ pending) optimizations, virtual staging has become
much faster, however, it might still be a bottleneck if we have to do
it in the main Python process.
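For readers unfamiliar with the term: "virtual staging" here means merging the file trees of the build dependencies into one tree before it is handed to the sandbox. A toy merge over nested dicts, purely illustrative and not BuildStream's actual data model:

```python
# Toy sketch of merging two dependency file trees (dicts standing in
# for directories, strings for files); later trees overwrite earlier.
def merge_trees(base, overlay):
    merged = dict(base)
    for name, node in overlay.items():
        if isinstance(node, dict) and isinstance(merged.get(name), dict):
            merged[name] = merge_trees(merged[name], node)  # merge subdirs
        else:
            merged[name] = node  # files overwrite
    return merged

dep_a = {"usr": {"bin": {"gcc": "file"}}}
dep_b = {"usr": {"bin": {"make": "file"}, "lib": {"libc.so": "file"}}}
print(merge_trees(dep_a, dep_b))
```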

Right, we should see how this bottleneck measures up to the cost of a fork().

While threading makes some implementation details tricky, it also
simplifies other parts (state synchronization would become simpler,
queues would not have to inform the data model of changes at all).

Moving the state logic to the main process is what simplifies these
parts.  This would be the case for both the threading approach and the
async approach.

Yes, you are right.

Also, I think my original question needs answering: how heavy were the
builds in the sample that shows spawning a process is unreasonably
slow, and how do we know this overhead is non-negligible?

With the knowledge that most builds will themselves spawn many
processes anyway, why is it worth making such drastic changes?
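One way to start answering that question is a micro-benchmark of process-startup cost. The sketch below compares multiprocessing's "fork" and "spawn" start methods; the numbers are purely illustrative and depend heavily on the parent process's memory footprint ("fork" is Unix-only):

```python
# Sketch: measure average startup/teardown cost of a no-op child
# process under each multiprocessing start method.
import multiprocessing as mp
import time

def child():
    pass  # no-op job: we only measure process startup cost

def time_start_method(method, n=20):
    ctx = mp.get_context(method)
    start = time.perf_counter()
    for _ in range(n):
        p = ctx.Process(target=child)
        p.start()
        p.join()
    return (time.perf_counter() - start) / n

if __name__ == "__main__":
    for method in ("fork", "spawn"):
        print(f"{method}: {time_start_method(method) * 1000:.1f} ms per process")
```

On a heavyweight parent (as the BuildStream main process is), fork's copy-on-write page-table setup is the cost being debated here.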

I agree that this question should be answered, however, my main
motivation is not fork(2) overhead but rather:
* The already mentioned simplified state handling.
* Avoid the issue with (e.g., gRPC, OSTree) background threads in the
  main process.
* Allow long-living gRPC connections, see #810. buildbox-casd would
  mitigate this, though, as a shared connection is much less important
  for a local service.
* Possible future native support for Windows, which doesn't support
  fork(2). Although, I don't see this happening in the foreseeable
  future.

I think the main question here is whether a plugin should be explicit about what gets processed in parallel 
or not.

Even if it only costs a context manager to mark a code fragment as eligible for parallelism, this makes 
plugin writing more complex than it needs to be, and imposes an undesirable rule/relationship with the core 
(if we find better ways to e.g. defeat the GIL in the future, we are still stuck with this awareness of 
parallelism in our API).
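To make the objection concrete, here is a hypothetical sketch of what such an API might look like. Neither `eligible_for_parallelism` nor this shape exists in BuildStream; it only illustrates the coupling being objected to:

```python
# Hypothetical: a context manager a plugin would use to mark a region
# as eligible to run on a worker. Purely illustrative, not real API.
from contextlib import contextmanager

@contextmanager
def eligible_for_parallelism():
    # A real core would hand the enclosed work to a scheduler here;
    # this placeholder is a no-op.
    yield

class MyElement:
    def assemble(self, sandbox):
        # The plugin author is now forced to know which parts the
        # core may parallelize -- the coupling criticized above.
        with eligible_for_parallelism():
            self._do_expensive_work(sandbox)

    def _do_expensive_work(self, sandbox):
        return "result"
```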

I value our freedom of change and the simplicity of the plugin API much more than I dislike dealing with 
synchronizing the state the way we currently do, or dealing with weird libraries which spawn threads 
inadvertently. That is, this category of problem can be handled in the core, and whenever given the choice 
we should place complexity in the core rather than in the plugin.

I would personally be reluctant to impose this explicit parallelism knowledge on the plugin API, 
and would seek other justifications (performance?) before making plugins aware of what is processed where.

My comments so far have focused on the Element methods
stage()/prepare()/assemble(). Sources also need to be considered,
though.  Some source implementations are CPU intensive.  As far as I
can tell, they might be significantly easier to hand off to a worker
pool, as their API surface is much more limited.
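Since a Source operation with a narrow surface is close to a pure function of its inputs, handing it to a process pool is straightforward. The names below (`fetch_source`, the URL list) are hypothetical stand-ins, not BuildStream API:

```python
# Sketch: dispatching CPU-intensive source work to a worker pool.
from concurrent.futures import ProcessPoolExecutor
import hashlib

def fetch_source(url):
    # Hypothetical stand-in for CPU-heavy source work (e.g.
    # checksumming); being a pure function, it pickles cleanly.
    return url, hashlib.sha256(url.encode()).hexdigest()

if __name__ == "__main__":
    urls = ["https://example.com/a.tar", "https://example.com/b.tar"]
    with ProcessPoolExecutor(max_workers=2) as pool:
        for url, digest in pool.map(fetch_source, urls):
            print(url, digest[:12])
```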

Right, one can safely assume that the Element API is off limits for sources, as a Source is never given 
access to an Element.

Cheers,
    -Tristan
