[BuildStream] Proposal: A small number of subprocesses handling jobs
- From: Jonathan Maw <jonathan maw codethink co uk>
- To: buildstream-list gnome org
- Subject: [BuildStream] Proposal: A small number of subprocesses handling jobs
- Date: Fri, 22 Feb 2019 17:04:27 +0000
Hi all.
I've been looking at optimisations, and it seems that a significant amount of time is spent forking new processes (jennis' profile showed 126s in a 576s build).
I propose that we can reduce this by instantiating a small number of subprocesses and having them perform jobs instead.
There is a summary at the bottom if you're not interested in the details.
# Changes in detail
This is my first time looking at the scheduler in detail, so my understanding may be incorrect. In the places where I propose changes, I will outline how I think it works now, and then how I would change it.
## Jobs
### Now
Currently, a Job is a combination of three things: the messaging framework a subprocess uses to send messages to the parent process, the handler of all multiprocessing logic, and the handler for doing the actual work.
The main process runs an event loop for the entire duration that subprocesses are running, and the loop is subscribed to the queue in `Job._parent_start_listening()`.
The child process does not receive any messages from the main process for its entire duration.
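As an illustration of this pattern (a minimal sketch, not BuildStream's actual code): each job today amounts to forking a fresh child that inherits the parent's memory and only ever sends messages upward.

```python
import multiprocessing
import os

def child_action(queue):
    # The child inherits a full copy of the parent's memory at fork
    # time, so it already sees the pipeline state; it only sends
    # messages to the parent, never receives any.
    queue.put(("result", os.getpid()))

def spawn_job():
    queue = multiprocessing.Queue()
    process = multiprocessing.Process(target=child_action, args=(queue,))
    process.start()        # fork() copies the parent's memory here
    message = queue.get()  # the parent's loop listens for child messages
    process.join()
    return message
```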
### What I'd change
I would separate the Job class into a Job that contains the actual work logic, and a WorkerSubprocess that handles subprocess management and messaging.
Beyond this, I'm less certain. Currently there is no way to send a message to a Job; it is created with all the information it needs.
My thought right now is for the Scheduler to create a `multiprocessing.Queue` that every WorkerSubprocess is subscribed to: the Scheduler puts Jobs into the queue, and the WorkerSubprocesses pop jobs from it when they're ready.
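A minimal sketch of that model, with hypothetical `worker_loop` and `run_scheduler` names standing in for whatever the real WorkerSubprocess API becomes:

```python
import multiprocessing

NUM_WORKERS = 4

def worker_loop(job_queue, result_queue):
    # Each long-lived worker pops jobs until it receives the None
    # sentinel, then exits.
    while True:
        job = job_queue.get()
        if job is None:
            break
        result_queue.put(job())  # run the job's work logic

def run_scheduler(jobs):
    job_queue = multiprocessing.Queue()
    result_queue = multiprocessing.Queue()
    workers = [
        multiprocessing.Process(target=worker_loop,
                                args=(job_queue, result_queue))
        for _ in range(NUM_WORKERS)
    ]
    for worker in workers:
        worker.start()
    for job in jobs:
        job_queue.put(job)   # jobs must be picklable to cross the queue
    for _ in workers:
        job_queue.put(None)  # one sentinel per worker
    results = [result_queue.get() for _ in jobs]
    for worker in workers:
        worker.join()
    return results
```

Note that the workers are forked once up front; after that, dispatching a job costs a queue round-trip rather than a fork.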
## Resources
### Now
`buildstream/_scheduler/resources.py` contains the Resources object. This keeps track of how many of each resource can be allocated, and which ones are currently allocated.
`buildstream/_scheduler/queues/queue.py` is responsible for reserving resources and dispatching jobs.
Resources currently supports four kinds of resource:
* CACHE, i.e. whether a job needs to access the artifact cache.
* DOWNLOAD, i.e. whether a job needs to download something.
* PROCESS, i.e. whether a job is processor-intensive.
* UPLOAD, i.e. whether a job needs to upload something.
### What I'd change
I would add a new resource type, SUBPROCESS, which all jobs need to claim. This is a little bit silly, as every kind of job needs a subprocess, but it makes use of a common code path.
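A hedged sketch of the idea, with names only loosely modelled on `buildstream/_scheduler/resources.py` (the real limits and semantics may differ); the SUBPROCESS count caps concurrent jobs at the number of live workers:

```python
class ResourceType:
    CACHE = 1
    DOWNLOAD = 2
    PROCESS = 3
    UPLOAD = 4
    SUBPROCESS = 5  # proposed: one token per live WorkerSubprocess

class Resources:
    def __init__(self, num_builders, num_fetchers, num_pushers, num_workers):
        # In this sketch, a maximum of 0 means "unlimited".
        self._max = {
            ResourceType.CACHE: 0,
            ResourceType.DOWNLOAD: num_fetchers,
            ResourceType.PROCESS: num_builders,
            ResourceType.UPLOAD: num_pushers,
            ResourceType.SUBPROCESS: num_workers,
        }
        self._used = {resource: 0 for resource in self._max}

    def reserve(self, resources):
        # Claim all of the requested resource tokens, or none of them.
        for resource in resources:
            if self._max[resource] and self._used[resource] >= self._max[resource]:
                return False
        for resource in resources:
            self._used[resource] += 1
        return True

    def release(self, resources):
        for resource in resources:
            self._used[resource] -= 1
```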
# The big problem
The big problem with moving to this kind of model is that we need to synchronise the state of the pipeline to the worker subprocesses.
Previously, that happened automatically: creating a new subprocess copies the parent process' current memory, so the state was synchronised at the start of the job, and a job did not need to resynchronise during its lifetime.
This is a problem because there are parts of the pipeline which do not remain static. I don't have a complete list of all the ways a pipeline changes, but I know of:
* An element's ref is not known until its sources have been tracked.
* An element's cache key is not known until its ref and the cache keys of all its dependencies are known.
* An element's public data may be altered at runtime.
Any information passed to an existing subprocess has to be pickled and unpickled, so ideally this would need to be as little as possible.
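To make that concrete, here is a sketch of what a small, picklable state delta might look like; `StateDelta` and its fields are illustrative names, not an existing BuildStream type:

```python
import pickle
from typing import NamedTuple

class StateDelta(NamedTuple):
    element_name: str
    field: str     # e.g. "ref", "cache_key", "public_data"
    value: object

# A small delta like this is what would cross the queue, rather than
# a whole element (the cache key value here is made up).
delta = StateDelta("base/alpine.bst", "cache_key", "8f3a...")
payload = pickle.dumps(delta)
assert pickle.loads(payload) == delta
```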
I am not sure how I would go about providing this information, assuming I can track down and isolate all the parts of the pipeline that change during a build.
The most efficient approach would probably be to have every Job report its exact state changes in its completion callback, give every WorkerSubprocess a queue to read inbound state changes from, and have the Scheduler push state changes to every worker other than the one that sent them.
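A sketch of that broadcast step, where the `WorkerHandle` and `broadcast_state_change` names are assumptions of mine:

```python
import multiprocessing

class WorkerHandle:
    # Scheduler-side handle for one WorkerSubprocess.
    def __init__(self):
        # Each worker reads inbound pipeline state deltas from its
        # own queue.
        self.state_queue = multiprocessing.Queue()

def broadcast_state_change(workers, sender, delta):
    # Forward the (picklable) delta to every worker except the one
    # that produced it, since that worker already applied the change
    # locally before reporting it.
    for worker in workers:
        if worker is not sender:
            worker.state_queue.put(delta)
```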
It would be good to have a prototype of this new model as soon as possible: serialising, deserialising and updating the state is a lot of work within python, whereas forking does this automatically at a low level.
# Summary
In summary:
1. I propose creating a WorkerSubprocess that pulls Jobs from a queue populated by the Scheduler.
2. Keeping the pipeline state synchronised is a Hard problem that I'm not confident I have an answer to.
3. It is very important to test whether this actually saves us time.
--
Jonathan Maw, Software Engineer, Codethink Ltd.
Codethink privacy policy: https://www.codethink.co.uk/privacy.html