Re: [BuildStream] Scheduler optimization



I realized I forgot to add numbers to defend this case in the email,
so a few precisions:

When I say a "large number of worker" I mean between 8 and 12.

Here is a benchmark for "bst build essential-stack.bst" from
https://gitlab.com/jennis/debian-stretch-bst:

4 builders: 189s
8 builders: 178s
12 builders: 205s

We clearly need to do something here, as the RE part will include an
infinite number of builders.

Cheers,
Ben

On Thu, Mar 28, 2019 at 9:30 AM Benjamin Schubert
<ben c schubert gmail com> wrote:

Hey everyone,

TLDR: I want to refactor the queues and scheduler from their current
model to a push based model.
I would need feeback on the approach to see if I'm not missing something.
I would also need to know if there is any objections to my plan since
this would be a
significant refactoring effort.


I'll explain why and how in the rest of the email

---

I've been looking at the scheduler and the queues and from what I can see in
some benchmarks and profiles, they are becoming an important bottleneck,
especially with a large number of workers.

A few issues have been raised over time about that:
- https://gitlab.com/BuildStream/buildstream/issues/824
- https://gitlab.com/BuildStream/buildstream/issues/943

Currently, the queues go over every element until they find enough elements to
fill up all resources. While this works fine with few workers, since the
probability of finding a job that is ready early in the queue is quite high,
the performance degrades when more workers are added, as the probability to have
many jobs ready early decreases.

The problem becomes even worse when we are using remote execution with many
workers.

I therefore think we should be pushing items in the queue as soon as they are
needed.

This is in details what I think should be done:

1) Don't store a list of all elements in each queue.
    - Currently, every queue contains everything, and we need to go over it
      everytime.

2) Remove Queue.status() method, allowing to check if an element is ready
    - Currently, this method is called on every element of the queue,
in sequence,
      until the queue has enough elements to use all the resources.

3) When setting an element for tracking, put it in the `track` queue.

4) When calling "set_required" on element, put the element in the fetch_queue.
    - Currently, what happens is that the element is already in the
queue, and when
      "set_required" is called, a boolean is toggled, making the
status of the element
      go from WAITING to READY, and allows the queue to pick it.

5) When an element is put in the build queue, if all its dependencies
are not ready,
add the element to a list in every dependency that is not ready. Whenever an
element becomes done in this queue, go over this list and for every ready
element put it in the queue.
    - Currently, what happens is that the element is already in the
queue, and whenever
      it's status is checked, it will check if all its dependencies
are available before
      saying that it is ready.


All other queues should just get the elements from the previous queue pushed
to them when ready, as they don't have dependencies on other elements.

One concern that was raised in the past was that we would lose the ordering
of the build and that would decrease performance.

We could prevent this by using a priority queue and sort elements when adding
them. We would therefore still keep the stable ordering we have now.


Does this plan seem sensible? Am I missing something? Any feedback is welcome!
I plan to start working on this next week unless anyone has any
objections to the
plan.

Cheers,
Benjamin


[Date Prev][Date Next]   [Thread Prev][Thread Next]   [Thread Index] [Date Index] [Author Index]