Re: [BuildStream] Proposal: Moving UI into a subprocess



TLDR:

  * Sure, I agree we can do this; Benjamin has raised some more data
    which appears to justify it.

  * This is going to be a lot of work and changes the core very dramatically
    no matter how you slice it; please don't underestimate this.

  * Scheduler.run() needs to be called *multiple times* in the same session,
    it's already a bug that we are tracking/fetching/pulling things outside
    of the regular `Scheduler.run()` harness, as pointed out before here[0]

    Important to consider this because state needs to be perfectly consistent
    in *between* multiple calls to `Scheduler.run()`

  * Aside from the above, this message presents a few abstract but incomplete
    ideas of directions we could take to go about this.


Hi again,

On Fri, 2019-05-17 at 17:15 +0100, Phil Dawson wrote:
Thanks both for the responses. I've replied inline.

[...]
When running BuildStream, the main process uses 100% of a processor. There
is therefore a real bottleneck here. The scheduler shouldn't need that much.
I'd be much happier if the scheduler could use 100% of its own process, having
the UI updated separately.

I also think that when adding Remote Execution in the picture, with many more
processes to orchestrate, we will still have this bottleneck, even with all
the other optimisations.


Thanks Benjamin for providing this extra detail.

Tristan, is this enough to convince you that having the UI in a separate 
process to the scheduler is worth pursuing (implementation details aside)?

I already replied with some hints on how we can improve logging
performance here[0]; I'm certain we can do better without resorting to
splitting up roles and responsibilities into multiple processes.


Thanks for those; I indeed think they should be implemented. However, I
don't think this is enough.

+1 here.

So:

  * I did suspect that the UI updates are too heavy already; while I
    don't usually notice any slowdown in my regular usage, I can tell
    that it is a lot (under heavy load I see a lot of flicker in the
    rendering).

  * I wasn't really sure that we would *still* need this separation
    after lessening the work we do in the main process as is.

    It seems Benjamin has pointed out that even when the UI updates
    every second, it is somehow taking 100% of a single CPU. On any
    fairly modern machine this is certainly a problem.

    The fact that this is under `bash.exe` on windows in particular,
    and that running `bst ... 2>&1 | tee` is making things run
    noticeably faster, leads me to believe that bash.exe is having
    a very hard time processing the ANSI escape sequences (which
    are all removed by piping the output and disconnecting bst from
    the terminal).

    Still, even if this is the case, of course I agree we can improve
    the performance of BuildStream when running in suboptimal environments.

So yes, sure, I have no objection to doing this; I mostly wanted to make
clear that "logging in the main process is taking 50% of the main
process" is not an observation which is indicative of a problem, and we
should not be basing large refactors on observations which by
themselves do not indicate actual bottlenecks.

[...]
<snip>
     One thing to point out is that no matter how we slice things, we are
     still going to have the issue of synchronizing element state across
     processes, as state mutates over the course of a session, and the
     logger needs to display state.

I would hope that the worker process could eventually give all the necessary
information to the frontend process; that would remove any problem arising from
this and would remove lots of state synchronisation.

As far as state synchronisation goes, as I understand it, the 
following bits of state are needed by the front-end.

1) The element lookup table. This will be populated after load time and 
won't change during a scheduler run, so I don't think we'd need any 
synchronisation here.

2) Element names. Again, these are available after load time and won't 
change during a scheduler run, so I don't think we'd need to communicate 
this back to the main process.

3) Element cache keys. I'm not sure whether the cache keys as displayed by 
the front-end can change during a scheduler run; I feel like they 
probably do, but I'm not sure.

4) The status of the scheduler queues to be displayed on the status bar.

5) Cache usage.

Am I missing anything here?

There are a few ways to fry this cat...

First of all, I'm not particularly fond of enumerating one by one the
various things which need to be synchronized with the frontend
process for the sole purpose of logging - this indicates that we think
that the frontend will never be enhanced and will only ever need to
know what we are telling it right now.

Another thing which I think is extremely important to consider, is that
all side effects on the data model caused by a single invocation of
`Scheduler.run()` need to be observable after `Scheduler.run()`
completes, at which point the calling code can observe the side effects
before making further calls to `Scheduler.run()` on the same data
model.

As I pointed out (in coincidentally the same thread) here[0], the
`Scheduler.run()` function was never intended to be something that is
only ever called once in a single invocation of BuildStream, and I
really consider it to be a bug that we are not calling
`Scheduler.run()` for *all* tasks which we offload to elements (like
fetching of junctions, fetching or tracking of sources as a side effect
of opening workspaces, or pulling of build trees as a side effect of
opening a bst shell, *everything* should be passing through
`Scheduler.run()`).

Note that I am not being specific about *which* process this needs to
happen in, but certainly, multiple calls to `Scheduler.run()`, along
with any code in between, need to have a reliable data model which
represents exactly what happened as a side effect of processing any
jobs on that data model.
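
To illustrate the kind of calling pattern I mean, here is a very rough
sketch; the queue names are used loosely, the queue construction is
elided, and this is not the exact Stream or Scheduler API:

    # Simplified illustration only; not the exact Stream/Scheduler API
    def open_workspace(stream, element):

        # Tracking and fetching happen as a regular scheduled run,
        # not as an ad-hoc code path outside of the scheduler
        stream._scheduler.run([TrackQueue(...), FetchQueue(...)])

        # When run() returns, the data model must already reflect what
        # the jobs did, so that code in between runs can observe the
        # new state (resolved cache keys, fetched sources, etc) ...

        # ... and a subsequent run() on the same data model is safe
        stream._scheduler.run([BuildQueue(...)])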


One solution could be:

  * Make sure that *all state* is always mirrored into the frontend
    process, and every call to `Scheduler.run()` spawns a new
    subprocess in order to run the scheduler.

  * Ensure that when any given call to `Scheduler.run()` completes,
    we process all incoming state update messages in the frontend at
    the end of that subprocess and don't lose any state (ensuring
    that we have a consistent data model in between calls to
    `Scheduler.run()`); a rough sketch of this follows below.
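
A very rough sketch of what that could look like, assuming a
hypothetical apply_update() method on a frontend side data model
object; how the scheduler is actually wired up to push its state
updates onto the queue is glossed over here:

    import multiprocessing

    def _scheduler_process(updates, scheduler, queues):
        # Runs in the child process; state updates are expected to be
        # pushed onto `updates` while the scheduler runs, and a None
        # sentinel marks the end of the stream of updates
        scheduler.run(queues)
        updates.put(None)

    def run_scheduler_in_subprocess(scheduler, queues, data_model):
        updates = multiprocessing.Queue()
        proc = multiprocessing.Process(
            target=_scheduler_process, args=(updates, scheduler, queues))
        proc.start()

        # Drain *every* pending update before returning, so that the
        # frontend's data model is consistent in between calls to
        # Scheduler.run()
        while True:
            update = updates.get()
            if update is None:
                break
            data_model.apply_update(update)

        proc.join()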

Another solution could potentially be:

  * Refactor the frontend logging mechanism so that the full context
    of every message is serialized in the Message() object, and the
    frontend never needs to access the data model (perhaps the
    Message() object in this case loses the task_id and unique_id
    members completely, and we provide all fields in every message)

    That would be a pretty hefty rewrite, but would also have the
    advantage of reflecting the state at the time a message was
    sent to the frontend, rather than reflecting the state which
    happens to be in the frontend when the message arrives.

    I.e., one could imagine a scenario where:
    - The messages are printed every second, or twice a second
    - A workspace build completes and state is synchronized
    - The frontend decides to print messages about the workspace build

    Here we might end up seeing the workspace cache key in messages
    before the final message where the cache key was actually resolved.

    If we had snapshotted exactly everything that needs to be printed
    in a frontend message and sent it in the message itself, then we
    wouldn't have this problem (a rough sketch of such a self-contained
    message follows after this list).

  * Run the entire body of any `Stream()` function in a subprocess.

    This would also include loading the project itself.

  * In this approach we don't really have the problem of synchronizing
    state at all, as the frontend never needs any state and only
    performs logging (a rough sketch of this also follows further below).

    Side effects from one call to `Scheduler.run()` remain in the
    same process where subsequent calls to `Scheduler.run()` will
    occur.
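
For the sake of illustration, a fully self-contained message along the
lines of the first point above could look roughly like this; the field
names here are made up, this is not the current Message() API:

    from dataclasses import dataclass
    from typing import Optional

    # Illustrative only; not the current Message() API
    @dataclass(frozen=True)
    class LogMessage:
        message_type: str             # "start", "success", "failure", ...
        text: str
        element_name: Optional[str]   # snapshotted when the message is sent
        element_key: Optional[str]    # cache key as resolved at that moment
        action_name: Optional[str]    # "build", "fetch", "pull", ...
        elapsed: Optional[float]      # seconds since the task started

    # Because element_name and element_key are captured at send time,
    # the frontend can render the message without consulting the data
    # model, and without racing against later state changes.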

While this second approach has some benefits, it also makes interactive
debugging a bit harder, i.e. when a build fails and it is time to shell
into a build sandbox in order to debug what happened - we may need to
reload the data model in the frontend just for the sake of this.

In this case, we would not want to load files from the disk again. In
my experience, my builds run for a very long time, and I am probably
doing work on the same BuildStream project, patching a separate branch,
editing files and such, while a build is ongoing - so I certainly
expect that BuildStream knows what it loaded and is able to present me
with a debug shell of the build I launched 2 hours ago, without
consulting my filesystem to do so.
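
Just to make the shape of the "whole Stream() function in a subprocess"
idea concrete, it could amount to something like the hypothetical
decorator below; logging IPC, error propagation and the interactive
debugging concern above are all left out, and the real Stream() is of
course more involved:

    import functools
    import multiprocessing

    # Hypothetical decorator; everything besides process launching
    # is omitted here
    def in_subprocess(method):
        @functools.wraps(method)
        def wrapper(self, *args, **kwargs):
            proc = multiprocessing.Process(
                target=method, args=(self, *args), kwargs=kwargs)
            proc.start()
            proc.join()   # the frontend only waits and renders messages
            return proc.exitcode
        return wrapper

    class Stream:
        @in_subprocess
        def build(self, targets):
            # Loading the project, resolving state and every call to
            # Scheduler.run() happen in this child process, so side
            # effects remain naturally visible in between runs
            ...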

In *any* scenario:

  * We're going to need a generic place anyway, probably in Stream(),
    where we can launch the subprocess and set up an IPC channel for
    that process, running an asyncio event loop in the frontend for
    the sake of:

    - Handling user events (SIGINT, SIGTERM, SIGTSTP)

    - Listening for incoming logging messages, probably coming from
      various sources (i.e. we can have logging messages coming
      from the scheduling process, or they might also come directly
      from jobs).

  * This harness will need to mirror some of the Scheduler() mechanics,
    i.e. the frontend will need to send explicit IPC messages to the
    subprocess, informing it to:
    - suspend all jobs now
    - resume all jobs now
    - kill all jobs now
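
In very rough terms, and with entirely made up message names and
transport, such a frontend harness could be shaped like this:

    import asyncio
    import signal

    # Entirely hypothetical shape of the frontend harness; the transport
    # (pipes, multiprocessing connections, ...) is abstracted away here
    async def frontend_loop(connection, renderer):
        loop = asyncio.get_running_loop()

        # Translate user events into explicit control messages for the
        # scheduling subprocess
        loop.add_signal_handler(
            signal.SIGINT, lambda: connection.send(("kill-jobs", None)))
        loop.add_signal_handler(
            signal.SIGTERM, lambda: connection.send(("kill-jobs", None)))
        loop.add_signal_handler(
            signal.SIGTSTP, lambda: connection.send(("suspend-jobs", None)))

        # Listen for logging messages, which may come from the scheduling
        # process or directly from jobs, and render them as they arrive
        while True:
            message = await loop.run_in_executor(None, connection.recv)
            if message is None:       # sentinel: the subprocess is done
                break
            renderer.render(message)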

Another solution might be a cross between the two, i.e. I could
envision a possible, rather undesirable approach, where the frontend
spawns a subprocess for every single call to `Scheduler.run()`, and
then *recalculates state* in the frontend process after having
completed a call to `Scheduler.run()` and before observing any side
effects or making any subsequent scheduler runs - this is of course
undesirable because we duplicate work by re-resolving state in the
frontend which was already resolved in a subprocess.

Cheers,
    -Tristan


[0]: https://mail.gnome.org/archives/buildstream-list/2019-April/msg00063.html