Re: [BuildStream] Proposal: Moving UI into a subprocess



TLDR;

  o Let's determine that this is actually a problem; spending 50% of the main
    process's time in the UI is not proof that there is a bottleneck (I don't
    doubt that we are currently inefficient, but let's improve our reporting
    here).

  o Let's try less drastic measures first. We are in the process of optimizing
    state resolution, and there is a lot we can do to reduce the cost of
    logging too (details below).

  o If we really get to a place where we need to split responsibilities into
    separate processes, then please keep the frontend in the frontend. I think
    it makes a lot more sense to have each call to `Scheduler.run()` do its
    work in a subprocess.


Hi,

So I'm going to take a step back here and reply to this in a few sections.


Let's determine if this is a problem
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If you are raising this, I presume it may well already be a problem,
but can we please have some details?

> [...]
> In Daniel's latest batch of profile analysis [1], he noted that UI
> rendering was consuming around 50% of the straight-line CPU time of the
> parent `bst` process. As such, it may make sense to delegate the UI
> rendering to a subprocess to avoid blocking the scheduling of useful
> work while rendering UI events.

I think this is a non sequitur; it does not necessarily follow that:

 * If the main process is spending half of its time in logging...
 * It is a problem and we should fix it.

We need to know that this is in fact a problem before going to such
extremes as proposed in this mail.


Spending half of the main process's time in logging doesn't mean that
we are bottlenecking on logging; it only means that logging accounts
for half of the work being done.

Consider that we are currently improving how cache keys get calculated:
there is a plan in place to push resolved keys onto the reverse
dependencies so that cache key calculation becomes cumulative, with the
goal of eliminating the need to ever call Element.dependencies() in the
cache key resolution algorithm, making it close to linear.
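
To illustrate the idea, here is a minimal sketch of cumulative key
resolution; the class and attribute names (reverse_deps, dep_keys, etc.)
are purely illustrative and not the real BuildStream API:

    import hashlib

    class Element():

        def __init__(self, name):
            self.name = name
            self.cache_key = None
            self.reverse_deps = []    # Elements which depend on this one
            self.pending_deps = 0     # Direct dependencies not yet resolved
            self.dep_keys = []        # Keys pushed up from resolved dependencies

        def depend_on(self, dep):
            dep.reverse_deps.append(self)
            self.pending_deps += 1

        # Resolves this element's key once all of its direct dependency
        # keys have been pushed to it, then pushes the new key onto the
        # reverse dependencies; no dependency traversal is ever needed.
        def maybe_resolve_key(self):
            if self.cache_key is None and self.pending_deps == 0:
                payload = self.name + ":" + ",".join(sorted(self.dep_keys))
                self.cache_key = hashlib.sha256(payload.encode()).hexdigest()
                for rdep in self.reverse_deps:
                    rdep.dep_keys.append(self.cache_key)
                    rdep.pending_deps -= 1
                    rdep.maybe_resolve_key()

Each element and each dependency edge is visited a constant number of
times, so the whole graph resolves in roughly linear time.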

The result of this will be that:

 * The logging done in the main process will then account for more
   than 50% of the work.

 * The logging will be *less* of a bottleneck than it was before, even
   if it constitutes more than half of the work being done, and even
   without improving logging performance.
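
To put some made-up numbers on that: if the main process currently
spends, say, 10 seconds resolving state and 10 seconds logging (50% of
its time in logging), and the state resolution work drops to 2 seconds,
then logging accounts for over 80% of the remaining work while the
total time has fallen from 20 seconds to 12: a larger share, but less
of a bottleneck.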


Let's try some less aggressive measures first
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
I already replied with some hints on how we can improve logging
performance here [0]; I'm certain we can do better without resorting to
splitting up roles and responsibilities into multiple processes.

First, consider: is it really that important that we update the status
bar and spit out log lines at sub-second frequency?

* A session that has a lot of ongoing tasks naturally has a more
  expensive status bar to update (more tasks in the status bar)

* Similarly, in a session that has a lot of ongoing tasks, we get a lot
  of start/stop messages plus a lot of status messages; often many
  messages appear within the same second.

Compounded together, the cost of logging grows multiplicatively with
the number of ongoing tasks: more messages arrive per second, and each
one triggers a more expensive status bar update.

Instead of unconditionally updating the UI at every incoming message,
we can limit the frequency at which we update the terminal whilst
preserving the integrity and ordering of incoming messages.

While the scheduler is running, we already have a ticker which we use
to unconditionally update the status bar; that would be a good time,
IMO, to:

  * Render all of the messages received since the last "tick"
  * Update the status bar only once at the end

This would have the additional benefit of reducing a lot of the
"flicker" we sometimes experience when updating the UI too often while
processing a lot of messages (at such times, we definitely update the
status bar far more often than needed).

One thing to keep in mind with this approach is that we should still
handle error messages immediately upon receiving them, at least when we
know that they will result in pausing all ongoing tasks and launching
an interactive prompt.
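
Roughly, I imagine something like the following; this is only a sketch
with made-up names (Frontend, is_error, the render callables), not a
patch against the real frontend:

    import time

    class Frontend():

        def __init__(self, render_message, render_status, interval=1.0):
            self._render_message = render_message  # Prints one log line
            self._render_status = render_status    # Redraws the status bar
            self._interval = interval              # Minimum seconds between redraws
            self._pending = []                     # Messages since the last tick
            self._last_tick = time.monotonic()

        # Called for every incoming message; errors bypass the timer so
        # that interactive failure handling is never delayed.
        def message(self, msg):
            self._pending.append(msg)
            expired = time.monotonic() - self._last_tick >= self._interval
            if msg.is_error or expired:
                self.flush()

        # Render everything received since the last tick, in order, then
        # update the status bar exactly once.
        def flush(self):
            for msg in self._pending:
                self._render_message(msg)
            self._pending.clear()
            self._render_status()
            self._last_tick = time.monotonic()

With a one second interval this also naturally caps status bar redraws
at roughly one per second, regardless of how many messages arrive.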

I think we should definitely improve this area regardless of any more
extreme measures we might take, as it will improve the UI experience
and reduce overall work anyway; then we should revisit my point above
and reevaluate whether logging is still a bottleneck.


Split up work in multiple processes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If:
  * We have reduced the work in calculating state, allowing the main
    process to breathe easier
  * Reduced the redundant work we do in logging by logging on a fixed
    timer rather than immediately with every incoming message, further
    allowing the main process to breathe easier
  * It is *still* demonstrably a bottleneck

Then we should consider splitting up the work into separate processes.

However, the way I see it, putting logging into a subprocess is just
about the worst way I can imagine of going about this; running
`Scheduler.run()` in a subprocess, separated from frontend activities,
would make a lot more sense.

Let's take an abstract look at what things need doing aside from the
jobs themselves, and where it is appropriate for those things to be
done:

  * User interaction (prompts + SIGINT, SIGTERM, SIGTSTP handling)
  * Logging
  * Resolution of element state (including cache interrogations)
  * Dispatching of jobs
  * Management of local state (files in the .bst/ directory)

It seems obvious that resolving element state and dispatching jobs
should be done in the same process, and that is probably the right
process to be responsible for managing local state too; at the very
least, local state needs to be synchronized there in memory, even if
that process is not the one responsible for synchronizing the state to
the underlying disk.
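
In other words, something shaped roughly like this; Scheduler and
render() are stand-ins here rather than the real APIs, and the details
of how jobs are dispatched inside the child are elided:

    import multiprocessing

    # Runs in the child process: state resolution, job dispatching and
    # local state management all live here, and every log event is
    # shipped back to the parent over a queue instead of being printed.
    def run_scheduler(message_queue, targets):
        scheduler = Scheduler(on_message=message_queue.put)
        scheduler.run(targets)
        message_queue.put(None)          # End-of-session sentinel

    # Runs in the parent process: the frontend remains the only writer
    # to the terminal and the only handler of user interaction.
    def run_session(targets):
        queue = multiprocessing.Queue()
        child = multiprocessing.Process(target=run_scheduler,
                                        args=(queue, targets))
        child.start()
        message = queue.get()
        while message is not None:
            render(message)
            message = queue.get()
        child.join()

Whether the child talks to the parent over a multiprocessing.Queue, a
pipe or something else is an implementation detail; the important part
is that only the parent ever touches the terminal.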

As for logging, it really should be coupled with event handling:

  A.) We currently have a strong guarantee that no more than one
      process will ever write to stderr/stdout, we certainly want to
      keep that guarantee, and ensure that logging is always coherent
      and synchronized (one message is *never* interrupted by another
      message, nothing is ever accidentally truncated).

  B.) When the user hits ^C (SIGINT) or we hit a failed build and enter
      an interactive session, it should be impossible for a logger in
      an orthogonal subthread to be sharing stderr/stdout with the
      frontend thread which is handling user interaction.

      We don't want a user prompt to be rudely interrupted by a stray
      latent message being printed to stderr, and we really shouldn't
      have to deal with the complexity of synchronizing this; these
      activities all belong in the frontend.
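
To illustrate point B, here is how simple this stays when the frontend
owns both the message queue and the prompt; flush(), pause_rendering()
and resume_rendering() are hypothetical helpers, and the prompt text is
just an example:

    # Because the frontend is the only process draining the message
    # queue and the only process writing to the terminal, nothing can
    # interleave with the prompt: we simply stop rendering while it is
    # open.
    def on_failure(frontend, failure_message):
        frontend.flush()              # Show everything up to the failure, in order
        frontend.pause_rendering()    # Buffer latent messages, don't print them
        choice = input("Job failed. (c)ontinue, (q)uit, (r)etry? ")
        frontend.resume_rendering()   # Replay whatever was buffered meanwhile
        return choice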

One thing to point out is that no matter how we slice things, we are
still going to have the issue of synchronizing element state across
processes, as state mutates over the course of a session, and the
logger needs to display state.

One could argue that Messages could be evolved to serialize everything
which needs to be displayed, but I think it is more frictionless to
keep the arrangement we have now, where the logger is given the
element's unique ID and is allowed to query the build graph for any
state it might want to display (frictionless in the sense that the
frontend is free to evolve without code changes in the core).
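
Concretely, the current arrangement looks something like this; the
names (Message, lookup(), the fields printed) are illustrative rather
than the exact core API:

    class Message():

        # The message carries only the element's unique ID plus the
        # text; no display state is serialized into it.
        def __init__(self, element_id, action, text):
            self.element_id = element_id
            self.action = action
            self.text = text

    def render(message, build_graph):
        element = build_graph.lookup(message.element_id)
        # The frontend decides what state to show at render time, so it
        # can start displaying cache keys, workspace state, etc. without
        # any change to Message or to the core.
        print("[{}] {} {}: {}".format(
            element.cache_key[:8], message.action, element.name,
            message.text))

Of course, as noted above, this only works as-is while the frontend
shares memory with the build graph; with a scheduler subprocess, the
state it queries would need to be kept in sync.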

Cheers,
    -Tristan


[0]: https://mail.gnome.org/archives/buildstream-list/2019-April/msg00070.html


