[BuildStream] Responsive, but not overly verbose UX - not-in-scheduler

From: Daniel Silverstone <daniel silverstone codethink co uk>
To: buildstream-list gnome org
Subject: [BuildStream] Responsive, but not overly verbose UX - not-in-scheduler
Date: Thu, 18 Apr 2019 08:46:10 +0100

tl;dr
-----

We need to get better at giving feedback to the user when not in the
scheduler.  I propose a combination of a context manager, a lightweight
thread, a little configuration, and some rules around UX messaging, to
ameliorate things.

For context, a user of a fast system with a small project (7000ish elements,
via one junction, already cached in `.bst`) might see (much like today) a
sequence such as:

    [--:--:--] START   Loading elements
    [00:00:02] SUCCESS Loading elements
    [--:--:--] START   Resolving elements
    [00:00:04] SUCCESS Resolving elements
    [--:--:--] START   Resolving cached state
    [00:00:02] SUCCESS Resolving cached state

Where the `Resolving elements` phase provided a progress indicator during
the operation.

A user of a slower system might, for the same project, see something more like:

    [--:--:--] START   Loading elements
    [00:00:02] START   > Checking out junction: junctions/debian-junction.bst
    [00:01:20] SUCCESS > Checking out junction
    [00:01:42] SUCCESS Loading elements
    [--:--:--] START   Resolving elements
    [00:00:31] SUCCESS Resolving elements
    [--:--:--] START   Resolving cached state
    [00:00:47] SUCCESS Resolving cached state

Where `Checking out junction` simply gave a spinner, the rest of the `Loading
elements` phase gave an unsized progress indicator, and each of the resolving
phases gave proper progress indicators with ETAs during the run.

Background and goal
-------------------

The goal of this proposal is to lay out how we might design a pre-scheduler or
post-scheduler UX (basically anything where we're not already using our fancy
UX) to better inform the user and keep them engaged with `bst`.  We have seen
that on larger and larger projects, or with the use of platforms with slower IO
such as WSL, or even Linux on a latent network filesystem, certain operations
can take quite a while (upwards of an hour merely to load the debian-stack
elements, let alone resolve them or show the pipeline, in one extreme case).

Whenever an operation takes longer than a few seconds, even one which some
developers may consider fast, we really ought to be providing information to
the user so that they can be confident that the tool is still running, still
doing work, and hasn't crashed or got stuck in some fashion.

There are a number of examples of times when this would be useful - "Loading
elements" can take 30 to 40 seconds to load the Debian stack, even on a
super-fast computer which has all the IO cached in RAM.  When a junction needs
to be downloaded in order to proceed on element load, it can take even longer.
Then there's "Resolving Elements" or "Resolving cached state".  Building the
pipeline takes an appreciable number of seconds, though we don't even give a
START/SUCCESS pair for that currently.

Once we're into the scheduler, feedback is usually excellent, with a time
ticker, jobs starting, providing status, completing, progress being made, etc.
so I'm not proposing we change any of that yet.  What we want is something
which enhances the experience of users on slower systems (or larger projects)
without being overly verbose/messy for users on fast systems with small
projects.

Rough idea
----------

As a project policy, we require that every operation which could take between
1s and 5s on the slowest of systems **MUST** provide a start and stop message
by means of some kind of context manager such as:

    with LongRunningOperation("Loading Elements"):
        stuff = load_elements()

The internal UX behavioural rules applied for this will be:

1. If the operation takes less than 0.5s then the output will be entirely
   suppressed.
2. If the operation takes more than 0.5s, the start message gets emitted.
3. If the start message was emitted, then the success/failure message will be
   emitted when the context manager is exited, based on whether or not an
   exception is being propagated.
4. If the operation takes longer than 2s, a "spinner" will start to be shown
   which will automatically tick at approximately 4Hz.
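
As a rough illustration of rules 1 to 3 only (the spinner from rule 4 is left
out), here is a minimal sketch.  It assumes a `threading.Timer` is an
acceptable stand-in for the lightweight thread, and the `initial_delay` and
`emit` parameters are hypothetical names invented for this example, not part
of any existing BuildStream API:

```python
import threading
import time


class LongRunningOperation:
    """Sketch only: delayed START message, then SUCCESS or FAILURE on exit."""

    def __init__(self, name, immediate=False, initial_delay=0.5, emit=print):
        self.name = name
        self.immediate = immediate
        self.initial_delay = initial_delay  # hypothetical: rule 1/2 threshold
        self.emit = emit                    # hypothetical: output hook
        self._lock = threading.Lock()
        self._started = False               # has the start message been shown?
        self._finished = False
        self._timer = None

    def _emit_start(self):
        with self._lock:
            # Never emit twice, and never after the operation has ended.
            if self._started or self._finished:
                return
            self._started = True
            self.emit("START   " + self.name)

    def __enter__(self):
        if self.immediate:
            self._emit_start()
        else:
            # Rule 2: the start message appears only once the delay elapses.
            self._timer = threading.Timer(self.initial_delay, self._emit_start)
            self._timer.daemon = True
            self._timer.start()
        return self

    def __exit__(self, exc_type, exc, tb):
        if self._timer is not None:
            self._timer.cancel()
        with self._lock:
            self._finished = True
            started = self._started
        # Rule 3: only report the outcome if the start message was shown.
        if started:
            status = "SUCCESS" if exc_type is None else "FAILURE"
            self.emit("{} {}".format(status, self.name))
        return False  # never swallow the exception


if __name__ == "__main__":
    with LongRunningOperation("Loading elements", initial_delay=0.1):
        time.sleep(0.3)
```

An operation which finishes before `initial_delay` cancels the timer and so
produces no output at all, which is rule 1.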

Optionally, an operation which is considered "important", such as loading
elements, may pass `immediate=True` to the context manager, which will cause
the start message to be displayed immediately, thus ignoring rule 1 and
short-circuiting rule 2 above.

Any operation which can usefully provide feedback will be able to do so by
means of methods on the context manager, such as:

    with LongRunningOperation("Resolving Elements", length=len(elements)) as ticker:
        for element in elements:
            element.resolve()
            ticker.tick()

Here, the rules above still apply, but rather than a pure "spinner" at 4Hz, a
progress indicator (with spinner) will be displayed.  The UI update frequency
will still be clamped at 4Hz, however, to prevent the UI cost becoming too
high.
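
One way the clamping might look, sketched under the assumption that the
redraw decision is made inside `tick()` itself rather than by a separate
thread.  The `clock` and `render` parameters are hypothetical and exist only
so the example can be driven deterministically:

```python
import time


class Ticker:
    """Sketch only: counts every tick but redraws at most `rate_hz` times/s."""

    def __init__(self, length, rate_hz=4.0, clock=time.monotonic, render=print):
        self.length = length
        self.done = 0
        self._min_interval = 1.0 / rate_hz
        self._clock = clock      # hypothetical: injectable time source
        self._render = render    # hypothetical: draws one progress update
        self._last_draw = None

    def tick(self, count=1):
        self.done += count
        now = self._clock()
        final = self.done >= self.length
        due = (self._last_draw is None
               or now - self._last_draw >= self._min_interval)
        # Always draw the first and final states; otherwise obey the clamp.
        if final or due:
            self._last_draw = now
            self._render("{}/{}".format(self.done, self.length))
```

With this shape, a loop which ticks thousands of times per second still only
pays for about four redraws per second.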

A ticker which has a known length will display an estimate of how long it might
take to complete the operation given how many elements are left and how many
have been completed since the operation started.  Because the estimate
extrapolates from the average time per tick so far, it's important that each
tick represents roughly the same amount of work.

The ETA estimate can be suppressed by passing `eta=False`.
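
The estimate could be as simple as a linear extrapolation.  The helper below
is a hypothetical sketch of that arithmetic (the function name and signature
are made up for illustration), assuming every tick costs about the same:

```python
def estimate_eta(elapsed, completed, total):
    """Return the estimated seconds remaining, or None if nothing is done yet.

    Sketch only: assumes each item takes roughly the same time, so the
    average time per completed item is applied to the remaining items.
    """
    if completed <= 0:
        return None  # no data to extrapolate from
    remaining = max(total - completed, 0)
    return elapsed * remaining / completed
```

For instance, 25 of 100 elements resolved in 10s gives 0.4s per element, and
so 30s for the remaining 75.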

Next, we need to consider long running operations which can make reportable
progress but perhaps can't pre-indicate how long they'll take, such as loading
of elements in the first place...

    with LongRunningOperation("Loading elements", immediate=True) as ticker:
        # `elements_being_loaded` stands in for whatever loop the loader uses
        for thiselement in elements_being_loaded:
            ticker.need_to_complete(thiselement.deps)
            ticker.completed(thiselement)

In this context, `need_to_complete()` takes an iterable whose values will be
added to the set of things which must be completed before the ticker is
complete.  `completed()` takes a single value, adds it to that set if it is
not already present, and also adds it to the set of things which have been
completed.  This allows for a progress indicator which might start at `0/1`,
proceed to `1/8`, then `6/45`, etc., until eventually settling on
`74325/74325`.
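
To make the set semantics concrete, here is a small sketch.  The class name
`SetTicker` and the `progress()` method are hypothetical; only
`need_to_complete()` and `completed()` come from the description above:

```python
class SetTicker:
    """Sketch only: progress tracking where the total grows as work is found."""

    def __init__(self):
        self._needed = set()   # everything known to require completion
        self._done = set()     # everything completed so far

    def need_to_complete(self, items):
        # Items already known are ignored by the set union.
        self._needed.update(items)

    def completed(self, item):
        # An item can complete without having been announced first.
        self._needed.add(item)
        self._done.add(item)

    def progress(self):
        """Return (completed, total known) for display as e.g. `6/45`."""
        return len(self._done), len(self._needed)
```

Items must be hashable for this to work, which is one reason plain counters
might be preferable where they are sufficient.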

This is the most problematic of ticker kinds, and what the exact API needs to
be to make this possible may vary.  It'd be better if we could just store
counters in the context manager but element loading is a good example of where
sets might be more useful.  More thought is needed here.

Finally we need to consider how nesting will work.  If we come back to the
loading of elements as our "most complex" example, we need to consider how to
nest long running operations because loading elements can cause junctions to be
staged, which can cause them to be fetched, resulting in a fetching/staging
orgy of time consumption which would otherwise block the nice smooth counting
up of the ticker.  So...

1. If a long running operation is nested within another, the start-message
   output rules apply upwards.  i.e. once a nested operation outputs its
   start message (even if immediate=True) then outer operations must output
   theirs first.
2. If a long running operation is nested within another which is currently
   ticking, the ticker is promoted to the inner operation, being removed
   from the outer one when the start message is output for the inner one.
3. Once an inner operation terminates, the ticker (if any) returns to the outer
   one.
4. If an outer operation reaches a time threshold while an inner operation has
   not, only the outer operation's messages are emitted and the inner
   operation stays silent.  For example: an operation starts, an inner
   operation starts 0.4s later, 0.1s passes, the outer operation's start
   message is emitted, the inner operation terminates 0.3s later without
   writing any messages, and then the outer operation terminates, emitting
   its finished message.
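
Rule 1 is perhaps the least obvious of these, so here is a sketch of just
that rule.  The `Operation` class and `emit_start()` function are
hypothetical, and the `> ` prefix for nested messages is borrowed from the
example output earlier in this mail:

```python
class Operation:
    """Sketch only: one node in a chain of nested long running operations."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.started = False   # has the start message been emitted?

    def depth(self):
        node, depth = self.parent, 0
        while node is not None:
            node, depth = node.parent, depth + 1
        return depth


def emit_start(op, emit=print):
    """Emit the start message for `op`, outermost unstarted ancestor first."""
    chain = []
    node = op
    # Walk outwards until reaching an operation which has already started.
    while node is not None and not node.started:
        chain.append(node)
        node = node.parent
    for pending in reversed(chain):
        pending.started = True
        emit("START   " + "> " * pending.depth() + pending.name)
```

Calling `emit_start()` on an inner operation whose parent has been silent so
far prints the parent's start message first, then the inner one.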

Configuration
-------------

In order to make this useful for people, we need some level of configurability
which can control the behaviour of the context manager.  I'd suggest:

1. A boolean to suppress the tickers, even on a TTY
   (defaulting to not suppress)
2. A boolean to suppress the delay (essentially always immediate-mode)
   (defaulting to not suppress)
3. Settings for the initial delay, ticker delay, and ticker rate.
   (defaulting to 0.5s, 2s, 4Hz respectively)
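
Internally these might be carried around as a small settings object.  The
sketch below is purely illustrative: every field name is hypothetical and
nothing here reflects an existing BuildStream configuration key.  Only the
default values come from the list above:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProgressSettings:
    """Sketch only: the three proposed settings with their defaults."""

    suppress_tickers: bool = False   # item 1: no tickers, even on a TTY
    suppress_delay: bool = False     # item 2: always immediate-mode
    initial_delay: float = 0.5       # item 3: seconds before the start message
    ticker_delay: float = 2.0        # item 3: seconds before the spinner
    ticker_rate_hz: float = 4.0      # item 3: spinner/progress update rate

    def effective_initial_delay(self):
        # Suppressing the delay makes every operation behave as immediate.
        return 0.0 if self.suppress_delay else self.initial_delay
```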


Summary
-------

The overall aim is to produce a UX which on the face of it looks and feels just
like the current one when things are proceeding quickly, and which provides
feedback to the user just as soon as things don't appear to be proceeding
quickly enough.  The critical point is that we should never, at any point,
leave the user wondering if the tool has crashed / locked-up.

-- 
Daniel Silverstone                          https://www.codethink.co.uk/
Solutions Architect               GPG 4096/R Key Id: 3CCE BABE 206C 3B69