Re: Stop the train ! Caching build trees is going to be too big

From: Tristan Van Berkom <tristan vanberkom codethink co uk>
To: Phillip Smyth <phillip smyth codethink co uk>
Cc: BuildStream <buildstream-list gnome org>
Subject: Re: Stop the train ! Caching build trees is going to be too big
Date: Sat, 28 Apr 2018 21:54:37 +0900

On Fri, 2018-04-27 at 13:44 +0100, Phillip Smyth wrote:

Hi Tristan,

Just a quick follow up email on some of my thoughts for this problem.

On Fri, Apr 27, 2018 at 06:09:46PM +0900, Tristan Van Berkom wrote:

Hi all.

This is just a quick email to raise a problem early, make sure that we
adjust our expectations, take a pause and fix a big flaw in our plan.

So, last week we identified that it is not going to be realistic to
blindly cache build trees, because VCS data tends to cost a damn lot of
disk space (feel free to substitute "damn lot" with less family
friendly wording for dramatic effect).

For this, we opened this issue to block it:

    https://gitlab.com/BuildStream/buildstream/issues/376

But the buck doesn't stop here, unfortunately.

For example, my workspace directory for WebKit (from a *tarball*, with
no VCS data added), costs me 5.8GB of disk space after a build. This is
only the source code we mean to build, plus the resulting object files.
The object files in the `_build/` subdirectory cost 5.6GB, so the
source code is only a couple of hundred MB.

To put this in perspective: when we started building GNOME against a
Debian sysroot runtime, which cost about 3GB, it was quite annoying
because it takes a *damn long time* to download the base runtime before
we even start building.

Introducing a 5.8GB download for a prebuilt WebKit artifact is just not
gonna fly; we cannot start introducing these downloads into the build
process.

What I propose we do is the following:

  * Split artifact keys in two:

    * The regular artifact remains "${project}/${element}/${key}"

    * The cached build tree is addressable as
      "${project}/${element}/${key}/build"


I believe this is already done, as the build tree cache is currently being stored in a subdirectory of the 
artifact.


Yes I understand this.

I think if the core functionality can be modified to download all subdirs excluding the build tree cache,
this issue can be avoided.


Anything is essentially possible, this is software after all :)

But, it is a simpler API contract (and probably also a simpler
implementation) to just address what we want to download separately,
than to ask that Artifact Cache implementations support fancy semantics
to allow partial extraction and partial downloads for addressable
blobs.
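
To make the "address it separately" idea concrete, here is a minimal
sketch. The function names and the `want_build_tree` parameter are
hypothetical and are not part of BuildStream; only the two ref layouts
come from the proposal above.

```python
def artifact_ref(project: str, element: str, key: str) -> str:
    """Ref of the regular artifact: ${project}/${element}/${key}."""
    return f"{project}/{element}/{key}"


def build_tree_ref(project: str, element: str, key: str) -> str:
    """Ref of the cached build tree, addressed separately under /build."""
    return artifact_ref(project, element, key) + "/build"


def refs_to_pull(project: str, element: str, key: str,
                 want_build_tree: bool = False) -> list:
    """List the refs a client would ask the artifact share for.

    The build tree is a separate ref, so leaving it out needs no
    partial-download support from the cache implementation.
    """
    refs = [artifact_ref(project, element, key)]
    if want_build_tree:
        refs.append(build_tree_ref(project, element, key))
    return refs
```

With this shape, the default pull only ever names the regular
artifact, and the build tree is one extra ref to request when wanted.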

    * Alternatively, we split the artifact into metadata, logs,
      output and build components, this remains to be discussed
      and analyzed.

  * Uploading of the build tree to artifact shares remains mandatory

    * We should ensure integrity of artifact share servers

    * In the usual cases, regular users do not contribute to artifact
      shares anyway, automated build servers do this part

  * Downloading the build tree of an artifact must only ever be done
    on demand


Could we add a "--cached-build-tree" flag to bst build?
If the flag is false, it builds using its sources,
and if it is true, it downloads the cached build tree and uses that instead.


    * We could have an option to force download all the sources if
      we expect to need them later for offline work, but this is
      not mandatory in order to land the feature I think


I'm not certain about what you mean by this.
Are you suggesting an option to download the entire cache to a location on your local machine?


No, I'm essentially suggesting something similar to what you are
suggesting above: a flag to inform `bst build` and `bst pull` that we
definitely want to download the build trees along with the rest of the
artifact contents, for any artifacts which we do download.

The reason why I add:

    "...if we expect to need them later for offline work"

Is because:

  * You only need the cached build trees for certain activities, like
    say if you want an enhanced `bst shell` experience where all the
    object files are in context.

  * You might not know in advance if you're going to need to run
    `bst shell`, or open a workspace and desire an incremental build.

  * If you have an option to download these at `bst build` time, and
    also at `bst pull` time, then you can have these build trees in
    advance, so they will be available when working offline.


The above option, however, I feel is more of an enhancement for
anticipating offline work. It does not remove the need to make the
option to download these available to the commands which actually make
use of them.

In other words:

  * `bst workspace open` should let the user decide to download the
    cached build trees at the time of opening the workspace, if the
    cached build tree is not already available locally.

  * Similar options should be made available to any commands for which
    the UX is improved by cached build trees.
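
As a rough sketch of the on-demand rule described above (the command
groupings, the function name and its parameters are all assumptions
made for illustration, not existing BuildStream code):

```python
# Hypothetical groupings, named here only for illustration.
# Commands that may fetch build trees ahead of offline work:
PREFETCH_COMMANDS = {"build", "pull"}
# Commands that actually make use of a cached build tree:
CONSUMER_COMMANDS = {"workspace open", "shell"}


def should_fetch_build_tree(command: str, *, requested: bool,
                            cached_locally: bool) -> bool:
    """Decide whether to download the cached build tree of one artifact.

    Build trees are never fetched implicitly: the user has to ask,
    either up front (prefetch commands) or when a consuming command
    offers the choice.
    """
    if cached_locally:
        return False  # already available, nothing to download
    if command not in PREFETCH_COMMANDS | CONSUMER_COMMANDS:
        return False  # this command has no use for build trees
    return requested
```

The point of the sketch is only that a plain `bst build` or `bst pull`
without the option never pays for the build tree download.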

Cheers,
    -Tristan

