[BuildStream] Plans for workspaces and incremental builds



Recently I've been thinking about workspaces and how they currently work
versus how they should work in the future. One of the main goals is to facilitate remote execution (RE) builds of workspaced sources in addition to local build support.
I've had some initial thoughts about this.

In order to support RE, workspaces will be staged via the sourcecache. This
fundamentally changes the nature of workspaces compared with their current
implementation, so test expectations should be revisited: a scheduled process
no longer affects the workspace directory on the local filesystem (wsdir).
(This change was committed in !1563[1].) In this context a process is anything
encapsulating a rule-based change (such as a build).

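Writing `f` for the process applied to a workspace state `x`, its output `x'`
is the build tree `T_x`:

`f(x) = x' = T_x`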

Consequently, the post-process wsdir key is identical to the pre-process wsdir
key, and the concept of key stability can be removed: workspace (WS) keys do
not need to be reset or recalculated after a process, and meaningful keys are
obtained at staging.

In order to support incremental builds it will be necessary to have a mechanism to produce the difference of source trees (`h(x,y) = d`) and apply a difference
(`h^-1(x,d) = y`). It will also be necessary to track a previous
state of the workspace.
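
To make the `h` / `h^-1` notation concrete, here is a minimal sketch. It assumes
a tree can be modelled as a flat mapping from relative path to content digest;
that model, and the names `Delta`, `diff_trees` and `apply_delta`, are
hypothetical and are not existing BuildStream API.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet

# Assumption: a tree is a flat mapping of relative path -> content digest.
Tree = Dict[str, str]


@dataclass(frozen=True)
class Delta:
    """The difference d between two trees (hypothetical representation)."""
    changed: Dict[str, str] = field(default_factory=dict)  # added or modified
    removed: FrozenSet[str] = frozenset()                   # deleted paths


def diff_trees(x: Tree, y: Tree) -> Delta:
    """h(x, y) = d: what must change to turn x into y."""
    changed = {path: digest for path, digest in y.items() if x.get(path) != digest}
    removed = frozenset(path for path in x if path not in y)
    return Delta(changed, removed)


def apply_delta(base: Tree, delta: Delta) -> Tree:
    """h^-1(base, d): drop removed paths, then overlay changed entries."""
    result = {p: d for p, d in base.items() if p not in delta.removed}
    result.update(delta.changed)
    return result
```

With this model, `apply_delta(x, diff_trees(x, y)) == y` holds for any two
trees, and the same delta can be applied to a tree that carries extra entries
(such as build outputs) without disturbing them.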

Currently only successful builds are tracked in the workspace (via the
persisted workspace metadata), but I think this must change to track the last
WS key regardless of the success of the process. Assuming the previous digest
is stored, the associated build tree is recoverable via the cache.
The scheme for incremental builds could then be expressed as:

1. Given current workspace state `y`, and stored input state `x => T_x`
2. Verify that `h^-1(x, T_x) == T_x`. If this verification fails, then the
incremental build cannot continue and we should fall back to `f(y) = T_y`
3. Compute the delta between `x` and `y`: `h(x,y) = d`
4. Apply that delta to the previous build's output: `h^-1(T_x, d) => y'`
5. Apply the process to that new input state: `f(y') = T_y'`

Assuming that `f()` represents a sane build system, we can expect that applying
`f()` to `y'` will produce a build tree functionally equivalent, if not
identical, to `f(y)` (`T_y = T_y'`). The verification in step 2 may fail if,
for example, a build system chooses to remove one of its inputs as part of
the build process.
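
The scheme above can be sketched end to end as follows. This is only an
illustration under assumptions: trees are modelled as path -> digest mappings,
step 2 is read as "overlaying `x` onto `T_x` changes nothing" (every input is
still present and unmodified in the previous build tree), and `toy_build`
stands in for `f()`. All names here are hypothetical.

```python
from typing import Callable, Dict, Optional, Set, Tuple

Tree = Dict[str, str]  # assumption: relative path -> content digest


def diff_trees(x: Tree, y: Tree) -> Tuple[Tree, Set[str]]:
    """h(x, y) = d: entries added or modified in y, and paths removed from x."""
    changed = {p: d for p, d in y.items() if x.get(p) != d}
    removed = {p for p in x if p not in y}
    return changed, removed


def apply_delta(base: Tree, changed: Tree, removed: Set[str]) -> Tree:
    """h^-1(base, d): drop removed paths, then overlay changed entries."""
    result = {p: d for p, d in base.items() if p not in removed}
    result.update(changed)
    return result


def inputs_intact(x: Tree, t_x: Tree) -> bool:
    """Step 2 (assumed reading): every input in x survives unmodified in T_x."""
    return all(t_x.get(p) == d for p, d in x.items())


def incremental_build(
    y: Tree,
    x: Optional[Tree],
    t_x: Optional[Tree],
    f: Callable[[Tree], Tree],
) -> Tuple[Tree, bool]:
    """Return (build tree, whether the incremental path was taken)."""
    # Steps 1 and 2: without a usable previous state, fall back to f(y).
    if x is None or t_x is None or not inputs_intact(x, t_x):
        return f(y), False
    # Step 3: h(x, y) = d
    changed, removed = diff_trees(x, y)
    # Step 4: h^-1(T_x, d) => y'
    y_prime = apply_delta(t_x, changed, removed)
    # Step 5: f(y') = T_y'
    return f(y_prime), True


def toy_build(tree: Tree) -> Tree:
    """Stand-in for f(): 'compile' every .c entry into a .o entry."""
    out = dict(tree)
    for path, digest in tree.items():
        if path.endswith(".c"):
            out[path[:-2] + ".o"] = "obj:" + digest
    return out
```

Note that in this toy model a source file removed between `x` and `y` leaves
its stale object behind in `T_y'`, which is one way the result can be
functionally equivalent without being identical to `T_y`.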

In addition to storing the source digest of the previous wsdir for each
process, it will be useful to store the dependency hash and the artifact ref
(necessary for application of the source difference). If the dependency hash
changes between processes then a complete build will be required rather than
an incremental build.
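
As a rough sketch of that decision, assuming the three stored values are kept
together in one record per workspace: the `WorkspaceState` type, the
`choose_build_mode` function and the `artifact_cached` flag are hypothetical
names for illustration, and the extra check that the previous artifact is
still in the cache is my assumption rather than something settled above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class WorkspaceState:
    """State recorded after each process, successful or not (hypothetical)."""
    source_digest: str    # digest of the wsdir sources that were staged
    dependency_hash: str  # hash over the element's dependencies
    artifact_ref: str     # ref used to recover the previous build tree


def choose_build_mode(
    previous: Optional[WorkspaceState],
    current_dependency_hash: str,
    artifact_cached: bool,
) -> str:
    """Return "incremental" or "full" for the next process."""
    if previous is None:
        return "full"  # nothing recorded yet
    if previous.dependency_hash != current_dependency_hash:
        return "full"  # dependencies changed: a complete build is required
    if not artifact_cached:
        return "full"  # assumption: previous build tree must be recoverable
    return "incremental"
```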

I would like to get the opinions of the list on this before moving further
ahead. There is a development branch removing the concept of cache key
stability and key recalculation[2], which currently seems to fail only
`tests/integration/shell.py::test_workspace_visible`. In summary:

* remove unstable cache key concept
* do not reset or recalculate workspace cache keys
* store source digest, dependency hash, and artifact ref for workspaces
* introduce mechanism to diff and apply trees
* add logic to decide whether to continue or abort incremental builds

[1] https://gitlab.com/BuildStream/buildstream/merge_requests/1563
[2] https://gitlab.com/BuildStream/buildstream/tree/traveltissues/benchmark-3

Best Regards,
Darius

