Proposal for Remote Execution



Hi all,

This is a proposal to support remote execution in BuildStream. I will first
describe the overall goals and the proposed service architecture, followed
by a plan forward with some details to changes and additions required in the
BuildStream code base. As this touches many areas, I've split the plan into
multiple phases that can be merged one after the other. The initial phases
do not actually enable remote execution yet but bring local execution in
line with what is required for remote execution.


Goals
~~~~~
Remote execution enables BuildStream to run build jobs in a distributed
network instead of on the local machine. This allows massive speedups when
a powerful cluster of servers is available.

Besides speeding up builds, a goal is also to allow running builds on
workers that use a different execution environment, e.g., a different
operating system or ISA.

The goal is not to offload a complete BuildStream session to a remote
system. BuildStream will still run locally and dispatch individual build
jobs for elements with the existing pipeline.


Service Architecture and API
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This proposal builds upon Bazel's Remote Execution API¹. This allows use of
the same server infrastructure for BuildStream, Bazel, and other tools that
support the Remote Execution API.

The API defines a ContentAddressableStorage (CAS) service, an Execution
service, and an ActionCache service. They are all defined as gRPC² APIs
using protocol buffers.

The CAS is similar to an OSTree repository, storing directories as Merkle
trees. However, unlike OSTree, CAS does not include support for refs.
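
To illustrate the content-addressed model (this is just the idea, not the
actual wire format, which uses protocol buffer messages): every blob is
addressed by the hash of its contents, and a directory is itself a blob
that lists its children by digest, so the digest of the root directory
pins the entire tree.

    import hashlib
    import json
    import os

    def blob_digest(data):
        # A blob is addressed purely by its content hash (and size)
        return hashlib.sha256(data).hexdigest(), len(data)

    def directory_digest(path):
        # A directory is serialized as a listing that refers to its
        # children by digest and is then hashed like any other blob.
        # The real API uses protobuf messages rather than JSON.
        entries = []
        for name in sorted(os.listdir(path)):
            full = os.path.join(path, name)
            if os.path.isdir(full):
                entries.append(['dir', name, directory_digest(full)])
            else:
                with open(full, 'rb') as f:
                    entries.append(['file', name, blob_digest(f.read())])
        return blob_digest(json.dumps(entries).encode())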

The Execution service allows clients to execute actions remotely. An action
is described with a command, an input directory, output paths, and platform
requirements. The input directory is supplied as a CAS digest and the result
of an action also refers to CAS digests for output files and logs. This
roughly corresponds to BuildStream's Sandbox object for local execution.
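
As a rough illustration of how an action could be put together from Python
(the message and field layout varies between revisions of the draft API,
and upload_message() is a hypothetical helper, so treat this purely as a
sketch):

    # Hypothetical sketch; remote_execution_pb2 stands for the generated
    # protobuf module of whatever API revision is in use.
    import remote_execution_pb2

    def make_action(cas, input_root_digest):
        command = remote_execution_pb2.Command(
            arguments=['sh', '-c', 'make && make install'],
        )
        # Store the serialized Command in CAS and reference it by digest
        command_digest = upload_message(cas, command)
        return remote_execution_pb2.Action(
            command_digest=command_digest,
            input_root_digest=input_root_digest,
        )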

The ActionCache service is a cache to map action descriptions to action
results to avoid executing the same action again if the result is already
available. As an action description includes the CAS digest of the input
directory, any changes in the input directory will result in a different
cache entry.

The ActionCache service does not suffice as top-level caching service for
BuildStream for the following reasons:
* Incremental builds and non-strict build plans require support for less
  strict cache keys where some changes in the input directory are ignored.
* The build of a single element may involve execution of multiple actions.
  We want to be able to cache the artifact for the overall build job.
* A BuildStream artifact includes metadata generated locally outside the
  sandbox, which means that the output directory from the ActionResult
  does not constitute a complete artifact.
* As the sources are included in the input directory, BuildStream can't
  check the ActionCache service without fetching the sources.

For BuildStream I'm thus proposing to add a small artifact cache service on
top of the Remote Execution API that maps from BuildStream cache keys to CAS
artifact directories. Adding an entry to the BuildStream cache is considered
a privileged action; only trusted clients will be permitted write access.
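
A minimal sketch of what such a service could look like on the server side
(service, method, and message names here are placeholders, not a finished
proto definition): it only needs to map a BuildStream cache key to the CAS
digest of the artifact's root directory, with writes restricted to trusted
clients.

    import grpc

    # artifact_cache_pb2/_grpc would be generated from a small .proto file
    # defining GetArtifact/UpdateArtifact; all names are placeholders.
    import artifact_cache_pb2
    import artifact_cache_pb2_grpc

    class ArtifactCacheService(artifact_cache_pb2_grpc.ArtifactCacheServicer):

        def __init__(self, index):
            self.index = index        # maps cache key -> CAS Digest

        def GetArtifact(self, request, context):
            if request.cache_key not in self.index:
                context.abort(grpc.StatusCode.NOT_FOUND, 'artifact not cached')
            return artifact_cache_pb2.GetArtifactResponse(
                root_digest=self.index[request.cache_key])

        def UpdateArtifact(self, request, context):
            # Adding an entry is privileged; only trusted clients may write
            if not self.is_trusted(context):
                context.abort(grpc.StatusCode.PERMISSION_DENIED,
                              'push not allowed')
            self.index[request.cache_key] = request.root_digest
            return artifact_cache_pb2.UpdateArtifactResponse()

        def is_trusted(self, context):
            # Placeholder: a real implementation would check the client's
            # credentials, e.g., via context.auth_context()
            return False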

The Execution service may still internally use an ActionCache service, of
course. However, BuildStream will not directly use the ActionCache service
API.

¹ https://docs.google.com/document/d/1AaGk7fOPByEvpAbqeXIyE8HX_A3_axxNnvroblTZ_6s/edit
² https://grpc.io/


CAS Artifact Cache
~~~~~~~~~~~~~~~~~~
As a first phase I'm proposing to introduce an artifact cache based on CAS
as defined in the Remote Execution API. This can completely replace the
existing tar and OSTree artifact cache backends with a single cross-platform
implementation.

This will add grpcio³ as a hard dependency and requires code generation for
protocol buffers. To avoid a hard dependency on the code generator
(grpcio-tools), which cannot always be easily installed via pip (a
development environment is required), I'm proposing to import the generated Python
code into the BuildStream repository. We can add a setup.py command to
make it easy to regenerate the code on systems where grpcio-tools is
available.
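
Such a command could be a thin wrapper around grpcio-tools, for example
(the .proto paths and output package below are illustrative):

    from setuptools import Command

    class BuildGRPC(Command):
        """Regenerate the gRPC/protobuf modules from the .proto files"""
        description = 'regenerate gRPC code'
        user_options = []

        def initialize_options(self):
            pass

        def finalize_options(self):
            pass

        def run(self):
            from grpc_tools import protoc
            protoc.main([
                'grpc_tools.protoc',
                '--proto_path=proto',
                '--python_out=buildstream/_protos',
                '--grpc_python_out=buildstream/_protos',
                'proto/remote_execution.proto',
            ])

Registering it via cmdclass in setup() would then let developers run
'./setup.py build_grpc' to refresh the imported code.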

I already pushed a Python implementation of a local-only CAS artifact cache
to a branch a while ago, see WIP merge request !337. I've also prototyped
support for push and pull and a toy server.

To get this to a mergeable state, a complete solution for the server side is
required, i.e., we need a CAS server that projects can install with suitable
instructions without introducing unreasonable dependencies. We also need the
BuildStream artifact cache service described in the previous section,
including support for privileged push.

Projects will be required to migrate their artifact cache servers from
OSTree to CAS.

I'm not planning on supporting anything like OSTree's summary file, which
is a list of all available refs and the corresponding checksums. This means
that BuildStream will no longer check at the beginning of a session which
artifacts are downloadable and we can no longer skip build dependencies of
artifacts that are in the remote cache. Such checks will instead happen as
part of the pipeline. The reasons for the change are as follows:
* With artifact expiry, the artifact might no longer be available when
  we actually want to pull.
* Conversely, the artifact may become available on the remote server after
  the session has already started, see also #179.
* The OSTree summary file doesn't scale. The server has to rewrite a
  potentially huge file in a cron job and the client always has to download
  the whole file.
* We don't always know the cache keys at the beginning of the session,
  e.g., in non-strict mode or when tracking, so we need support for dynamic
  checks in the pipeline anyway.

The Remote Execution API is not stable yet. However, in this first phase we
control both client and server, and we can support multiple protocol versions
in the server at the same time during transition periods.

This phase on its own does not enable remote execution. However, it still
provides benefits for local execution:
* On non-Linux systems we no longer need the tar artifact cache, which is
  much slower than OSTree/CAS and doesn't support remote caches.
* Support for LRU cache expiration is possible on the server as we can
  track artifact downloads.

This will also drop OSTree as a hard dependency on Linux. However, it will
remain an optional dependency for projects using the OSTree source plugin.

³ https://pypi.python.org/pypi/grpcio


Virtual File System API
~~~~~~~~~~~~~~~~~~~~~~~
As a second phase I'm proposing to introduce a virtual file system API that
both BuildStream core and element plugins can use in place of path-based
file system operations provided by the OS. The goal is to allow transparent
and efficient manipulation of Merkle trees.

The API consists of an abstract Directory class that supports creating
directories, copying/renaming/moving files and whole directory trees, and
importing files from a local path. See also the existing `utils` functions
`link_files` and `copy_files`; equivalent operations need to be supported.
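
A rough sketch of the shape of this interface (method names are
illustrative and open for bikeshedding):

    from abc import ABC, abstractmethod

    class Directory(ABC):
        """Abstract handle to a directory tree, independent of how and
        where the tree is actually stored."""

        @abstractmethod
        def descend(self, subdirectory, create=False):
            """Return a Directory object for a subdirectory, optionally
            creating it if it does not exist."""

        @abstractmethod
        def import_files(self, external_source):
            """Import files from a local path or from another Directory,
            covering what utils.copy_files()/link_files() do today."""

        @abstractmethod
        def export_files(self, to_path):
            """Write the contents of this directory out to a local path."""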

Proposed steps with additional details:
* Create abstract Directory class.
* Implement regular OS file system backend.
* Add get_virtual_directory() to Sandbox class returning a Directory object,
  still using regular OS file system backend for now.
* Add a boolean class variable BST_VIRTUAL_DIRECTORY for Element plugins to
  indicate that they only use get_virtual_directory() instead of
  get_directory(). Add an error to the Sandbox class if get_directory() is
  used even though BST_VIRTUAL_DIRECTORY is set (see the plugin sketch after
  this list).
* Port users of Sandbox.get_directory() to get_virtual_directory():
  - Element class
  - ScriptElement class
  - Compose plugin
  - Import plugin
  - Stack plugin
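
To make the plugin-facing side concrete, a ported element would look
roughly like this (the Directory methods follow the illustrative sketch
above, and the other mandatory plugin methods are elided):

    from buildstream import Element

    class ExampleElement(Element):

        # Declare that this plugin only uses get_virtual_directory()
        BST_VIRTUAL_DIRECTORY = True

        # configure(), preflight(), get_unique_key(), etc. elided

        def assemble(self, sandbox):
            # Work on the staged tree through the Directory object
            # instead of operating on a file system path directly
            vroot = sandbox.get_virtual_directory()
            vroot.descend('buildstream-install', create=True)
            ...
            return '/buildstream-install'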

The steps above do not depend on CAS. However, the following steps do:
* Implement CAS backend for Directory class
* Stage sources into CAS as part of the fetch job
* Use the sources from CAS

Staging sources into CAS is problematic if, e.g., the whole .git directory is
included. We should avoid this where possible. This is already a concern for
caching of build trees (#21) and can be improved independently of any other
steps.


CAS FUSE Layer
~~~~~~~~~~~~~~
As a third phase I'm proposing to introduce a new FUSE layer that allows
safe access to a directory tree stored in CAS without having to extract
it to a regular file system. This is expected to reduce staging time with
local execution. For remote execution this is more important because:
* Workers and CAS may run on different machines, which means that staging
  will be a lot more expensive than our current approach with hard links.
* With build jobs that require execution of multiple actions, the already
  expensive staging would have to be done multiple times.

Added and modified files will initially be stored in a local temporary
directory and eventually stored in CAS.

We saw a huge performance overhead in our initial FUSE tests with pyfuse and
bindfs. However, in the meantime I wrote a prototype FUSE layer in C that
makes use of various optimization possibilities that FUSE supports, including
zero-copy reads/writes and a reduced number of context switches. Initial tests
with this FUSE layer show a reduction of the overhead from a factor of 2-5 to
just 6% in a parallel workload. With the reduced staging time, the overall
time for a build job is expected to be close to the current implementation.

This phase will change the new Sandbox.get_virtual_directory() function to
return a CAS-backed Directory object and the Sandbox implementation will
mount the FUSE layer.

The details of how to handle the dependency on the C-based FUSE layer are
open for discussion. It might become an optional dependency. If we can hard
depend on the new CAS FUSE layer, we can completely drop the existing pyfuse
SafeHardlinks layer.

The new FUSE layer will also make it possible to track used files for #56,
however, this is not part of this proposal.


Remote Execution
~~~~~~~~~~~~~~~~
With CAS support in place in the artifact cache, the virtual file system
API, and the FUSE layer, actual remote execution support can be added.
The core part on the client side is to implement a remote execution backend
for the Sandbox class.

As part of the build job, Sandbox.run() will upload missing blobs from the
local CAS to the remote CAS and submit an action to the Execution service.
The output files will not immediately be downloaded, avoiding unnecessary use
of network bandwidth; however, their digests will be added to the virtual
root directory of the sandbox as appropriate.
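
In sketch form (find_missing_blobs(), upload_blob() and submit_action()
stand in for the corresponding Remote Execution API calls, and
get_digest()/apply_outputs() are illustrative Directory methods):

    # Rough sketch of Sandbox.run() in a remote execution backend
    def run(self, command, flags):
        vroot = self.get_virtual_directory()    # CAS-backed Directory
        input_root_digest = vroot.get_digest()

        # Upload only the blobs the remote CAS does not already have
        for digest in find_missing_blobs(self.remote_cas, vroot):
            upload_blob(self.local_cas, self.remote_cas, digest)

        # Submit the action and wait for the result; output file and
        # directory digests from the result are merged back into the
        # virtual root without downloading the file contents.
        result = submit_action(self.remote_exec, command, input_root_digest)
        vroot.apply_outputs(result)
        return result.exit_code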

The artifact directory structure (files, meta, logs) will be created locally
using the virtual file system API, and uploaded to the remote CAS as part of
the push job, if push is enabled.
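
Assembling that layout could be just a few operations on a CAS-backed
Directory (names again follow the illustrative sketches above, and
CasBasedDirectory is a hypothetical backend class):

    # Build the artifact tree locally in CAS, ready for the push job
    artifact = CasBasedDirectory(local_cas)
    artifact.descend('files', create=True).import_files(output_directory)
    artifact.descend('logs', create=True).import_files(log_directory)
    artifact.descend('meta', create=True).import_files(metadata_directory)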

Commands such as checkout that require artifact contents to be local will
download missing blobs as needed.

Element plugins that do not set BST_VIRTUAL_DIRECTORY will not support
remote execution. Depending on the configuration, projects that use remote
execution and such plugins will either produce an error message or fall back
to local execution.

On the server side a suitable worker implementation is required that uses
the same sandboxing approach and FUSE layer (or an equivalent solution).


Command Batching
~~~~~~~~~~~~~~~~
The proposed architecture aims to keep the overhead of spinning up a remote
worker low such that a single build job may execute multiple actions.
This is required because we run certain commands with different environments
(e.g., integration commands) and some plugins manipulate the file system
between commands. It is also simpler from the API perspective, especially
with regard to backward compatibility.

It's too early to tell, but the overhead may still be an issue, in which case
I'm proposing to support command batching, i.e., allowing element plugins to
group multiple Sandbox.run() calls together and send them off to the
Execution service as a single action. A context-manager-based API may be
useful for this.
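
A hypothetical plugin-facing API for this could look as follows, with all
commands inside the with-block sent to the Execution service as a single
action (sandbox.batch() does not exist yet, and the flags argument to
run() is elided for brevity):

    def assemble(self, sandbox):
        with sandbox.batch():
            sandbox.run(['./configure', '--prefix=/usr'])
            sandbox.run(['make'])
            sandbox.run(['make', 'install'])
        return '/buildstream-install'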

