[BuildStream] Proposal: Artifact as a Proto

tl;dr
-----

This email is meant to stimulate thought around the idea of replacing the
top level of our CASCache-based artifacts with an explicit Artifact proto
rather than the current Directory-based solution.  In brief, I believe that
doing this will, in the long term, result in fewer special cases in our
CASCache fetch/integrity-checking code, and in easier management of complex
policies on local and remote caches.

Throughout this mail, nouns with capital letters will almost always refer to
a class or gRPC proto (unless they are capitalised merely because they start
a sentence).

It's worth noting that in writing this up, I had to delve into a lot of the
lower levels of the elements and artifacts code.  As such, I will, at times, be
less than 100% confident of things.  I hope I've made those clear so that
someone who knows better than I can chime in and affirm / correct as needed.

The status quo
--------------

In order to fully understand where I'm going with this, let's consider the
status quo in terms of functionality and arrangement, and, where visible,
planned future work.

Current artifacts are an evolved form of the artifact structure pre-CAS.  The
'ref' is the Digest of a Directory which represents the artifact as it used to
be unpacked on disk.  This means that metadata is stored in files in a
Directory referenced by that top level artifact Directory, logs are similarly
stored, etc.  This was, in a very basic way, exceedingly nice since it allowed
for identical handling of all the artifact parts.

Artifacts contain:

    meta/
    logs/
    files/
    buildtree/

The `files/` and `buildtree/` content should be self-explanatory (though it's
worth noting that while `buildtree/` is always present in the top level of the
artifact Directory structure, the provided Digest may reference a Directory
which simply does not exist in the CASCache or any of its remotes).

The `logs/` directory has content which for the most part is never referenced
by anyone at any point.  Richard Maw has recently been working on `bst artifact
log` to expose that content to the user, and that simply expects that the top
level Directory exists, but makes no assumptions about the content therein.

The `meta/` directory is of most interest to us at this point.  It (currently)
contains:

    build-result.yaml
    dependencies.yaml
    keys.yaml
    public.yaml
    workspaced-dependencies.yaml
    workspaced.yaml

The first of those (`build-result.yaml`) basically consists of an indication
that the build succeeded or failed, along with some optional detail beyond
that.  The `dependencies.yaml` is a mapping of element names to cache keys.
The `workspaced.yaml` is purely a boolean for whether or not this build came
from a workspaced element source.  The `workspaced-dependencies.yaml` lists
those elements which came from workspaces (presumably to short-circuit the
need to look in those elements' `workspaced.yaml` metadata).  The
`public.yaml` contains the dynamic public data, which might be important to
subsequently built elements, I suppose.  Finally, the `keys.yaml` simply
contains the computed weak and strong cache keys for the build.

In terms of future work already in progress, we know that the `buildtree/`
content is already "optional" in the local cache, though we always push it to
remote caches.  Tom is working on making that push optional, and the final
step in optionality would be to make it optional to include the build tree in
the artifact in the first place.

We know that some of our users wish to be able to trim build trees, and
possibly logs, from their caches in order to reduce the space used for
long-term archiving of artifacts.

The précis proposal
-------------------

In brief, I propose replacing all of the metadata files and the top level
Directory with an Artifact proto.  It would look something like this:

    message Artifact {
        // This version number must always be present and can be used to
        // further indicate presence or absence of parts of the proto at a
        // later date.  It only needs incrementing if what is *mandatory*
        // changes.
        int32 version = 1;

        // Core metadata
        string result = 2;   // succeeded / failed / whatever
        string details = 3;  // optional
        string strong_key = 4;
        string weak_key = 5;
        bool was_workspaced = 6;
        Digest files = 7;

        // Information about the build dependencies
        message Dependency {
            string element_name = 1;
            string cache_key = 2;
            bool was_workspaced = 3;
        }
        repeated Dependency build_deps = 8;

        // The public data is a yaml file which is stored into the CAS
        Digest public_data = 9;

        // The logs are stored in the CAS
        message LogFile {
            string name = 1;
            Digest digest = 2;
        }
        repeated LogFile logs = 10;  // Zero or more log files here

        // The build tree, if present, is in the CAS
        Digest build_tree = 11;  // optional
    }

In terms of the artifact cache, we'd replace the Digest stored under the ref
with an instance of Artifact.  It would be entirely acceptable for any cache to
choose to elide optional parts of the Artifact at any time, though it'd be
sensible to ensure that, if a remote cache returns an Artifact which lists a
build_tree Digest, the digest references a fully realisable tree present in
the associated CAS for at least an acceptable margin of time (e.g. 24 hours).
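
To make that concrete, here is a minimal sketch in Python of how the local
cache might serialise an Artifact under a new ref tree.  The module name
`artifact_pb2` and the `artifact-refs` directory are illustrative assumptions
on my part, not settled names:

    import os

    import artifact_pb2  # illustrative: the module protoc would generate
                         # from the Artifact proto above

    def save_artifact_ref(cas_dir, ref, artifact):
        # Serialise the Artifact proto under the new ref tree.
        path = os.path.join(cas_dir, "artifact-refs", ref)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as f:
            f.write(artifact.SerializeToString())

    def load_artifact_ref(cas_dir, ref):
        # Return the Artifact proto for this ref, or None if absent.
        path = os.path.join(cas_dir, "artifact-refs", ref)
        if not os.path.exists(path):
            return None
        artifact = artifact_pb2.Artifact()
        with open(path, "rb") as f:
            artifact.ParseFromString(f.read())
        return artifact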

Given this is a fundamental change in how artifacts are stored, we *should*
convert old artifacts over.  This would be relatively easy and could be handled
by creating a new ref tree alongside the current ref tree, looking in the new
tree first, and looking over to the old ref and converting if necessary.  If
that is done both on the local CAS and also by `bst-artifact-server` then
migration *forward* should be painless.  With a small amount of extra work we
could also ensure that the server can *create* the older format for use
by older clients.  We're not proposing changing the fundamental content of
artifacts when we do this, after all.
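
A sketch of that lookup-and-convert dance, reusing the helpers above; the
old-ref lookup and the converter (which would read the old `meta/` yaml files
out of the Directory and populate the proto fields) are hypothetical:

    def lookup_artifact(cas_dir, ref):
        # Prefer the new ref tree...
        artifact = load_artifact_ref(cas_dir, ref)
        if artifact is not None:
            return artifact
        # ...then fall back to the old Directory-based ref, converting
        # (and recording the conversion) if we find one.
        digest = load_old_ref(cas_dir, ref)          # hypothetical
        if digest is None:
            return None
        artifact = artifact_from_directory(digest)   # hypothetical converter
        save_artifact_ref(cas_dir, ref, artifact)
        return artifact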

I'd expect that rather than dealing with the raw Artifact proto objects
throughout the codebase, we'd instead create a CachedArtifact class which
mirrors the behaviour of the Element class (albeit deliberately lazily) and
handles the creation and management of artifacts in the local cache.  All the
rules regarding optionality of content, etc., will be handled entirely within
that class, with clean interfaces for detecting the presence/absence of
content in the rest of the code.
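
As a rough illustration (names and interfaces entirely up for debate), the
class might load the proto lazily and answer presence/absence questions so
that nothing else needs to poke at Directory structures; `load_artifact_ref`
and `contains_directory` on the cache are assumed helpers:

    class CachedArtifact:
        def __init__(self, cache, ref):
            self._cache = cache
            self._ref = ref
            self._proto = None  # loaded lazily, mirroring Element's laziness

        @property
        def _artifact(self):
            if self._proto is None:
                self._proto = self._cache.load_artifact_ref(self._ref)
            return self._proto

        def success(self):
            return self._artifact.result == "succeeded"

        def cached_buildtree(self):
            # The build tree is optional: treat it as present only if the
            # field is set *and* the Directory is realisable locally.
            a = self._artifact
            return (a.HasField("build_tree")
                    and self._cache.contains_directory(a.build_tree))

        def get_logs(self):
            return [(log.name, log.digest) for log in self._artifact.logs]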

Implications to existing code
-----------------------------

There are any number of implications to this proposal in terms of the amount
of work to implement, but I think it's worth first highlighting the code which
I think would become simpler to deal with subsequent to such work.

We expend a lot of cognitive effort (and code complexity) dealing with the
optionality of build trees.  This was predicted and accepted as a consequence
of that work, but it is in fact quite a large amount of work to manage.  By
moving build tree optionality out of the CAS' Directory structure, we make it
significantly easier to reason about its presence or absence, and will clean
up a large number of codepaths which currently have to special-case top level
Directory objects for artifacts.

The above paragraph can be summed up as "Placing semantics on otherwise generic
data structures results in higher cognitive load than specialised data
structures", or perhaps even more simply as "It's why we design classes rather
than store everything in dicts".

Obviously there's lots of current code which would need reworking in order to
untangle the artifact behaviours from the codebase.  For example, Element
understands the underlying artifact structure a little too much for my comfort,
and the `bst artifact` subcommands should, where possible, be implemented as
UI on top of behaviours of the ArtifactCache and CachedArtifact classes.

Rough implementation sequence
-----------------------------

I should note here that I do *NOT* claim to have fully mapped out this work,
and this is simply a starting point to spur thought and, eventually,
discussion.  This is therefore a grossly coarse-grained plan...

1. Confirm correct content for Artifact, and add it to the Buildstream protos
2. Create a CachedArtifact class and give it the ability to read/write
   Artifacts into a parallel ref tree by amending ArtifactCache appropriately
3. Give CachedArtifact the ability to read/write existing artifact structures
   in an attempt to provide cross-release compatibility.
4. Teach the cache expiry subsystem to use the new artifact format (see the
   sketch below)
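
For step 4, a nice property of the proto is that enumerating everything an
artifact anchors in the CAS becomes an explicit walk over named fields rather
than a Directory traversal.  A minimal sketch, assuming the generated proto
bindings:

    def required_digests(artifact):
        # Yield every Digest this Artifact anchors in the CAS.  Expiry can
        # keep these alive, or deliberately drop the optional ones (logs,
        # build tree) under a trimming policy.
        yield artifact.files
        yield artifact.public_data
        for log in artifact.logs:
            yield log.digest
        if artifact.HasField("build_tree"):
            yield artifact.build_tree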

At this point we can "fork()" development effort since we have the new
CachedArtifact and ArtifactCache behaviours capable of handling old and new
artifacts.  This lets us handle multiple pathways of development
simultaneously:

Pushing and Pulling:

1. Add new APIs to the reference service to allow for the setting and retrieval
   of the new Artifacts, including some capability markers.
2. Ensure the artifact server prefers the new Artifact stuff, but will create
   old refs and new artifacts from each other depending on what is written to
   them.
3. Teach the artifact server to use the new artifact format's support for the
   cache expiry stuff to ensure that on `GetArtifact()` (or similar) it can
   still update the CAS cache mtimes to ensure content is anchored.
4. Teach Buildstream to push and pull new artifacts in preference to old ones,
   but to be able to fall back to transacting old artifacts (with a warning)
   if necessary, as sketched below.
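
The client-side fallback in step 4 might look roughly like this; every remote
method named here is an assumption about the as-yet-undesigned APIs, not an
existing call:

    import warnings

    def pull_artifact(remote, cas_dir, ref):
        # Prefer the new Artifact service if the remote advertises it...
        if remote.supports_artifact_service():      # assumed capability marker
            artifact = remote.get_artifact(ref)     # assumed new API
            if artifact is not None:
                return artifact
        # ...otherwise transact the old Directory-based ref, with a warning,
        # converting via the hypothetical helper from the migration sketch.
        digest = remote.get_reference(ref)          # assumed old-style lookup
        if digest is None:
            return None
        warnings.warn("Remote served an old-format artifact for {}".format(ref))
        artifact = artifact_from_directory(digest)
        save_artifact_ref(cas_dir, ref, artifact)
        return artifact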

UI/UX:

Migrate the `bst artifact` subcommands one by one to the new CachedArtifact
API, building API where needed (and migrating Element functionality where
necessary).
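
For instance, `bst artifact log` could reduce to a thin layer over the
CachedArtifact sketch above.  Here `read_blob` is an assumed helper for
fetching a blob's content from the local CAS (BuildStream's CLI uses click):

    import click

    def artifact_log(cache, ref):
        # Print each log the artifact carries, fetching content by Digest.
        artifact = CachedArtifact(cache, ref)
        for name, digest in artifact.get_logs():
            click.echo("--- {} ---".format(name))
            click.echo(cache.read_blob(digest).decode("utf-8"))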

Element:

1. Migrate Element so that instead of creating old-style artifacts it creates
   new-style ones via an appropriate CachedArtifactBuilder or similar (a
   possible shape for which is sketched below).
2. Slowly migrate any of Element's remaining behaviours which properly belong
   in CachedArtifact over to it, leaving compatibility calls in place so that
   we don't break public API (`__get_artifact_metadata_*()`)
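
A possible shape for that builder, purely illustrative and following the
proto fields above:

    class CachedArtifactBuilder:
        # Accumulates the pieces of a build, then commits them as a single
        # new-style Artifact in the local cache (all names illustrative).
        def __init__(self, cache, ref):
            self._cache = cache
            self._ref = ref
            self._proto = artifact_pb2.Artifact(version=1)

        def set_result(self, succeeded, details=""):
            self._proto.result = "succeeded" if succeeded else "failed"
            self._proto.details = details

        def add_log(self, name, digest):
            log = self._proto.logs.add()
            log.name = name
            log.digest.CopyFrom(digest)

        def commit(self):
            # Write the proto under the new ref tree and hand back the
            # read-side wrapper.
            self._cache.save_artifact_ref(self._ref, self._proto)
            return CachedArtifact(self._cache, self._ref)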

I hope this gives everyone food for thought - let's speak about it in the new year.

D.

-- 
Daniel Silverstone                          https://www.codethink.co.uk/
Solutions Architect               GPG 4096/R Key Id: 3CCE BABE 206C 3B69

