Re: [BuildStream] Proposal: Artifact as a Proto



Hi Daniel,

Thanks for writing this up.

On Fri, 2018-12-21 at 16:28 +0000, Daniel Silverstone via BuildStream-
list wrote:
The status quo
--------------

In order to fully understand where I'm going with this, let's consider the
status quo in terms of functionality and arrangement, and, where visible,
planned future work.

Current artifacts are an evolved form of the artifact structure pre-CAS.  The
'ref' is the Digest of a Directory which represents the artifact as it used to

Just to have clear terminology. The 'ref' is not a digest, it's a name
that _points_ to the Digest of a Directory. I.e., similar to git/ostree
refs except that there are no commit objects.

be unpacked on disk.  This means that metadata is stored in files in a
Directory referenced by that top level artifact Directory, logs are similarly
stored, etc.  This was, in a very basic way, exceedingly nice since it allowed
for identical handling of all the artifact parts.
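To make the ref/Digest distinction above concrete, here is a minimal sketch (all names hypothetical, not the actual BuildStream implementation): the CAS is immutable and keyed by content hash, while a ref is just a mutable name pointing at one such Digest, much like a git ref pointing at a commit hash but without a commit object in between:

```python
import hashlib

def digest_of(blob):
    """Content-addressed Digest as a (sha256 hex, size in bytes) pair."""
    return hashlib.sha256(blob).hexdigest(), len(blob)

# CAS (sketch): immutable storage keyed by content hash.
cas = {}
# Refs (sketch): mutable names -> Digest, like refs/heads/* in git.
refs = {}

serialized_directory = b"<serialized Directory proto>"
h, size = digest_of(serialized_directory)
cas[h] = serialized_directory

# The 'ref' names the artifact; repointing it never mutates the CAS.
refs["project/element/cache-key"] = (h, size)
```

Deleting or repointing a ref is cheap precisely because the Directory it pointed at stays in the CAS until expiry.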

[...]

The précis proposal
-------------------

In brief, I propose replacing all of the metadata files, and the top level
Directory with an Artifact proto.  It would look something like this:
[...]

As I see it, there are two aspects that may make sense to be discussed
somewhat separately:

 * Replace the generic ReferenceStorage service
 * Change serialization format for artifacts

Replace the generic ReferenceStorage service
--------------------------------------------
This is about replacing the generic ReferenceStorage service (providing
a feature very similar to git/ostree refs) with BuildStream-specific
services that have knowledge of the top-level structure of the content
and the metadata structure. Besides the Artifact proto and
corresponding service you're proposing here, we would need something
similar for the planned SourceCache feature.

Advantages:
 * Server can have Artifact-specific (and in the future SourceCache-
   specific) logic. Could be useful for more advanced cache expiration
   logic.

Disadvantages:
 * General feature enhancements in the future: the server may need to
   be updated before an updated client can be (fully) used. This can
   slow down feature development/deployment.
 * Specific example: the server needs to be extended/updated when
   introducing the SourceCache feature.
 * May lead to duplication of work for scalable server implementation
   (see section 'Server implementation')
 * May require more complex configuration (client and server side) and
   possibly CAS proxying (which may increase CPU load, see section
   'Endpoints/configuration')
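To make the trade-off concrete, here is a rough sketch of how the request shapes differ (dataclasses standing in for the protos; the names are hypothetical, not the actual service definitions):

```python
from dataclasses import dataclass

@dataclass
class Digest:
    hash: str
    size_bytes: int

# Generic design (sketch): one service covers artifacts, sources,
# anything. The server only sees opaque keys, so no domain logic,
# but no server change is needed for new client features either.
@dataclass
class GetReferenceRequest:
    key: str  # opaque; the client encodes "artifact/..." itself

@dataclass
class GetReferenceResponse:
    digest: Digest  # both designs ultimately resolve to a CAS Digest

# Specific design (sketch): the server understands the domain,
# enabling e.g. artifact-aware cache expiry, but every new domain
# (SourceCache) needs a new service rolled out server-side first.
@dataclass
class GetArtifactRequest:
    cache_key: str

@dataclass
class GetSourceRequest:  # would have to be added server-side later
    source_key: str
```

The disadvantage list above falls out of the second half of this sketch: each new domain-specific request type is a server deployment, not just a client release.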

In general I prefer the generic service, as it makes it easier to
extend the client or to support other clients with slightly different
feature sets. That said, if there is a real need for Artifact-specific
logic on the server side, it probably makes sense to have an Artifact
service instead of the generic ReferenceStorage service, but we should
have a clear motivation for it. What do you see as the main reason for
this change?

Change serialization format for artifacts
-----------------------------------------
Replacing the YAML metadata stored in multiple files with a single
protobuf probably makes sense, not least for performance reasons.
Whether we store the main directory digests in a protobuf field or in a
generic Directory doesn't make a big difference in practice, in my
opinion; see also the note below about the class structure in the
BuildStream code.

With the current generic service, using a regular Directory makes
sense, but if we move to BuildStream-specific services anyway, storing
the main directory digests in Artifact proto fields probably does make
more sense.
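As an illustration of the serialization point (stdlib json standing in for YAML and protobuf here, since the proposed Artifact proto is elided above; the field names are made up): several small per-file parses collapse into a single read and parse, and the optional content becomes an explicit field rather than a directory entry that may or may not exist:

```python
import json

# Status quo (sketch): metadata scattered over several files inside the
# artifact Directory, each fetched and parsed separately.
meta_files = {
    "meta/keys.yaml": '{"strong": "abc123", "weak": "def456"}',
    "meta/build-result.yaml": '{"success": true}',
}
metadata = {}
for name, content in meta_files.items():
    metadata[name] = json.loads(content)  # one parse per file

# Proposal (sketch): one Artifact message holding the same fields plus
# the digests of the content directories; one read, one parse.
artifact_blob = json.dumps({
    "strong_key": "abc123",
    "weak_key": "def456",
    "build_success": True,
    "files": {"hash": "<files-dir-hash>", "size_bytes": 0},
    "buildtree": None,  # optionality is explicit, not a missing entry
})
artifact = json.loads(artifact_blob)
```

Either way the content directories themselves stay in the CAS; only the envelope around their digests changes.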

Given this is a fundamental change in how artifacts are stored, we *should*
convert old artifacts over. [...]

Given that we occasionally still increase the core artifact version,
i.e., we haven't committed to artifact cache stability yet, I don't
think it makes much sense to support migration at all. We should keep
supporting both protocols on the server (at least for a while) to make
it easier for projects/clients to migrate, though.

I'd expect that rather than dealing with the raw Artifact proto objects
throughout the codebase, we'd instead create a CachedArtifact class which
mirrors the behaviour of the Element class (albeit deliberately lazily) and
handles the creation and management of artifacts in the local cache.  All the
rules regarding optionality of content etc will be handled entirely within that
class, with clean interfaces for detecting the presence/absence of content in
the rest of the code.

That sounds sensible.
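A rough sketch of what such a CachedArtifact class might look like (entirely hypothetical names; the point is that laziness and optionality live behind one interface, whatever the on-disk serialization is):

```python
# Hypothetical sketch: CachedArtifact hides the serialization format
# and exposes optionality (e.g. build trees) behind query methods,
# loading the underlying record lazily on first access.
class CachedArtifact:
    def __init__(self, cache, cache_key):
        self._cache = cache      # object able to fetch raw records
        self._key = cache_key
        self._record = None      # parsed lazily, not at construction

    def _load(self):
        if self._record is None:
            self._record = self._cache.load_record(self._key)
        return self._record

    def has_buildtree(self):
        # Optionality handled in one place instead of special-casing
        # the top-level Directory throughout the codebase.
        return self._load().get("buildtree") is not None

    def files_digest(self):
        return self._load()["files"]

class FakeCache:
    """Stand-in cache backend for illustration."""
    def load_record(self, key):
        return {"files": ("abc", 42), "buildtree": None}
```

Whether `load_record` deserializes an Artifact proto or walks a generic Directory is invisible to callers, which is the point made further below about the class being independent of the serialization format.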


Implications to existing code
-----------------------------

There are any number of implications to this proposal, in terms of the amount
of work to implement, but I think it's worth first highlighting the code which
I think would become simpler to deal with subsequent to such work.

We spend a lot of cognitive load (and code complexity) dealing with the
optionality of build trees.  This was predicted and accepted as a consequence
of the work, but it is in fact quite a large amount of work to manage and deal
with.  By moving build tree optionality out of the CAS' Directory structure, we
make it significantly easier to reason about the presence/absence of it and
will clean up a large number of codepaths which currently have to special case
top level Directory objects for artifacts.

The above paragraph can be summed up as "Placing semantics on otherwise generic
data structures results in higher cognitive load than specialised data
structures", or perhaps even more simply as "It's why we design classes rather
than store everything in dicts".

While I agree that it makes sense to have a specialized class for this,
I disagree with this being closely connected to whether we store
everything in a generic Directory or whether we use our own Artifact
proto for this. The latter is just a serialization format. The
CachedArtifact class will anyway completely hide the serialization
format, so the API of that class could look the same either way. I.e.,
while I think introducing such a class makes sense, I don't see it as a
relevant motivation for moving to the Artifact proto.

Server implementation
---------------------
The ReferenceStorage service is currently implemented in BuildStream's
own bst-artifact-server (simple filesystem-based implementation) as
well as in BuildGrid (where the idea is to have a more scalable
implementation with other storage backends such as S3).

By replacing the generic ReferenceStorage service with BuildStream-
specific services, I suspect that the BuildGrid project may not want to
implement these anymore.

If that's the case, what's the plan for a more scalable implementation?
Is the plan to add S3 etc. support to BuildStream's artifact server,
duplicating storage backends from BuildGrid? Or a new project for that?
If this will not be part of BuildStream itself, releases will have to
be coordinated as new client features may require new server
versions. Is there any possibility we can do this without duplicating
BuildGrid's work?

Endpoints/configuration
-----------------------
If BuildGrid won't implement the new BuildStream-specific services,
we'll need a way to use these new separately developed services with an
existing CAS server (BuildGrid or other servers), assuming we still
want to support BuildGrid and other CAS servers to store the actual
contents of BuildStream artifacts.

Have you already thought about how to handle this? The main options I
see are to support configuring two separate endpoints in BuildStream or
proxying CAS requests from the artifact server to the real CAS server.
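If both options end up being supported, the client-side configuration might look roughly like this (hypothetical keys, not existing BuildStream configuration; the split form assumes the artifact service and the CAS can be addressed independently):

```yaml
# Option A (sketch): two separate endpoints -- the artifact service
# stores only refs/Artifact protos, the CAS holds the actual content.
artifacts:
  artifact-service:
    url: https://artifacts.example.com:443
  storage-service:
    url: https://cas.example.com:443
---
# Option B (sketch): single endpoint -- the artifact server proxies CAS
# requests to the real CAS server (extra CPU load on the proxy).
artifacts:
  url: https://artifacts.example.com:443
```

Option A keeps bulk transfers off the artifact server but doubles the endpoints users must configure; Option B keeps configuration simple at the cost of the proxying overhead mentioned above.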

Cheers,
Jürg


