Re: [BuildStream] Proposal: Artifact as a Proto



On Tue, Jan 08, 2019 at 12:43:23 +0100, Jürg Billeter wrote:
Thanks for writing this up.

No problem, and thanks for your thoughts on it.

Current artifacts are an evolved form of the artifact structure pre-CAS.  The
'ref' is the Digest of a Directory which represents the artifact as it used to be.

Just to have clear terminology. The 'ref' is not a digest, it's a name
that _points_ to the Digest of a Directory. I.e., similar to git/ostree
refs except that there are no commit objects.

Yes, though in this context the point was more that a 'ref' addresses a Digest
object which points at a Directory, rather than that the ref addresses a
Directory directly; that was the distinction I was trying to ensure was
understood.

The précis proposal
-------------------

In brief, I propose replacing all of the metadata files and the top level
Directory with an Artifact proto.  It would look something like this:
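
(The exact fields, names and numbering below are only an illustrative sketch,
not a settled definition.)

    syntax = "proto3";

    import "build/bazel/remote/execution/v2/remote_execution.proto";

    message Artifact {
      // Version of the artifact format, bumped on incompatible changes.
      uint32 version = 1;

      // Whether the build of this element succeeded.
      bool build_success = 2;

      // Strong and weak cache keys under which the artifact is stored.
      string strong_key = 3;
      string weak_key = 4;

      // Digest of the Directory containing the output files.
      build.bazel.remote.execution.v2.Digest files = 5;

      // Digest of the Directory containing the build tree, if cached.
      build.bazel.remote.execution.v2.Digest buildtree = 6;

      // Digest of the blob holding the element's public data.
      build.bazel.remote.execution.v2.Digest public_data = 7;

      // Log files captured during the build.
      message LogFile {
        string name = 1;
        build.bazel.remote.execution.v2.Digest digest = 2;
      }
      repeated LogFile logs = 8;
    }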

As I see it, there are two aspects that may make sense to be discussed
somewhat separately:

 * Replace the generic ReferenceStorage service
 * Change serialization format for artifacts

Replace the generic ReferenceStorage service
--------------------------------------------
[snip]

Disadvantages:
 * General feature enhancements in the future: Server may need to be
   updated before updated client can be (fully) used. This can slow
   down feature development/deployment.

Yes, although I'd turn this on its head and say it's an advantage because it
means we'll think exceedingly carefully about the semantics of a mandatory
change in the artifact format, and we will be forced to support older versions
of the artifact format, which will improve compatibility over time.

 * Specific example: Server needs to be extended/updated when
   introducing the SourceCache feature

I see no issue here because I don't think the SourceCache should use the same
ref service anyway.

 * May lead to duplication of work for scalable server implementation
   (see section 'Server implementation')
 * May require more complex configuration (client and server side) and
   possibly CAS proxying (which may increase CPU load, see section
   'Endpoints/configuration')

CAS proxying could be avoided if endpoints are split as you note later.

In general I prefer the generic service as this makes it easier to
extend the client or support other clients with slightly different
feature sets. However, if there is a need for Artifact-specific logic
on the server side, it probably makes sense to have an Artifact service
instead of the generic ReferenceStorage service. That said, we should
have a clear motivation for this. What do you see as the main reason for
this change?

My concern with the generic service is that the semantics of the components
of an artifact, their presence or absence under various conditions, and so on,
are *implicit* in the generic structure rather than being *explicit* under a
well defined proto whose documentation makes it very clear what it means if,
for example, the buildtree value is set or unset.
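
As an illustration of the kind of explicit, documented semantics I have in
mind (again only a sketch; the exact wording and behaviour would need
agreeing), the buildtree field could be documented directly in the proto:

    // Digest of the Directory containing the build tree used to produce
    // this artifact.  If this field is unset, no build tree was cached
    // for this artifact (for example because build tree caching was
    // disabled); an *empty* build tree is instead represented by the
    // Digest of an empty Directory.
    build.bazel.remote.execution.v2.Digest buildtree = 6;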

Change serialization format for artifacts
-----------------------------------------
[snip]

Given this is a fundamental change in how artifacts are stored, we *should*
convert old artifacts over. [...]

Given that we occasionally still increase the core artifact version,
i.e., we haven't committed to artifact cache stability yet, I don't
think it makes much sense to support migration at all. We should keep
supporting both protocols on the server (at least for a while) to make
it easier for projects/clients to migrate, though.

If we're happy to alienate all of our current users, then fine.  I was figuring
that at least one release in which we can cope with the 1.2 artifact format, as
well as whatever we make going forward, would be sensible.

We currently spend a lot of cognitive effort (and code complexity) dealing with
the optionality of build trees.  This was predicted and accepted as a
consequence of that work, but it is in fact quite a large amount of work to
manage.  By moving build tree optionality out of the CAS' Directory structure,
we make it significantly easier to reason about its presence or absence, and
will clean up a large number of codepaths which currently have to special-case
top level Directory objects for artifacts.

The above paragraph can be summed up as "Placing semantics on otherwise generic
data structures results in higher cognitive load than specialised data
structures", or perhaps even more simply as "It's why we design classes rather
than store everything in dicts".

While I agree that it makes sense to have a specialized class for this,
I disagree with this being closely connected to whether we store
everything in a generic Directory or whether we use our own Artifact
proto for this. The latter is just a serialization format. The
CachedArtifact class will completely hide the serialization format anyway,
so the API of that class could look the same either way. I.e.,
while I think introducing such a class makes sense, I don't see this as
relevant motivation for moving to the Artifact proto.

I was more drawing a parallel between "everything is a dict vs. nice classes"
and "everything is Directory/Blob vs. ArtifactProto".  I understand and accept
that it's "just a serialization format"; though, because it's a format whose
semantics are more closely associated with RE than with BuildStream directly, I
feel it's a not entirely unfair comparison to make.

Server implementation
---------------------
The ReferenceStorage service is currently implemented in BuildStream's
own bst-artifact-server (simple filesystem-based implementation) as
well as in BuildGrid (where the idea is to have a more scalable
implementation with other storage backends such as S3).

By replacing the generic ReferenceStorage service with BuildStream-
specific services, I suspect that the BuildGrid project may not want to
implement these anymore.

If that's the case, what's the plan for a more scalable implementation?
Is the plan to add S3 etc. support to BuildStream's artifact server,
duplicating storage backends from BuildGrid? Or a new project for that?
If this will not be part of BuildStream itself, releases will have to
be coordinated as new client features may require new server
versions. Any possibility we can do this without duplicating BuildGrid
work?

One option is to proxy from our implementation to a remote CAS endpoint.

Another is for BuildStream to consume BuildGrid's implementation of storage and
work with that project to produce an independent library which can be used by
both projects.

Finally, we could accept that there are actually two different semantic
purposes at work here: BuildGrid's CAS is to be optimised for the execution of
build operations across a grid of computers, where results need live only as
long as they are anchored by ActionResult objects; whereas BuildStream wants
long-lived content which is considered important in a way that intermediate
build results in a grid are not.

Endpoints/configuration
-----------------------
If BuildGrid won't implement the new BuildStream-specific services,
we'll need a way to use these new separately developed services with an
existing CAS server (BuildGrid or other servers), assuming we still
want to support BuildGrid and other CAS servers to store the actual
contents of BuildStream artifacts.

Have you already thought about how to handle this? The main options I
see are to support configuring two separate endpoints in BuildStream or
proxying CAS requests from the artifact server to the real CAS server.

As detailed above, there are a number of options.  I would say that my
preferred approach would be to work with the BuildGrid project to abstract
out the storage classes into a library to be used by both projects, rather
than to expect both projects to properly, yet independently, implement the
same protocol and semantics.  Such an activity may also result in better
separation of concerns and a better understanding of what it means for
an artifact server to exist and provide content to BuildStream.

There's no need, for example, for the *local* cache to have the exact same
semantics as a remote cache, though I'd expect them to be the same short-term
for obvious ease-of-implementation reasons.

D.

-- 
Daniel Silverstone                          https://www.codethink.co.uk/
Solutions Architect               GPG 4096/R Key Id: 3CCE BABE 206C 3B69

