Re: [BuildStream] [PROPOSAL] Adopt Remote Asset API



TL;DR
- General agreement
- No rush on the 2.0 release

More inline.

On Thu, Feb 13, 2020 at 11:03 AM Jürg Billeter <j bitron ch> wrote:
Hi Sander,

Thanks for your work on the Remote Asset API and writing this proposal.
Overall this definitely makes sense to me.

Thanks, that's good to hear.

On Mon, 2020-02-10 at 22:45 +0100, Sander Striker via buildstream-list
wrote:
> TL;DR
> In January the Remote Asset API[1] landed in the Remote APIs
> repository.  The API opens up a number of opportunities for
> standardization and consolidation.
>
> I propose we:
> - Retire Artifact Cache, use a Remote Asset API based Asset Cache
> instead
> - Retire Source Cache, use a Remote Asset API based Asset Cache
> instead

Yes, it will be good to drop the requirement to deploy
BuildStream-specific servers.

> - Introduce caching of individual sources, using a Remote Asset API
> based Asset Cache
> - Introduce tracking of individual sources, using the Remote Asset
> API

A few questions/comments regarding the server implementation:
 * As far as I know, there is no server implementation yet, although
   there are plans. Is there already a web page detailing these plans
   or even a repository?

I don't think these plans are outlined anywhere yet.  I do expect it to be considered in BuildGrid at least.
 
 * Will the planned server be suitable for use in the BuildStream test
   suite as well?

Can you enlighten me a bit on what would make it suitable?
 
 * We probably want a buildbox-casd-like local proxy for this, see also
   #1064. That proxy could double as a simple local server for testing
   (and possibly small-scale server deployments) as well, similar to
   buildbox-casd being usable without a remote. In that case, our focus
   should likely be on the local proxy and we wouldn't be blocked by
   the "real"/scalable server implementation.

That is indeed the direction I was thinking of.  Ideally there is no casserver.py in the repo anymore, and we can rely on the buildbox-*d implementation(s), which I expect to indeed be simple [caching] proxies.
 
> ## Retire Artifact Cache, use a Remote Asset API based Asset Cache
> instead
>
> Instead of a dedicated ArtifactService, we can leverage the Remote
> Asset API FetchService and PushService.  We will retain the Artifact
> message proto to describe an artifact.  However, we will associate it
> remotely via PushService.PushBlob:
> - PushBlobRequest.uris is
>   [ARTIFACT_URI_TEMPLATE.format(Artifact.strong_key),
>    ARTIFACT_URI_TEMPLATE.format(Artifact.weak_key)]
> - PushBlobRequest.blob_digest is the digest of the Artifact message.
> The Artifact message will need to be stored in CAS separately.
> - PushBlobRequest.references_directories is [Artifact.files,
>                                              Artifact.logs.digest,
>                                              Artifact.buildtree,
>                                              Artifact.sources]
> depending on lifetime requirements.
>
> Similarly, we will retrieve Artifacts using FetchService.FetchBlob:
> - FetchBlobRequest.uris is ARTIFACT_URI_TEMPLATE.format(cache_key).
> The Artifact can be retrieved from CAS at
> FetchBlobResponse.blob_digest.

This sounds reasonable. Same for the similar source cache replacement.

> ARTIFACT_URI_TEMPLATE could be defined as
> "urn:buildstream:artifact:{}".  However this would require
> registration with IANA, see
> https://www.iana.org/assignments/urn-namespaces/urn-namespaces.xhtml.

Could it make sense to use "urn:fdc:buildstream.build:2020:" as prefix
instead?

That looks very feasible indeed.  Good find.
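
To make the artifact mapping above concrete, here is a minimal sketch of how the URIs and the push request could be assembled. It is not the BuildStream implementation: the `artifact:` suffix appended to the suggested prefix, the `PushBlobRequestSketch` dataclass (a stand-in for the real protobuf message), and the use of plain strings in place of Digest messages are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List

# Assumption: the prefix suggested in this thread plus a hypothetical
# "artifact:" component; the final URN layout is not decided here.
ARTIFACT_URI_TEMPLATE = "urn:fdc:buildstream.build:2020:artifact:{}"


@dataclass
class PushBlobRequestSketch:
    """Hypothetical stand-in for the PushBlobRequest fields named above."""
    uris: List[str]
    blob_digest: str  # digest of the Artifact message (a string here)
    references_directories: List[str] = field(default_factory=list)


def artifact_uris(strong_key: str, weak_key: str) -> List[str]:
    # One URI per cache key, so the artifact can be found by either key.
    return [
        ARTIFACT_URI_TEMPLATE.format(strong_key),
        ARTIFACT_URI_TEMPLATE.format(weak_key),
    ]


def make_push_request(strong_key: str, weak_key: str,
                      artifact_digest: str,
                      directory_digests: List[str]) -> PushBlobRequestSketch:
    # directory_digests would be files, logs, buildtree and sources,
    # depending on lifetime requirements.
    return PushBlobRequestSketch(
        uris=artifact_uris(strong_key, weak_key),
        blob_digest=artifact_digest,
        references_directories=list(directory_digests),
    )
```

A fetch would use a single `ARTIFACT_URI_TEMPLATE.format(cache_key)` URI and read the Artifact message from CAS at the returned blob digest.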

> ## Introduce caching of individual sources, using a Remote Asset API
> based Asset Cache
>
> We can reduce the load and reliance on additional services by
> leveraging the Remote Asset API.  CAS and Remote Asset API combined
> can serve as a cache of the content, with the additional benefit that
> having the content in CAS means it can be referred to in Remote
> Execution without additional uploads.
>
> To support this we need to extend the Source Plugin API to return the
> list of URIs and qualifiers as needed by the FetchService.
> Specifically:
> - FetchDirectoryRequest.uris is the complete set of URLs that
> represent the content of the source.  This is the full set after
> alias expansion.  For example: git.example.com/foo/bar.git and
> git-mirror.example.com/foo/bar.git.
> - FetchDirectoryRequest.qualifiers is the minimal set of qualifiers
> that uniquely identifies the source.  It's expected to be closely
> tied to the value of `ref`.  For example:
>   - vcs.commit = b5123b1bb2853393c7b9aa43236db924d7e32d61
>   - resource_type = application/x-git

Do we want to replace the existing `ref` mechanism to be suitable for
both mechanisms or do we want mostly independent API methods for the
two?

I think we need to consider the options here.  The `ref` mechanism may not be suitable for both.  I'm open to suggestions here. 
 
> We will need to expand the Source Plugin API similarly to return the
> list of URIs and qualifiers as needed by the PushService.

Based on your examples, the qualifiers used for push are a combination of
the qualifiers used for fetching and tracking. Can we design the API in
such a way that plugins don't have to duplicate the common parts?

Sounds like a good idea.
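
One possible shape for such an API, sketched under assumptions: a base class derives the push qualifiers from the fetch and tracking qualifiers, so a plugin only declares each set once. The class and method names (`SourceSketch`, `asset_uris`, `fetch_qualifiers`, `track_qualifiers`, `push_qualifiers`) are hypothetical and not part of the existing Source Plugin API; the `vcs.branch` qualifier and the `https://` scheme in the example are likewise assumptions.

```python
from typing import Dict, List


class SourceSketch:
    """Hypothetical base class; not the real BuildStream Source class."""

    def asset_uris(self) -> List[str]:
        raise NotImplementedError

    def fetch_qualifiers(self) -> Dict[str, str]:
        raise NotImplementedError

    def track_qualifiers(self) -> Dict[str, str]:
        raise NotImplementedError

    def push_qualifiers(self) -> Dict[str, str]:
        # Derived once here: the union of tracking and fetch qualifiers.
        # Fetch qualifiers win on conflict, since they pin exact content.
        merged = dict(self.track_qualifiers())
        merged.update(self.fetch_qualifiers())
        return merged


class GitSourceSketch(SourceSketch):
    def __init__(self, urls: List[str], commit: str, branch: str):
        self.urls = urls      # full set of URLs after alias expansion
        self.commit = commit  # closely tied to the value of `ref`
        self.branch = branch  # what tracking would follow

    def asset_uris(self) -> List[str]:
        return list(self.urls)

    def fetch_qualifiers(self) -> Dict[str, str]:
        return {"resource_type": "application/x-git",
                "vcs.commit": self.commit}

    def track_qualifiers(self) -> Dict[str, str]:
        return {"resource_type": "application/x-git",
                "vcs.branch": self.branch}
```

With this split, a plugin never spells out the push qualifiers itself; whether `ref` can back both the fetch and the tracking set is left open, as discussed above.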
 
> Behavior should be configurable to support the following use cases:
> - client does *not* use the Remote Asset API to fetch sources, and
> only uses the source plugin native fetch
> - client uses the Remote Asset API to fetch sources, and falls back
> to using the source plugin native fetch
> - client uses the Remote Asset API to fetch sources, and does *not*
> fall back to using the source plugin native fetch
> - client uses the Remote Asset API to push sources, after using the
> source plugin native fetch
> - client does *not* use the Remote Asset API to push sources

At first glance the number of possible behaviors seems rather high.
However, splitting the configuration up into a couple of orthogonal
knobs might make this simple enough.

I think that is the best approach.  I think the default would be:
- client uses the Remote Asset API to fetch sources, and falls back to using the source plugin native fetch
And that would be sufficient to start with.
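
The orthogonal knobs could look roughly like the sketch below: three independent booleans cover the five use cases listed above. All names (`SourceAssetConfig`, `remote_fetch`, `native_fallback`, `remote_push`, `fetch_source`, `should_push`) are hypothetical, and the push default of `False` is an assumption, since only the fetch default was discussed.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class SourceAssetConfig:
    remote_fetch: bool = True     # try the Remote Asset API first
    native_fallback: bool = True  # fall back to the plugin's native fetch
    remote_push: bool = False     # assumption: pushing is opt-in


def fetch_source(config: SourceAssetConfig,
                 remote_fetch: Callable[[], Optional[str]],
                 native_fetch: Callable[[], str]) -> Tuple[str, str]:
    """Return (content, origin), where origin is "remote" or "native"."""
    if config.remote_fetch:
        result = remote_fetch()  # None means the asset was not found
        if result is not None:
            return result, "remote"
        if not config.native_fallback:
            raise LookupError("source not available via Remote Asset API")
    # Either remote fetching is disabled or it missed and fallback is on.
    return native_fetch(), "native"


def should_push(config: SourceAssetConfig, origin: str) -> bool:
    # Only content obtained natively is new to the remote.
    return config.remote_push and origin == "native"
```

The default-constructed `SourceAssetConfig()` matches the proposed starting behavior: fetch through the Remote Asset API, and fall back to the source plugin's native fetch on a miss.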
 
# Implementation plan

Implementing the whole proposal will likely take some time, especially
if we include the effort for the local proxy. I would suggest focusing
on retiring artifact and source caches and deferring caching/tracking
of individual sources until the former is ready. Or do you think it's
important to tackle everything right away?

I suggest we do it incrementally, in the order you proposed.
 
We may also want to keep the release of BuildStream 2.0 in mind. We
most likely don't want to release 2.0 with the current artifact/source
cache and then drop it soon after. However, caching and tracking of
individual sources is more like an extension and could probably be
added after 2.0. That said, fleshing out the source plugin API before
2.0 would be useful.

I see no reason to rush the 2.0 release.  I prefer for us to land the source plugin API changes ahead of that.  I don't think the implementation would hold things up massively beyond that.
 
Cheers,
Jürg

Cheers,

Sander
 

