Re: [BuildStream] Source cache plan

From: Sander Striker <s striker striker nl>
To: Raoul Hidalgo Charman <raoul hidalgocharman codethink co uk>
Cc: BuildStream <buildstream-list gnome org>
Subject: Re: [BuildStream] Source cache plan
Date: Fri, 18 Jan 2019 23:29:02 +0100

Hi,

On Thu, Jan 17, 2019 at 1:57 PM Raoul Hidalgo Charman via BuildStream-list <buildstream-list gnome org> wrote:

[...]

Source cache plan
-----------------

Source cache, originally raised in [4], will use both local and remote CASs to
store staged sources, preferentially trying to fetch from the remote cache(s),
and if not present, fetch from the actual source.

I suggest that the 'SourceCache' class be part of the context (similar to how
'ArtifactCache' is), and contains config related to source cache such as
which remote(s) to use and the local CAS object. When the element class deals
with sources it can now do it via the source cache rather than directly calling
source methods.

So far I think the source cache needs the following API, in which every method
takes a source object as an argument:
* get_consistency: Returns the source's consistency. I propose that the
'Consistency' type has an additional field 'STAGED', which corresponds to the
source being staged in the CAS but not the unstaged source in sourcedir.

I am not sure if it is clear what this means. Given that I needed to re-read this to come up with my interpretation, it might be best to clarify. Specifically "the source being staged in the CAS" part. Is this state essentially mapping to having the staged sources for the element (identified by a source cache key) available in CAS?
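To make my reading concrete, here is a minimal sketch of the state as I interpret it. INCONSISTENT, RESOLVED and CACHED are meant to mirror the existing states; where STAGED sits in the ordering, and the `get_consistency` helper with its three boolean parameters, are assumptions of mine for illustration only, not existing BuildStream API.

```python
from enum import IntEnum


class Consistency(IntEnum):
    INCONSISTENT = 0  # no ref resolved for the source yet
    RESOLVED = 1      # ref is known, nothing available locally
    STAGED = 2        # proposed: staged tree is in CAS, no unstaged copy in sourcedir
    CACHED = 3        # unstaged source is present in sourcedir


def get_consistency(has_ref, in_sourcedir, in_cas):
    """Hypothetical helper: derive the state from three booleans."""
    if not has_ref:
        return Consistency.INCONSISTENT
    if in_sourcedir:
        return Consistency.CACHED
    if in_cas:
        return Consistency.STAGED
    return Consistency.RESOLVED
```

Under this reading, STAGED answers exactly one question: is a staged tree for this source cache key available in CAS, even though sourcedir has nothing?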

* fetch: depending on the consistency, will fetch from remote CAS or using
source plugin.

This seems to fall into the same trap as we had with artifacts. If we're staging for remote execution the only thing we're interested in fetching is the Tree representing the staged sources, but not the blobs (read: files).

At the end of the fallback to fetching from the actual source repositories, can we expect the staged sources to exist in CAS?

Do we need 2 separate calls here, one to just grab sources from CAS, and another to grab sources from the repositories?
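A rough sketch of what two separate calls could look like, with the combined fallback built on top of them. Everything here is hypothetical: a plain dict stands in for the CAS (source cache key to tree digest), and `plugin_fetch_and_stage` stands in for whatever the source plugin does to fetch from the repository and produce a staged tree.

```python
class SourceNotInCas(Exception):
    """Raised when no staged source exists in CAS for a key."""


def fetch_from_cas(cas, key):
    # Only consults the CAS; never touches the source repository.
    try:
        return cas[key]
    except KeyError:
        raise SourceNotInCas(key)


def fetch_from_origin(cas, key, plugin_fetch_and_stage):
    # Fetches via the source plugin, then records the staged tree in CAS,
    # so the staged source is guaranteed to exist in CAS afterwards.
    digest = plugin_fetch_and_stage(key)
    cas[key] = digest
    return digest


def fetch(cas, key, plugin_fetch_and_stage):
    # Combined behaviour from the proposal: try CAS first, then fall back.
    try:
        return fetch_from_cas(cas, key), "cas"
    except SourceNotInCas:
        return fetch_from_origin(cas, key, plugin_fetch_and_stage), "origin"
```

With the calls split like this, a remote execution client could call only `fetch_from_cas` and stop at the tree, while a local build would use `fetch`.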

* push: during the push stage, check if the remotes have the ref and if not
push the staged source.

So really what this is doing under the hood is calling FindMissingBlobs on the remote CAS, uploading any missing blobs, and then calling the SourceCache service to put in a Source for a given source cache key?
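Spelled out as a sketch, the sequence I have in mind is the following. `FakeRemote` is an in-memory stand-in I made up for this mail; its `find_missing_blobs` only imitates the semantics of the FindMissingBlobs call, and `update_ref` is a placeholder for whatever the SourceCache service ends up offering.

```python
import hashlib


class FakeRemote:
    """Hypothetical in-memory stand-in for a remote CAS plus ref service."""

    def __init__(self):
        self.blobs = {}  # digest -> bytes
        self.refs = {}   # source cache key -> root digest

    def find_missing_blobs(self, digests):
        return [d for d in digests if d not in self.blobs]

    def upload(self, digest, data):
        self.blobs[digest] = data

    def update_ref(self, key, root_digest):
        self.refs[key] = root_digest


def digest_of(data):
    return hashlib.sha256(data).hexdigest()


def push(remote, key, root_digest, local_blobs):
    """Push a staged source; returns the number of blobs uploaded."""
    if remote.refs.get(key) == root_digest:
        return 0  # remote already has this ref, nothing to do
    missing = remote.find_missing_blobs(list(local_blobs))
    for digest in missing:
        remote.upload(digest, local_blobs[digest])
    # Only publish the ref once every blob it needs is present.
    remote.update_ref(key, root_digest)
    return len(missing)
```

The ordering matters: blobs first, ref last, so that nobody can resolve a source cache key to a tree whose contents are not yet on the remote.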

* stage: Also taking a virtual directory object, depending where this needs to
be staged, this may involve importing a CAS based directory, or copying the
staged file into a directory.

I'm not sure I follow.

* init_workspace: This requires an unstaged source, and so may require
fetching the source if the consistency is just 'STAGED'.

That's actually interesting. I don't think it does mean that necessarily. It may do for the git source case, and even in that case you could argue for using the staged source and then _upgrading_ that to be a checked out clone of a git ref (this can be deferred as a later optimization, granted). In the case of, say, a tar gz there should be no difference between staged and unstaged source (again this can be deferred as a later optimization, the 'upgrade' being effectively a no-op). Granted, this all gets a bit tricky when we're dealing with multiple sources for a single element...
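As a sketch of the decision I am describing (single source only, ignoring the multiple-sources complication): the function name, the returned plan strings, and the set of source kinds are all hypothetical and only illustrate the reasoning.

```python
# Hypothetical: source kinds where the staged form is already good enough
# to serve as a workspace, so the 'upgrade' is a no-op.
STAGED_IS_SUFFICIENT = {"tar"}


def workspace_plan(kind, has_unstaged, has_staged):
    """Decide how to open a workspace for a single source."""
    if has_unstaged:
        return "open"
    if has_staged and kind in STAGED_IS_SUFFICIENT:
        return "open-from-staged"
    if has_staged:
        # e.g. git: start from the staged tree, later turn it into a clone
        return "open-from-staged-then-upgrade"
    return "fetch-then-open"
```

Only the last branch needs to go to the source repository before the workspace can be opened.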

There are some additional questions that warrant discussion:
* Do we want to use the same remote CAS as the artifact CAS, or should the user
configure both separately?

We should allow for configurability. However, if the source cache remote defaults to being the same as the artifact CAS remote, I think that is fine.

* Should the unstaged source also be pushed to the remote CAS, or an option to
allow this? This would prevent additional fetching when a user wants to
initialise a workspace.

The unstaged source? I would say, probably not.

* A source's '_preflight' isn't necessary if we only require the staged sources.
Should this check be removed in the cases where it's not needed?

Some of these points may be optimisations that can be added later.

The artifact as a proto proposal [5] may also affect this, if reference services
are to distinguish between artifacts and directory objects, but I think this
also makes sense to be added later if this goes ahead.

I would prefer to start with stronger semantics here.

Points and criticisms appreciated.

Cheers,
Raoul

Cheers,

Sander

[1] https://gitlab.com/BuildStream/buildstream/merge_requests/1013
[2] https://gitlab.com/BuildStream/buildstream/merge_requests/1071
[3] https://gitlab.com/BuildStream/buildstream/issues/870
[4] https://gitlab.com/BuildStream/buildstream/issues/440
[5] https://mail.gnome.org/archives/buildstream-list/2019-January/msg00013.html
