Re: [BuildStream] Source cache plan

From: Tristan Van Berkom <tristan vanberkom codethink co uk>
To: Raoul Hidalgo Charman <raoul hidalgocharman codethink co uk>, buildstream-list gnome org
Subject: Re: [BuildStream] Source cache plan
Date: Wed, 06 Feb 2019 18:46:32 +0200

Hi Raoul,

So there was some movement on this thread, and I mostly agree with what
has been said by Jürg and Sander so far in the replies.

I will just add a couple of comments here to the original mail...

On Thu, 2019-01-17 at 12:56 +0000, Raoul Hidalgo Charman via BuildStream-list wrote:

Hi everyone,

[...]

I suggest that the 'SourceCache' class be part of the context (similar to
how 'ArtifactCache' is), and contains config related to source cache such
as which remote(s) to use and the local CAS object. When the element class
deals with sources it can now do it via the source cache rather than
directly calling source methods.


The artifact cache appeared on the `Context` object for some reason;
originally it was a part of the `Platform`.

The `Context` is really intended to represent the conditions under
which BuildStream was launched (i.e. all user preferences) - it is also
the focal point of logging but perhaps this could also be moved.

I consider it very strange that accessing this new @property on
`Context` results in the creation of the artifact cache, which I think
is a completely separate part of the core (while also being a
singleton).

That said, it would make sense to do SourceCache in the same way as
ArtifactCache, and untangle that separately so that `Context` does not
snowball into everything and the kitchen sink...

So far I think the source cache needs the following API which all take a
source object as an argument:
* get_consistency: Returns the source's consistency, I propose that the
   'Consistency' type has additional field 'STAGED', which corresponds to
   the source being staged in the CAS but not the unstaged source in
   sourcedir.
* fetch: depending on the consistency, will fetch from remote CAS or using
   source plugin.
* push: during the push stage, check if the remotes have the ref and if not
   push the staged source.
* stage: Also taking a virtual directory object, depending where this needs
   to be staged, this may involve importing a CAS based directory, or
   copying the staged file into a directory.
* init_workspace: This requires an unstaged source, and so may require
   fetching the source if the consistency is just 'STAGED'.
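
To make the proposed API concrete, here is a minimal sketch of the five
methods listed above. It is not BuildStream code: `SourceCacheSketch` and
`FakeSource` are hypothetical names, plain dicts stand in for the local and
remote CAS and for the virtual directory, and the position of 'STAGED'
between 'RESOLVED' and 'CACHED' is an assumption (the proposal only says
the field is added).

```python
import enum


class Consistency(enum.IntEnum):
    INCONSISTENT = 0
    RESOLVED = 1
    STAGED = 2   # assumption: staged tree in the CAS, no unstaged source in sourcedir
    CACHED = 3


class FakeSource:
    """Hypothetical stand-in for a source plugin."""

    def __init__(self, ref, files):
        self.ref = ref
        self._files = files
        self.fetched = False  # True once the unstaged source is in sourcedir

    def fetch(self):
        self.fetched = True

    def stage(self):
        assert self.fetched, "plugin staging needs the unstaged source"
        return dict(self._files)


class SourceCacheSketch:
    def __init__(self, remote=None):
        self.local = {}                                  # stands in for the local CAS
        self.remote = {} if remote is None else remote   # stands in for a remote CAS

    def get_consistency(self, source):
        if source.ref is None:
            return Consistency.INCONSISTENT
        if source.fetched:
            return Consistency.CACHED
        if source.ref in self.local:
            return Consistency.STAGED
        return Consistency.RESOLVED

    def fetch(self, source):
        if source.ref in self.local:
            return                                       # already staged locally
        if source.ref in self.remote:
            self.local[source.ref] = self.remote[source.ref]
            return                                       # pulled from the remote CAS
        if not source.fetched:
            source.fetch()                               # fall back to the plugin
        self.local[source.ref] = source.stage()

    def push(self, source):
        if source.ref in self.remote:
            return False                                 # remote already has the ref
        self.remote[source.ref] = self.local[source.ref]
        return True

    def stage(self, source, directory):
        directory.update(self.local[source.ref])

    def init_workspace(self, source, directory):
        if not source.fetched:
            source.fetch()                               # 'STAGED' alone is not enough
        directory.update(source.stage())
```

In this sketch a second client sharing the same remote ends up at
'STAGED' after fetch() without ever calling the plugin, and only
init_workspace() forces the plugin fetch.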



Interesting: Sander raises the point that we can optimize the cases
where the "staged source" is nearly identical to the actual backing
data (like a tarball).

I'll paste your reply here to keep it all in one email:

  "Ah, yeah it isn't necessary for some other sources, the default for
   init_workspace being the sources stage method. I'm not sure I follow
   the upgrading the staged source, we probably don't want to replace
   the staged source with the full git repo as we don't want this when
   building elements."

Ummm, maybe I'm misreading you but... Yes, we definitely want the
workspace of a git Source to be the checkout of the given git
repository including its history, that is expected.

What I am guessing about this "upgrade" idea of Sander's is that it
might be a redefinition of the Source.init_workspace() API, such that
the Source would expect a staged tree to *augment* with VCS history and
the like, instead of expecting an empty directory. It would seem that
if we start creating workspaces directly from the `SourceCache` at all,
then organizing the code in this way might be easier to read: the
init_workspace() plugin codepath would only be the final, extra step.
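
A rough sketch of how that "augment" reading could look, purely as an
illustration: `open_workspace`, `PlainSource` and `VcsSource` are
hypothetical names, a dict stands in for the workspace directory, and the
`.vcs/HEAD` entry is a placeholder for whatever history a real plugin
would add.

```python
def open_workspace(staged_tree, source):
    """Create a workspace from a cached staged tree, then let the plugin augment it."""
    workspace = dict(staged_tree)        # step 1: populate from the source cache
    source.init_workspace(workspace)     # step 2: final, extra plugin step
    return workspace


class PlainSource:
    """Hypothetical tarball-like source: the staged tree is already the workspace."""

    def init_workspace(self, directory):
        pass


class VcsSource:
    """Hypothetical VCS-like source: adds history on top of the staged tree."""

    def __init__(self, ref):
        self.ref = ref

    def init_workspace(self, directory):
        # Placeholder for real VCS metadata; a real plugin would likely
        # need to fetch the history before it could do this.
        directory[".vcs/HEAD"] = self.ref.encode()
```

With this shape, init_workspace() receives an already populated directory
rather than an empty one, which is the API change being guessed at above.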

There are some additional questions that warrant discussion:
* Do we want to use the same remote CAS as the artifact CAS, or should the
   user configure both separately?
* Should the unstaged source also be pushed to the remote CAS, or an option
   to allow this? This would prevent additional fetching when a user wants
   to initialise a workspace.


Others have replied, I agree we should not replicate entire VCS
histories into a remote CAS, at least to start.

* A source's '_preflight' isn't necessary if we only require the staged
   sources. Should this check be removed in the cases where it's not
   needed?


This is a hard problem.

You have replied to this thread saying that we cannot stop preflighting
sources because "we might be tracking". Of course we have to preflight
the sources we intend to track, *if* we intend to track them.

I think that the right way forward is definitely to keep preflighting
unconditionally at first and not block `SourceCache` work on perfecting
this detail, but it doesn't answer the question of what to do at that
point (we can figure it out then).

Problems here as I see them are:

  * If you preflight, then you require any host tools the source would
    use.

  * If you get the sources from a `SourceCache`, you don't need to
    install those host tools.

  * If you fail to download the source from `SourceCache`, it is very
    annoying to fail because of lacking host tools *late* in the
    process.

    Maybe this point on its own is strong enough to just say that
    "We must preflight in any scenario which *might* involve using
    Source APIs directly". Better to install git and not need it, than
    to find out hours later that a build stopped due to lack of git.
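
As a sketch of that rule only: the helper names below are hypothetical,
and treating "already staged locally" as the one case that cannot fall
back to the plugin is my assumption for illustration, not something
settled in this thread. The tool lookup uses the standard `shutil.which`.

```python
import shutil


class PreflightError(Exception):
    pass


def might_use_source_api(will_track, will_open_workspace, staged_locally):
    """Return True if this session might end up calling the source plugin."""
    if will_track or will_open_workspace:
        return True
    # A download from a remote source cache can fail and fall back to the
    # plugin, so only a source that is already staged locally is exempt.
    return not staged_locally


def preflight(required_tools, which=shutil.which):
    """Fail early, before any work starts, if a host tool is missing."""
    missing = [tool for tool in required_tools if which(tool) is None]
    if missing:
        raise PreflightError("missing host tools: " + ", ".join(missing))
```

The intent is that preflight() runs up front whenever
might_use_source_api() returns True, so a missing git is reported
immediately rather than hours into a build.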

Cheers,
    -Tristan
