Re: Source distribution, WAS: Re: Proposal for Remote Execution



On Wed, 2018-05-30 at 11:00 +0200, Sander Striker wrote:
Hi,

This post is not really commenting on the approach we're taking to
implement remote execution, but rather what it enables us to do in
terms of source distribution.

This sounds attractive; also note there has been a writeup at:

     Should we use a CAS-based source cache as a mirror?
     https://gitlab.com/BuildStream/buildstream/issues/418

Good question.

[...]
Staging sources into CAS is problematic if e.g. the whole .git
directory is included. We should avoid this where possible. This is
already a concern for caching of build trees (#21) as well and can be
improved on independently of any other steps.

If I understand this correctly, we have an opportunity for simplified
source "mirroring" here.

If we introduce a SourceCache, which maps a sourcekey to a CAS
directory node, the fetch operation becomes:
1) look up SourceKey in SourceCache
   1a.1) when an entry is present, fetch the Directory nodes from CAS
         and store them in local CAS.
   1b.1) when no entry is found, fetch the source in the traditional
         sense
   1b.2) stage the source in a temporary location (assumes #376 is
         resolved)
   1b.3) put the staged source into the local CAS
      iff you have write permissions to SourceCache:
   1b.4) upload the staged source to CAS
   1b.5) put an entry into SourceCache
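
The fetch flow listed above can be sketched roughly as follows. This is
only an illustration under simplifying assumptions, not BuildStream
code: the staged source tree is reduced to a single bytes blob instead
of a tree of Directory nodes, SourceCache is a plain dict, and the names
ToyCAS, fetch, fetch_and_stage and can_write are all hypothetical.

```python
import hashlib
from typing import Callable, Dict


class ToyCAS:
    """Toy content-addressed store mapping a SHA-256 digest to bytes."""

    def __init__(self) -> None:
        self.blobs: Dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        self.blobs[digest] = data
        return digest


def fetch(
    source_key: str,
    source_cache: Dict[str, str],            # hypothetical: source key -> digest
    remote_cas: ToyCAS,
    local_cas: ToyCAS,
    fetch_and_stage: Callable[[str], bytes],  # hypothetical traditional fetch + stage
    can_write: bool,
) -> str:
    """Return the digest of the staged source, ensuring it is in local_cas."""
    digest = source_cache.get(source_key)
    if digest is not None:
        # Entry present: copy the content from the remote CAS into the
        # local CAS; no traditional fetch is needed.
        local_cas.put(remote_cas.blobs[digest])
        return digest

    # No entry: fetch in the traditional sense, stage, and store locally.
    staged = fetch_and_stage(source_key)
    digest = local_cas.put(staged)

    if can_write:
        # Only with write permission: upload the staged source and
        # record the mapping so other instances can skip the fetch.
        remote_cas.put(staged)
        source_cache[source_key] = digest
    return digest
```

The first instance to see a given source key pays for the traditional
fetch; with write permission it populates the shared cache, and later
instances only copy content out of CAS.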

This does optimize for remote execution, in the sense that actual
fetching of the source as well as fetching the files that make up the
source is avoided.
I could envision a configuration option, allowing the user to say:
- I am always building locally, make sure that my local CAS has
everything I need to build locally.
- I want the actual sources locally, make sure to always fetch in the
traditional sense

In terms of source mirroring, this could now be a central instance
of bst that is just running bst fetch.  It will have all of the
fetched sources locally, in case they need to be inspected.  All
other instances will pull from SourceCache as CAS.

Not entirely, I think you want the sources to also have been *staged*
in the way they would be used by their elements, to commit the staged
results to CAS; rather than committing an entire git repo or a still
packed up tarball (at least, this is how I'm reading the intentions).

We'd probably want this mirrored blob to be addressable by the element
and its source configurations; i.e. the blob is one build directory
after all "Source" objects have unpacked what they want.
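
As a rough sketch of what "addressable by the element and its source
configurations" could mean, one might derive a key by hashing a
canonical serialization of that information. This is an assumption made
for illustration only, not how BuildStream computes its keys; the
function name staged_source_key and the config fields shown in the test
data are hypothetical.

```python
import hashlib
import json
from typing import Any, Dict, List


def staged_source_key(element_name: str,
                      source_configs: List[Dict[str, Any]]) -> str:
    """Derive a stable key for the staged result of an element's sources.

    The list order of source_configs is preserved, since sources are
    staged in order; dict keys are sorted so that equivalent configs
    always serialize identically.
    """
    payload = json.dumps(
        {"element": element_name, "sources": source_configs},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Any change to a source's ref or configuration yields a different key,
so a stale staged blob is never picked up for a changed element.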

This blob is also going directly into the artifact; i.e. see:

    Caching of build trees
    https://gitlab.com/BuildStream/buildstream/issues/21

So we might avoid redundancy when an artifact cache server is also a
source cache server...

The above would address fetching sources reliably.  It would also
address:
#261: Investigate the use of git shallow clones (to build instead
tarballs)
#330: Support for mirroring of upstream sources
  To an extent, as the original format of the sources is not
propagated beyond the host running the "mirror".  However, without
the need to set up anything in terms of serving the sources in their
original format.
#328: Support for downloading sources from mirrors
  It covers the case of getting source from a local ecosystem; at
least for anything recent.  Geographical awareness will need to be a
higher level concept, that applies to endpoints like ArtifactCache,
SourceCache, CAS, etc.


The lifespan of SourceCache/CAS entries might be limited.  This can
be mitigated by keeping an archive of the original .bst files, and
the SourceCache/CAS entries, such that there is always a way to go
back [years] in time.  Without even having to worry too much about the
host tools (git, bzr, etc).

Opening a workspace will still require the traditional source fetch. 
This should happen on-demand if there is no source present locally. 
Alternatively the user could force a fetch.  This would be an action
a user would perform e.g. when preparing to be offline.

Thoughts?

I have some concerns.

It feels less robust, as we are not saving the source for posterity in
its original format. I don't think we should trust the cache for
important things we want to store; we should use something
explicitly purposed for that.

As you highlight above, even if we did go the extra mile to ensure that
the Source Cache is persistent for the sources we will need, there are
still places where we expect the original format to be available, which
mirrors should ensure (like workspaces).

Basically, I think that "trusting the intermediate source cache
designed for remote execution to consequently solve source mirroring"
is the wrong way to think about this; we are probably looking at two
potentially useful, but separate things:

  o Sharing the SourceCache

    - This is still just a "cache"
    - This contains only the unpacked/staged sources
    - Sharing this means I don't need the whole git locally,
      if a shared source cache already has exactly this

  o Source Mirroring

    - I don't worry about third party sites going missing
    - I can reproduce this forever, until I delete my mirror(s)
    - I probably want the original sources in their original formats

I don't think that Sharing a SourceCache is really going to solve the
problem of Source Mirroring, but there is no reason why Source
Mirroring could not be implemented using CAS technology also (and the
implementation possibly simplified by this ?).

Also there seems to be no reason for Sharing a SourceCache to be a bad
idea in a scenario where Source Mirroring is also available (the
SourceCache will act mostly as an optimized hot cache of things people
have been building lately).

Cheers,
    -Tristan


