Re: Feature proposal: multiple cache support V2



On Fri, 2017-11-03 at 19:14 +0000, Sam Thursfield wrote:
Sorry for the delayed response here :-)

Same here, I seem to have let this one slip under the mountain of other
things I've been thinking about.

Overall the proposal looks relatively sane, and I agree with the recent
comments about priority; however I'm not sure that the priority is as
meaningful as one might think.

In the expected case that this was originally conceived for (at
GUADEC), one might have a team of developers diverge from mainline and
work on some branch; in which case for the duration of the work in
progress, one would want:

  o Default the push URI to a separate cache, work in progress
    artifacts for that branch end up in a separate "devel" cache

    This means that normally only diverging artifacts get pushed to
    the "devel" cache, so there is not much redundancy between the
    two caches.

  o Allow pulling from both caches

    In this case, "stable" artifacts from mainline (cache keys which
    have not diverged because of ongoing development) are usually
    found in the "stable" cache.

    Other "devel" artifacts are pulled from the "devel" cache, when
    those cache keys have diverged from "stable" mainline.

So yes, the order being significant is important, however not *that*
important afaics, because artifacts which have diverged from mainline
are anyway only available in the diverged cache (and whether you
download the artifact from one cache or another is pretty
insignificant, if the cache key exists, it should be what you expect it
to be).
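
To make that concrete, here is a rough Python sketch of the scenario.
Nothing in it is BuildStream API: the `Cache` class, the `pull()`
helper and the key names are all made up for illustration, and each
cache is modelled as a plain dictionary indexed by cache key.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Cache:
    """Hypothetical stand-in for a remote artifact cache."""
    name: str
    artifacts: Dict[str, bytes] = field(default_factory=dict)

    def push(self, key: str, data: bytes) -> None:
        self.artifacts[key] = data


def pull(key: str, caches: List[Cache]) -> Optional[Tuple[str, bytes]]:
    """Return (cache name, artifact) from the first cache holding the key."""
    for cache in caches:
        if key in cache.artifacts:
            return cache.name, cache.artifacts[key]
    return None


stable = Cache("stable")
devel = Cache("devel")

# Mainline artifacts already live in the "stable" cache.
stable.push("key-base", b"base artifact")

# Work in progress on the branch is pushed to "devel" by default.
devel.push("key-diverged", b"branch artifact")

# A diverged key is found in "devel" whichever order the caches are tried in.
assert pull("key-diverged", [stable, devel])[0] == "devel"
assert pull("key-diverged", [devel, stable])[0] == "devel"

# If a key happens to exist in both caches, the order only decides which
# server is asked; the content for a given cache key is the same either way.
devel.push("key-base", b"base artifact")
assert pull("key-base", [stable, devel])[1] == pull("key-base", [devel, stable])[1]
```

The order only changes which server answers for keys present in both
caches, which is why I consider it a secondary concern.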

Few more comments in line:

On 20/10/17 15:13, Sander Striker wrote:
On Mon, Oct 16, 2017 at 4:28 PM Sam Thursfield <
sam thursfield codethink co uk> wrote:

    * make our ssh:// cache URLs usable for pulls as well as pushes


I'll have a look at #112 to see why ssh:// URLs specifically.

I presume you already did; but in case not: OSTree pushes require two 
way communication, and need to be behind some kind of authentication so 
that there is access control. Currently SSH is the simplest solution. 
There are other options of course, HTTPS comes to mind although the fact 
that our push protocol is stateful makes that harder than it is for Git.

To add extra clarity here, for #112 it is important that we use a
single URI scheme to address an artifact cache, for multiple reasons:

  a.) It is currently confusing that we need 2 URIs. It leads one to
      suspect that one can configure the push and pull URIs to point
      at separate caches, which is simply not allowed.

  b.) For future proofing of the API, having to support this 2 URI
      scheme on the artifact server side is burdensome; it leaves
      less flexibility to refactor and improve the artifact server
      code without breaking user facing configuration API.

Finally, in a similar way to how git works, it is not important to have
every URI scheme allow for both read and write access to the remote
cache. If a given URI scheme only supports read-only access, this is
fine, however if a given URI scheme supports write access, I think it
should also support read access.
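
As a rough illustration of that rule, here is a Python sketch. The
capability table is an assumption made for the example (which schemes
end up supporting what is not decided here), and `Access`,
`SCHEME_ACCESS`, `can_pull()` and `can_push()` are hypothetical names,
not existing BuildStream code.

```python
from enum import Flag, auto
from urllib.parse import urlsplit


class Access(Flag):
    READ = auto()
    WRITE = auto()


# Hypothetical capability table, assumed for this example only.
SCHEME_ACCESS = {
    "http": Access.READ,
    "https": Access.READ,
    "ssh": Access.READ | Access.WRITE,
}

# The rule: a scheme which supports writing must also support reading.
# Read-only schemes are fine.
for scheme, access in SCHEME_ACCESS.items():
    if Access.WRITE in access:
        assert Access.READ in access, f"{scheme}:// would be write-only"


def _access_for(url: str) -> Access:
    """Look up what the URL's scheme allows; unknown schemes allow nothing."""
    return SCHEME_ACCESS.get(urlsplit(url).scheme, Access(0))


def can_pull(url: str) -> bool:
    return Access.READ in _access_for(url)


def can_push(url: str) -> bool:
    return Access.WRITE in _access_for(url)


# One canonical URL per cache; what can be done with it follows from the scheme.
assert can_pull("https://cache.example.com/artifacts")
assert not can_push("https://cache.example.com/artifacts")
assert can_pull("ssh://artifacts@cache.example.com/artifacts")
assert can_push("ssh://artifacts@cache.example.com/artifacts")
```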

    * change the way projects and users specify artifact caches, so that
      each entry is just single (canonical) ssh://, http:// or https://
      URL instead of having `pull-url` and `push-url` pairs.

    * allow projects and users to specify multiple artifact caches in a
      list

    * make pipelines pull artifacts from any cache that has a given
      artifact available, in a 'priority' order

    * add `bst pull --cache=URL` and `bst push --cache=URL` option to
      allow pushing to arbitrary caches

Specifying the URI on the command line is interesting especially for
push and pull commands, but we need to be careful about this.

Since this is related to other discussion, I'll elaborate further
below...


    * add a `--cache-timeout` configuration option to control how long
      BuildStream waits for a cache to respond before considering it
      unreachable

I'd like to jump in here with a nitpick:

There are a variety of reasons why BuildStream accesses the network,
and I don't think it's sane to introduce use-case specific options for
each one separately.

I would rather introduce a single option for network timeouts in
general.


Can we detect network being unavailable or do we need to try all cache
entries with a timeout?

I'm not sure how to detect network availability in a platform-agnostic 
way. With my GNOME hat on I'd ask NetworkManager, but that doesn't sound 
like the right approach here.

It also doesn't solve the case of partial connectivity, such as 
restrictive firewalls which may allow access to one cache but silently 
block connections to others.
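
For what it's worth, probing each cache individually with one general
network timeout could look roughly like the Python sketch below. This
is an illustration under assumptions, not a proposed implementation:
`tcp_probe()`, `reachable_caches()`, the default port table and the
example hosts are all hypothetical, and a real check would need to
speak the actual cache protocol rather than just open a TCP
connection. The probe is injectable so that the demonstration at the
end does not touch the network.

```python
import socket
from typing import Callable, List
from urllib.parse import urlsplit

# Default ports per scheme, used when the URL does not name one.
DEFAULT_PORTS = {"http": 80, "https": 443, "ssh": 22}

Probe = Callable[[str, float], bool]


def tcp_probe(url: str, timeout: float) -> bool:
    """Try to open a TCP connection to the cache's host within `timeout`."""
    parts = urlsplit(url)
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    if parts.hostname is None or port is None:
        return False
    try:
        with socket.create_connection((parts.hostname, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, DNS failures and timeouts alike.
        return False


def reachable_caches(urls: List[str], network_timeout: float,
                     probe: Probe = tcp_probe) -> List[str]:
    """Keep only the caches which answer within the general network timeout."""
    return [url for url in urls if probe(url, network_timeout)]


# Simulate a firewall which silently blocks one of the two caches.
def firewalled_probe(url: str, timeout: float) -> bool:
    return urlsplit(url).hostname != "blocked.example.com"


caches = ["https://open.example.com/cache", "https://blocked.example.com/cache"]
usable = reachable_caches(caches, network_timeout=5.0, probe=firewalled_probe)
assert usable == ["https://open.example.com/cache"]
```

Since every cache is tried on its own, partial connectivity simply
drops the blocked entries and keeps the rest, preserving their order.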

Proposed changes
----------------


...

Users and projects will specify caches in the same places as before.
However in each place an ordered list of URLs for different caches
will be allowed.

For example, the GNOME SDK project could specify this in their
`project.conf` file:

       artifacts:
         - https://sdk.gnome.org/cache-releases
         - https://sdk.gnome.org/cache-latest

The order is significant. In this case 'cache-latest' has higher
priority than 'cache-releases' as it is listed afterwards. Thus if an
artifact is in 'cache-latest' it will always be pulled from there, not
from 'cache-releases'.


Just wondering if this is the right ordering interpretation or whether we
have it reversed.  If we assume the order to be the order of preference, it
means we should try the first entry, if not present try the second, and so
forth.

Yes, on second thoughts my proposal is a bit backwards :-) Higher in the 
list should mean higher precedence.

...

There will no longer be a way to remove a cache from the list of
configured caches. BuildStream will try to contact each cache on
startup and any that do not respond within a given timeout will
be considered unreachable. We can't anticipate a timeout value that will
work well in all situations so we will make it configurable through a
`--cache-timeout` option.


Is a --cache="" sufficient to disable fetching from caches?

I wasn't planning on having `bst build --cache...` at all. There are
ways it could make sense but I'm not entirely sure what is the best way
for that to interact with the multiple cache support. I.e. if the user
specifies a cache on the command line, does that replace those from their
user config, or replace those from the user config *and* the project
config, or is it just added to the list of caches that we fetch from?

Specifying cache URIs on the command line, yes, good convenience, but
I'm a little bit worried about where this might lead, needs a bit of
thought.

Originally, the cache URIs were user configuration only, but this was
problematic because the same user might work on multiple projects on
their computer, and it can be dangerous to assume that artifacts
originating from one project can be pushed to a cache used for another
project; i.e. consider a user who works on both proprietary and free
software projects, publishing proprietary artifacts to public caches is
something BuildStream should adamantly disallow at least as default
behavior.

While this seems to be orthogonal to the proposal on the table right
now, that is only because we have not merged Jürg's work on inter
project dependencies *yet*.

So, in a near post 1.0 release future, a single `bst build` session can
and will include multiple projects in the same pipeline. I think the
sane default in this scenario will be to pull/push artifacts to their
respective artifact caches declared in their respective project
configurations.

I'm not saying I know what the answer is here, but it seems that a
global --cache option to BuildStream's CLI is at odds with pipelines
constructed from multiple projects - and this proposal should be taking
that into account.
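
To illustrate the concern, here is a small Python sketch of that
default. It is only a model under assumptions: the `Element` class,
the `PROJECT_CACHES` table, `push_targets()` and the project names and
URLs are invented for the example and do not reflect how BuildStream
or the inter project dependency work is actually structured.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Element:
    """Hypothetical pipeline element which remembers its owning project."""
    name: str
    project: str


# Caches as each project would declare them in its own configuration
# (made-up project names and URLs).
PROJECT_CACHES: Dict[str, List[str]] = {
    "free-sdk": ["https://cache.example.org/free-sdk"],
    "proprietary-app": ["ssh://artifacts@cache.example.com/app"],
}


def push_targets(element: Element) -> List[str]:
    """Caches an artifact may be pushed to: only its own project's caches."""
    return PROJECT_CACHES.get(element.project, [])


# One pipeline, two projects.
pipeline = [
    Element("base.bst", "free-sdk"),
    Element("app.bst", "proprietary-app"),
]

# Each artifact stays within the caches of the project it belongs to, so
# proprietary artifacts never reach the public cache by default.
for element in pipeline:
    for url in push_targets(element):
        assert url in PROJECT_CACHES[element.project]

assert "https://cache.example.org/free-sdk" not in push_targets(pipeline[1])
```

A single global cache URL has no such per-project association, which
is exactly where it becomes awkward.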

Cheers,
    -Tristan


