Re: Feature proposal: multiple cache support V2

From: Sander Striker <s striker striker nl>
To: buildstream-list gnome org
Subject: Re: Feature proposal: multiple cache support V2
Date: Mon, 06 Nov 2017 13:58:15 +0000

On Mon, Nov 6, 2017 at 9:17 AM Tristan Van Berkom <tristan vanberkom codethink co uk> wrote:

On Fri, 2017-11-03 at 19:14 +0000, Sam Thursfield wrote:
> Sorry for the delayed response here :-)

Same here, I seem to have let this one slip under the mountain of other
things I've been thinking about.

Overall the proposal looks relatively sane, and I agree with the recent
comments about priority; however, I'm not sure that the priority is as
meaningful as one might think.

In the expected case that this was originally conceived for (at
GUADEC), one might have a team of developers diverge from mainline and
work on some branch; in which case for the duration of the work in
progress, one would want:

o Default the push URI to a separate cache, so that work-in-progress
artifacts for that branch end up in a separate "devel" cache.

This means that normally only diverging artifacts get pushed to
the "devel" cache, so there is not much redundancy between the
two caches.

o Allow pulling from both caches

In this case, normally "stable" artifacts from mainline (cache keys
which have not diverged because of ongoing development) are usually
found in the "stable" cache.

Other "devel" artifacts are pulled from the "devel" cache, when
those cache keys have diverged from "stable" mainline.

So yes, the order being significant is important, however not *that*
important afaics, because artifacts which have diverged from mainline
are anyway only available in the diverged cache (and whether you
download the artifact from one cache or another is pretty
insignificant, if the cache key exists, it should be what you expect it
to be).
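
To make the two-cache workflow above concrete, here is a minimal Python sketch. It is not BuildStream code: the `Cache` class, `pull_artifact`, `push_artifact` and the cache keys are all hypothetical names, and the remote caches are modelled as in-memory sets of keys.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Cache:
    """Hypothetical stand-in for a remote artifact cache."""
    name: str
    keys: set = field(default_factory=set)


def pull_artifact(key: str, caches: list) -> Optional[str]:
    """Return the name of the first cache, in list order, that holds `key`."""
    for cache in caches:
        if key in cache.keys:
            return cache.name
    return None


def push_artifact(key: str, push_cache: Cache) -> None:
    """Push only to the single default push cache."""
    push_cache.keys.add(key)


stable = Cache("stable", {"aaa111", "bbb222"})
devel = Cache("devel")

# Work on the branch changes one element, so its cache key diverges
# and the resulting artifact is pushed to "devel" only.
push_artifact("ccc333", devel)

# A diverged key exists only in "devel", so the pull order makes
# little difference to where it is found.
found_in = [pull_artifact(k, [stable, devel]) for k in ("aaa111", "ccc333", "zzz999")]
print(found_in)
```

Swapping the list to `[devel, stable]` gives the same answer for every key in this example, which is the sense in which the order is not *that* important.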

I guess that's true. I was thinking of the case where the 'stable' cache is more trusted than the 'devel' cache... not so sure that holds unless one is doing signing and the other isn't.

Few more comments in line:

> On 20/10/17 15:13, Sander Striker wrote:
> > On Mon, Oct 16, 2017 at 4:28 PM Sam Thursfield <
> > sam thursfield codethink co uk> wrote:
> > >
> > >     * make our ssh:// cache URLs usable for pulls as well as
> > > pushes
> > >
> >
> > I'll have a look at #112 to see why ssh:// URLs specifically.
>
> I presume you already did; but in case not: OSTree pushes require two
> way communication, and need to be behind some kind of authentication so
> that there is access control. Currently SSH is the simplest solution.
> There are other options of course, HTTPS comes to mind although the fact
> that our push protocol is stateful makes that harder than it is for Git.

To add extra clarity here, for #112 it is important that we use a
single URI scheme to address an artifact cache, for multiple reasons:

a.) It is currently confusing that we need 2 URIs; it leads one to
suspect that one can configure the push and pull URIs to point at
separate caches, which is simply not allowed.

b.) For future proofing of the API, having to support this 2 URI
scheme on the artifact server side is burdensome; it leaves
less flexibility to refactor and improve the artifact server
code without breaking user facing configuration API.

Finally, in a similar way to how git works, it is not important to have
every URI scheme allow for both read and write access to the remote
cache. If a given URI scheme only supports read-only access, this is
fine, however if a given URI scheme supports write access, I think it
should also support read access.
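
The rule in the last paragraph (write access implies read access, read-only schemes are fine) can be sketched as a small check. This is only an illustration: the `SCHEME_ACCESS` table, the helper names and the example URLs are assumptions, not BuildStream's actual behaviour, and the table shows the state #112 aims for rather than the current one.

```python
from urllib.parse import urlsplit

# Assumed capability table, for illustration only.
SCHEME_ACCESS = {
    "ssh": {"read", "write"},
    "http": {"read"},
    "https": {"read"},
}


def check_capabilities(table: dict) -> None:
    """Enforce the rule: a scheme that can write must also be able to read."""
    for scheme, access in table.items():
        if "write" in access and "read" not in access:
            raise ValueError(f"{scheme}:// allows push but not pull")


def can_pull(url: str) -> bool:
    """True if the URL's scheme is known and supports read access."""
    return "read" in SCHEME_ACCESS.get(urlsplit(url).scheme, set())


def can_push(url: str) -> bool:
    """True if the URL's scheme is known and supports write access."""
    return "write" in SCHEME_ACCESS.get(urlsplit(url).scheme, set())


check_capabilities(SCHEME_ACCESS)  # passes: no write-only scheme in the table

url = "ssh://artifacts@cache.example.com/artifacts"  # hypothetical cache URL
print(can_pull(url), can_push(url))
print(can_pull("https://cache.example.com/artifacts"),
      can_push("https://cache.example.com/artifacts"))
```

With a single URL per cache, a read-only scheme simply means that cache is never a push target.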

> > >     * change the way projects and users specify artifact caches, so that
> > >       each entry is just a single (canonical) ssh://, http:// or https://
> > >       URL instead of having `pull-url` and `push-url` pairs.
> > >
> > >     * allow projects and users to specify multiple artifact caches in a
> > >       list
> > >
> > >     * make pipelines pull artifacts from any cache that has a given
> > >       artifact available, in a 'priority' order
> > >
> > >     * add `bst pull --cache=URL` and `bst push --cache=URL` option to
> > >       allow pushing to arbitrary caches

Specifying the URI on the command line is interesting especially for
push and pull commands, but we need to be careful about this.

Since this is related to other discussion, I'll elaborate further
below...

> > >
> > >     * add a `--cache-timeout` configuration option to control how long
> > >       BuildStream waits for a cache to respond before considering it
> > >       unreachable

I'd like to jump in here with a nitpick:

There are a variety of reasons why BuildStream accesses the network,
and I don't think it's sane to introduce use-case specific options for
each one separately.

I would rather introduce a single option for network timeouts in
general.
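
As a rough illustration of what one shared timeout could look like, the sketch below passes a single value to every network probe instead of using a cache-specific option. Everything here is hypothetical (`is_reachable`, `reachable_caches`, `DEFAULT_PORTS`); it only demonstrates the idea with a plain TCP connection attempt and says nothing about how BuildStream actually talks to a cache.

```python
import socket
from urllib.parse import urlsplit

# Assumed default ports for the URL schemes mentioned in this thread.
DEFAULT_PORTS = {"http": 80, "https": 443, "ssh": 22}


def is_reachable(url: str, timeout: float, connect=socket.create_connection) -> bool:
    """Try a TCP connection to the cache host, giving up after `timeout` seconds.

    `connect` is injectable so the function can be exercised without a network.
    """
    parts = urlsplit(url)
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    if parts.hostname is None or port is None:
        return False
    try:
        conn = connect((parts.hostname, port), timeout=timeout)
    except OSError:
        # Covers refused connections, DNS failures and timeouts alike.
        return False
    conn.close()
    return True


def reachable_caches(urls: list, timeout: float, connect=socket.create_connection) -> list:
    """Keep list order; drop caches that do not answer within the shared timeout."""
    return [u for u in urls if is_reachable(u, timeout, connect)]
```

Because each cache is probed individually, a cache that is silently blocked is dropped after the timeout while the others remain usable.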

> >
> > Can we detect network being unavailable or do we need to try all cache
> > entries with a timeout?
>
> I'm not sure how to detect network availability in a platform-agnostic
> way. With my GNOME hat on I'd ask NetworkManager, but that doesn't sound
> like the right approach here.
>
> It also doesn't solve the case of partial connectivity, such as
> restrictive firewalls which may allow access to one cache but silently
> block connections to others.
>
> > > Proposed changes
> > > ----------------
> > >
>
> ...
> > >
> > > Users and projects will specify caches in the same places as before.
> > > However in each place an ordered list of URLs for different caches
> > > will be allowed
> > >
> > > For example, the GNOME SDK project could specify this in their
> > > `project.conf` file:
> > >
> > >        artifacts:
> > >          - https://sdk.gnome.org/cache-releases
> > >          - https://sdk.gnome.org/cache-latest
> > >
> > > The order is significant. In this case 'cache-latest' has higher
> > > priority than 'cache-releases' as it is listed afterwards. Thus if an
> > > artifact is in 'cache-latest' it will always be pulled from there, not
> > > from 'cache-releases'.
> > >
> >
> > Just wondering if this is the right ordering interpretation or whether we
> > have it reversed.  If we assume the order to be the order of preference, it
> > means we should try the first entry, if not present try the second, and so
> > forth.
>
> Yes, on second thoughts my proposal is a bit backwards :-) Higher in the
> list should mean higher precedence.
>
> ...
> > >
> > > There will no longer be a way to remove a cache from the list of
> > > configured caches. BuildStream will try to contact each cache on
> > > startup and any that do not respond within a given timeout will
> > > be considered unreachable. We can't anticipate a timeout value that will
> > > work well in all situations so we will make it configurable through a
> > > `--cache-timeout` option.
> > >
> >
> > Is a --cache="" sufficient to disable fetching from caches?
>
> I wasn't planning on having `bst build --cache...` at all. There are
> ways it could make sense but I'm not entirely sure what is the best way
> for that to interact with the multiple cache support. I.e. if the user
> specifies a cache on the command line, does that replace those from their
> user config, or replace those from the user config *and* the project
> config, or is it just added to the list of caches that we fetch from .. ?

Specifying cache URIs on the command line, yes, good convenience, but
I'm a little bit worried about where this might lead, needs a bit of
thought.

Originally, the cache URIs were user configuration only, but this was
problematic because the same user might work on multiple projects on
their computer, and it can be dangerous to assume that artifacts
originating from one project can be pushed to a cache used for another
project; I.e. consider a user who works on both proprietary and free
software projects, publishing proprietary artifacts to public caches is
something BuildStream should adamantly disallow at least as default
behavior.

While this seems to be orthogonal to the proposal on the table right
now, it is only because we have not yet merged Jürg's work on inter
project dependencies.

So, in a near post 1.0 release future, a single `bst build` session can
and will include multiple projects in the same pipeline, I think the
sane default in this scenario will be to pull/push artifacts to their
respective artifact caches declared in their respective project
configurations.

I'm not saying I know what the answer is here, but it seems that a
global --cache option to BuildStream's CLI is at odds with pipelines
constructed from multiple projects - and this proposal should be taking
that into account.
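
A small sketch of the tension described above, under invented names: `Element`, `PROJECT_CACHES`, `push_targets` and all URLs are hypothetical and only model the idea that each element in a multi-project pipeline belongs to one project with its own declared caches.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    name: str
    project: str


# Hypothetical per-project cache declarations, as if read from each
# project's own configuration.
PROJECT_CACHES = {
    "free-sdk": ["https://cache.example.org/free-sdk"],
    "proprietary-app": ["https://cache.internal.example.com/app"],
}


def push_targets(element, project_caches, global_cache=None):
    """Return the caches an element's artifact would be pushed to."""
    if global_cache is not None:
        # A global override ignores which project the element belongs to.
        return [global_cache]
    return project_caches[element.project]


pipeline = [Element("base.bst", "free-sdk"), Element("app.bst", "proprietary-app")]

# Default: every artifact goes to its own project's cache.
per_project = {e.name: push_targets(e, PROJECT_CACHES) for e in pipeline}

# With a global cache option, the proprietary artifact would land in
# the public cache as well.
overridden = {
    e.name: push_targets(e, PROJECT_CACHES, "https://cache.example.org/free-sdk")
    for e in pipeline
}
print(per_project)
print(overridden)
```

In the second mapping `app.bst` is pushed to the public cache, which is exactly the kind of result that should not happen by default.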

That's a good point. The current user configuration for artifact servers allows for both per project overrides (or additions as per this proposal?), as well as a global option.

Are you suggesting that we verify this was the user's intent the first time a project is being pushed to the globally set one in the user configuration?

Cheers,

Sander

Cheers,
-Tristan


