Re: Discussion on source mirroring



Hi,

On Fri, Mar 16, 2018 at 7:30 AM Tristan Van Berkom <tristan vanberkom codethink co uk> wrote:
Thanks for the writeup !

On Thu, 2018-03-15 at 16:30 +0000, Jonathan Maw wrote:
> I've been giving some thought on source mirroring, recently, after 
> reading the discussion at 
> https://gitlab.com/BuildStream/buildstream/issues/179.
>
> Source mirroring will be valuable for us because:
> * The canonical upstream may disappear without warning
> * The canonical upstream may be slow to access due to limited 
> infrastructure or geographical distance.

* The organization may be mirroring everything in a local build farm
  - To be sure that their builds are repeatable in 10 years
  - To optimize fetch times on build machines
  - Without losing the information of what the original URL was

I have to say I got slightly confused here: we seem to be talking about source mirroring, yet the referenced issue is about remote cache mirroring?
 
> I briefly considered whether it's possible to do a "one size fits all" 
> source mirror, but I don't think it's doable.
> Each source is permitted to store files in the local source cache in 
> whatever format they feel is appropriate - as a result, merging the 
> remote and the local cache is dependent on which methods are suitable 
> for each kind of source.

I'm not sure what a "one size fits all" means here; so for the purpose
of this email, I'm going to interpret it as:

  "A solution which requires only that a project can provide a list
   of mirrors, expect that BuildStream has populated them, and walk
   the mirrors, regardless of what Source plugin type is in use"

"expect that BuildStream has populated them" -- I question this part at this stage.
 
With current Sources in place, we can /mostly/ achieve this with a
local source cache hosted on apache, such that implementing a mirror is
a matter of running `bst fetch` (or a similar new command) on a mirror
host periodically, on a list of projects.

I personally would like to see:

* A possibility to teach Sources some new tricks for this
  - May include caching differently, to ensure hosting over HTTP(s)
    works
  - Must include a new method for *pulling everything*, not just
    the requested tracking branch or ref

The "git" source already does a clone mirror on initial setup, right?  What mirroring actually looks like can be very specific to the source type and to the organization or project.
 
* A configuration setup which would allow getting sources from
  alternative URLs

+1.
 
* Hopefully some comprehensive configuration API, with the expectation
  that both solutions are a possibility.

And I would like that the second approach, more similar to what you are
proposing here, is implemented first; without *requiring* an eventual
enhancement, but with forward thinking so that a one size fits all
mirroring solution is not *prohibited* in the future.

I think that is key.  
 
Basically, something similar to what you propose is more useful more
quickly, but let's not shut the door on an eventual, more complete
solution; let's ensure that we can do that as another step in the
future.

Note that your proposed implementation path already teaches a lot to
Sources, so we already cross a line where:

  "Projects which use mirrors, require that the Source plugins they
   use understand how to use mirrors"

If we must cross that line anyway, there is no reason to declare that a
one size fits all solution is "not doable".

> Since we will have to do separate work for each source, we have the 
> opportunity to make fetching use the
> same protocol as we use for fetching sources normally, so I suggest the 
> following:
>
> In the project.conf, the aliases dict can key to a list of URLs instead 
> of just a single URL, e.g.
>
> > aliases:
> >   github:
> >   - https://mirrorsrv.example.com/github
> >   - https://github.com
> >   sourceforge: http://downloads.sourceforge.net
>
> The implementation of being able to fetch from multiple sources is not 
> trivial, however.
> At its simplest, we update all sources' fetch and track methods to use 
> multiple repo aliases.
>
> To reduce the amount of complexity that we expect plugin authors to 
> write, we might do one of the following:
> * Create a method that takes an aliased URL and yields every URL it can 
> generate from the aliases it knows.
> * Where we currently call fetch and track, iterate over every possible 
> URL and keep calling fetch/track as long as they return an appropriate 
> return value / exception to indicate that it failed because it couldn't 
> access that URL.

That sounds expensive.  I propose we only do this for sources that do not know how to deal with multiple URLs natively, and have sources grow an optional method that does take multiple URLs into account.
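
As a rough sketch of that split (not BuildStream API: `expand_alias`, `fetch_source`, the optional `fetch_from_urls` method and `MirrorUnreachable` are all made-up names, and joining alias and path with a "/" is an assumption), the core could expand the alias once and only fall back to the retry loop for sources that lack the optional method:

```python
from typing import Dict, Iterator, List, Union

# Hypothetical alias table shaped like the project.conf example quoted above:
# an alias maps either to a single URL or to an ordered list of URLs.
Aliases = Dict[str, Union[str, List[str]]]


class MirrorUnreachable(Exception):
    """Hypothetical error raised when a URL cannot be accessed."""


def expand_alias(url: str, aliases: Aliases) -> Iterator[str]:
    """Yield every concrete URL for an aliased URL like 'github:foo/bar.git'."""
    alias, sep, rest = url.partition(":")
    if not sep or alias not in aliases:
        yield url  # not aliased: pass through unchanged
        return
    bases = aliases[alias]
    if isinstance(bases, str):
        bases = [bases]
    for base in bases:
        # Assumption: alias and remainder are joined with exactly one "/".
        yield base.rstrip("/") + "/" + rest.lstrip("/")


def fetch_source(source, url: str, aliases: Aliases) -> str:
    """Fetch via the source's native multi-URL method if it has one,
    otherwise try each candidate URL in list order until one works."""
    candidates = list(expand_alias(url, aliases))

    # Hypothetical optional method: sources that understand several URLs
    # natively get the whole list in a single call.
    native = getattr(source, "fetch_from_urls", None)
    if callable(native):
        return native(candidates)

    # Fallback for sources that only know how to fetch a single URL.
    errors = []
    for candidate in candidates:
        try:
            source.fetch(candidate)
            return candidate
        except MirrorUnreachable as exc:
            errors.append(f"{candidate}: {exc}")
    raise MirrorUnreachable("; ".join(errors))
```

With the aliases from the quoted example, `expand_alias("github:foo/bar.git", aliases)` would yield the mirrorsrv URL first and the github.com URL second, so the retry cost is only paid by plugins that have not grown the optional method.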

> Known issues:
> * Are we likely to see a URL that uses multiple repo aliases?

That is the easiest way to set up mirroring, no?  That is, have a stable URL in the .bst files and use aliases to introduce mirrors?  That even allows use of mirrors in the .bst files themselves, by simply providing multiple URLs.
 
> * We are likely to see one mirror alias per type of source.
>    Users who mix many kinds of source with multiple mirrors will have a 
> lot of boilerplate configuration.

Not sure I understand this part. 
 
> Does anyone have a better idea of what we could do?

Your proposal does not actually *require* that Sources learn anything
new as far as I can see. You make the assumption that, in one
BuildStream session, multiple mirrors will be tried by Sources; this
*might* make sense but is not necessarily the point.

  A.) Which mirror to use can be something explicit

  B.) At startup time, before loading Sources and resolving aliases;
      mirror URLs could be contacted and scanned.

      1.) We could at least check if the domain is reachable by the
          given URI scheme

      2.) We could possibly even choose the mirror with the lowest
          latency

Are you suggesting doing this at every run?  
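
For a sense of what B.1 and B.2 would cost per run, here is a minimal sketch, assuming that "reachable" just means a TCP connection to the URL's host succeeds. `tcp_latency`, `pick_mirror` and the `DEFAULT_PORTS` table are hypothetical names, nothing that exists in BuildStream today:

```python
import socket
import time
from typing import Callable, Iterable, Optional
from urllib.parse import urlsplit

# Assumed default ports for URI schemes a mirror might use.
DEFAULT_PORTS = {"http": 80, "https": 443, "git": 9418, "ssh": 22}


def tcp_latency(url: str, timeout: float = 2.0) -> Optional[float]:
    """Return seconds taken to open a TCP connection to the URL's host,
    or None if the host cannot be reached for that scheme."""
    parts = urlsplit(url)
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    if not parts.hostname or port is None:
        return None
    start = time.monotonic()
    try:
        with socket.create_connection((parts.hostname, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        # Covers DNS failures, refused connections and timeouts.
        return None


def pick_mirror(
    urls: Iterable[str],
    probe: Callable[[str], Optional[float]] = tcp_latency,
) -> Optional[str]:
    """Return the reachable URL with the lowest latency, or None.
    On a tie, the URL listed first wins."""
    best_url, best_latency = None, None
    for url in urls:
        latency = probe(url)
        if latency is None:
            continue  # unreachable mirror, skip it
        if best_latency is None or latency < best_latency:
            best_url, best_latency = url, latency
    return best_url
```

That is one connection attempt per mirror (bounded by the timeout) before any Source is loaded, which is why doing it on every run deserves a second thought; caching the result between runs would be one way around it.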

C.) Sources could be given a set of URLs instead of one, and have a specific implementation based on kind.  The behaviour could be configurable (e.g. for git): fetch from all URLs at once, or walk the list in order, check whether the ref is present, and fetch from the first URL that has it.  When tracking, the tracking branch might be present in several of the URLs _yet_ point to different commits in each.  We need to define which one is preferred here (list order?).
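
A small sketch of the "list order wins" reading of C, purely illustrative: `first_with_ref` and `resolve_tracking` are made-up helpers, and the `has_ref` / `branch_tip` callbacks stand in for whatever each source kind would really do to query a remote.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def first_with_ref(
    urls: Sequence[str],
    ref: str,
    has_ref: Callable[[str, str], bool],
) -> Optional[str]:
    """Fetch case: return the first URL, in list order, that has the ref."""
    for url in urls:
        if has_ref(url, ref):
            return url
    return None


def resolve_tracking(
    urls: Sequence[str],
    branch: str,
    branch_tip: Callable[[str, str], Optional[str]],
) -> Optional[Tuple[str, str, List[str]]]:
    """Track case: prefer the first URL that has the branch.

    Returns (url, ref, conflicts), where conflicts lists the later URLs
    whose tip of the same branch differs from the chosen ref, or None if
    no URL has the branch at all.
    """
    present = []
    for url in urls:
        tip = branch_tip(url, branch)
        if tip is not None:
            present.append((url, tip))
    if not present:
        return None
    url, ref = present[0]
    conflicts = [other for other, tip in present[1:] if tip != ref]
    return url, ref, conflicts
```

For git, `branch_tip` could plausibly be built on `git ls-remote`; reporting the conflicts rather than silently picking one at least makes a stale mirror visible to the user.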

All in all I feel that there is a lot of opportunity for scope creep here.
I think I'm agreeing with Tristan here that we should currently not consider doing the actual source mirroring, but focus on the configuration and use of source mirrors.

That said, I'd like to wait for Jurg to complete the proposal on Remote Execution and Content Addressable Storage based cache.  We may be able to get close to "one size fits all" source caching as a side effect from that.  That is orthogonal though.
 

Cheers,
    -Tristan

Cheers,

Sander
 
_______________________________________________
Buildstream-list mailing list
Buildstream-list gnome org
https://mail.gnome.org/mailman/listinfo/buildstream-list

