Re: Discussion on source mirroring



Thanks for the writeup!

On Thu, 2018-03-15 at 16:30 +0000, Jonathan Maw wrote:
I've been giving some thought to source mirroring recently, after 
reading the discussion at 
https://gitlab.com/BuildStream/buildstream/issues/179.

Source mirroring will be valuable for us because:
* The canonical upstream may disappear without warning
* The canonical upstream may be slow to access due to limited 
infrastructure or geographical distance.

* The organization may be mirroring everything in a local build farm
  - To be sure that their builds are repeatable in 10 years
  - To optimize fetch times on build machines
  - Without losing the information of what the original URL was


I briefly considered whether it's possible to do a "one size fits all" 
source mirror, but I don't think it's doable.
Each source is permitted to store files in the local source cache in 
whatever format they feel is appropriate - as a result, merging the 
remote and the local cache is dependent on which methods are suitable 
for each kind of source.

I'm not sure what "one size fits all" means here, so for the purpose
of this email, I'm going to interpret it as:

  "A solution which requires only that a project can provide a list
   of mirrors, expect that BuildStream has populated them, and walk
   the mirrors, regardless of what Source plugin type is in use"

With current Sources in place, we can /mostly/ achieve this with a
local source cache hosted on Apache, such that implementing a mirror is
a matter of running `bst fetch` (or a similar new command) on a mirror
host periodically, on a list of projects.

I personally would like to see:

* A possibility to teach Sources some new tricks for this
  - May include caching differently, to ensure hosting over HTTP(s)
    works
  - Must include a new method for *pulling everything*, not just
    the requested tracking branch or ref

* A configuration setup which would allow getting sources from
  alternative URLs

* Hopefully some comprehensive configuration API, with the expectation
  that both solutions are a possibility.

I would like the second approach, more similar to what you are
proposing here, to be implemented first; without *requiring* an eventual
enhancement, but with forward thinking so that a one size fits all
mirroring solution is not *prohibited* in the future.

Basically, something similar to what you propose is more useful more
quickly, but let's not shut the door on an eventual, more complete
solution. Let's ensure that we can do that as another step in the
future.

Note that your proposed implementation path already teaches a lot to
Sources, so we already cross a line where:

  "Projects which use mirrors, require that the Source plugins they
   use understand how to use mirrors"

If we must cross that line anyway, there is no reason to declare that a
one size fits all solution is "not doable".

Since we will have to do separate work for each source, we have the 
opportunity to make fetching use the same protocol as we use for 
fetching sources normally, so I suggest the following:

In the project.conf, the aliases dict can key to a list of URLs instead 
of just a single URL, e.g.

aliases:
  github:
  - https://mirrorsrv.example.com/github
  - https://github.com
  sourceforge: http://downloads.sourceforge.net
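
One way to handle the mixed form above (a single URL or a list of URLs
per alias) is to normalize every value to a list when the configuration
is loaded. This is a minimal Python sketch, assuming the parsed
project.conf arrives as a plain dict; normalize_aliases is a
hypothetical helper, not an existing BuildStream function.

```python
from typing import Dict, List, Union

# An alias may map to one URL or to an ordered list of URLs.
AliasValue = Union[str, List[str]]


def normalize_aliases(aliases: Dict[str, AliasValue]) -> Dict[str, List[str]]:
    """Return a copy in which every alias maps to a list of base URLs.

    Hypothetical helper: the list order is preserved, so the first
    entry is the first candidate to try.
    """
    normalized: Dict[str, List[str]] = {}
    for name, value in aliases.items():
        if isinstance(value, str):
            normalized[name] = [value]
        else:
            normalized[name] = list(value)
    return normalized


# The example configuration from above, as it might look once parsed.
example = {
    "github": [
        "https://mirrorsrv.example.com/github",
        "https://github.com",
    ],
    "sourceforge": "http://downloads.sourceforge.net",
}
normalized_example = normalize_aliases(example)
```

With this, code that resolves aliases only ever has to deal with lists,
and existing single-URL configurations keep working unchanged.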

The implementation of being able to fetch from multiple sources is not 
trivial, however.
At its simplest, we update all sources' fetch and track methods to use 
multiple repo aliases.

To reduce the amount of complexity that we expect plugin authors to 
write, we might do one of the following:
* Create a method that takes an aliased URL and yields every URL it can 
generate from the aliases it knows.
* Where we currently call fetch and track, iterate over every possible 
URL and keep calling fetch/track as long as they return an appropriate 
return value / exception to indicate that it failed because it couldn't 
access that URL.
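
The two options above could be sketched together roughly as follows.
This is only an illustration under stated assumptions: expand_urls,
fetch_with_fallback and MirrorUnreachable are hypothetical names, the
alias is assumed to be the text before the first ":" of the aliased
URL, the base URL and the remainder are assumed to be joined with a
single "/", and the fetch callable stands in for whatever a Source
plugin actually does.

```python
from typing import Callable, Dict, Iterator, List


class MirrorUnreachable(Exception):
    """Hypothetical error raised when a URL could not be accessed."""


def expand_urls(aliased_url: str, aliases: Dict[str, List[str]]) -> Iterator[str]:
    """Yield every concrete URL the known aliases can generate."""
    alias, sep, rest = aliased_url.partition(":")
    if not sep or alias not in aliases:
        # Not an aliased URL (or an unknown alias): pass it through.
        yield aliased_url
        return
    for base in aliases[alias]:
        # Assumption: join base and remainder with exactly one slash.
        yield base.rstrip("/") + "/" + rest.lstrip("/")


def fetch_with_fallback(
    aliased_url: str,
    aliases: Dict[str, List[str]],
    fetch: Callable[[str], None],
) -> str:
    """Call fetch on each candidate URL until one succeeds.

    Only MirrorUnreachable moves on to the next candidate; any other
    exception propagates, since it is not an access failure.
    """
    last_error = None
    for url in expand_urls(aliased_url, aliases):
        try:
            fetch(url)
            return url
        except MirrorUnreachable as exc:
            last_error = exc
    raise MirrorUnreachable("no reachable URL for " + aliased_url) from last_error


# Usage with a stand-in fetch that pretends the mirror is down.
demo_aliases = {
    "github": ["https://mirrorsrv.example.com/github", "https://github.com"],
}


def flaky_fetch(url: str) -> None:
    if "mirrorsrv" in url:
        raise MirrorUnreachable(url)


used_url = fetch_with_fallback("github:foo/bar.git", demo_aliases, flaky_fetch)
```

Keeping the iteration outside the plugin means a plugin author only has
to signal "could not access this URL" in a distinguishable way.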

Known issues:
* Are we likely to see a URL that uses multiple repo aliases?
* We are likely to see one mirror alias per type of source.
  Users who mix many kinds of source with multiple mirrors will have a 
  lot of boilerplate configuration.

Does anyone have a better idea of what we could do?

Your proposal does not actually *require* that Sources learn anything
new as far as I can see. You make the assumption that in one
BuildStream session, multiple mirrors will be tried by Sources; this
*might* make sense but is not necessarily the point.

  A.) Which mirror to use can be something explicit

  B.) At startup time, before loading Sources and resolving aliases;
      mirror URLs could be contacted and scanned.

      1.) We could at least check if the domain is reachable by the
          given URI scheme

      2.) We could possibly even choose the mirror with the lowest
          latency
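
Option B could look something like the sketch below, assuming that a
plain TCP connect to the mirror's host is an acceptable reachability
and latency probe. tcp_latency and pick_mirror are hypothetical names,
and nothing here reflects existing BuildStream code.

```python
import socket
import time
from typing import Callable, Iterable, Optional
from urllib.parse import urlsplit

# Assumption: only these schemes have a default port worth probing.
DEFAULT_PORTS = {"http": 80, "https": 443}


def tcp_latency(url: str, timeout: float = 2.0) -> Optional[float]:
    """Return seconds taken to open a TCP connection, or None if unreachable."""
    parts = urlsplit(url)
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    if parts.hostname is None or port is None:
        return None
    start = time.monotonic()
    try:
        with socket.create_connection((parts.hostname, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        # Covers DNS failures, refused connections and timeouts.
        return None


def pick_mirror(
    urls: Iterable[str],
    probe: Callable[[str], Optional[float]] = tcp_latency,
) -> Optional[str]:
    """Return the reachable URL with the lowest probe latency, if any."""
    best_url, best_latency = None, None
    for url in urls:
        latency = probe(url)
        if latency is None:
            continue  # 1.) skip mirrors that are not reachable
        if best_latency is None or latency < best_latency:
            best_url, best_latency = url, latency  # 2.) keep the fastest
    return best_url


# Usage with canned latencies instead of real network probes.
canned = {
    "https://mirrorsrv.example.com/github": None,  # unreachable
    "https://github.com": 0.12,
}
chosen = pick_mirror(canned, probe=canned.get)
```

The probe is passed in as a parameter so the selection logic can be
exercised without touching the network, and so a different check (for
example an HTTP request) could be substituted later.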

Cheers,
    -Tristan


