Re: Discussion on source mirroring (with counter proposal)



Hi Tristan,

On Mon, Mar 19, 2018 at 6:47 AM Tristan Van Berkom <tristan vanberkom codethink co uk> wrote:

[...]
What are we trying to achieve ?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Before talking about how to achieve this, I want to pause and think
about what exactly we want to achieve - I feel that we are not all on
the same page about this.

I can see two separate ideas of what "mirror" means here:

  A.) Because a specific third party server proves to be unreliable,
      we want to be able to have a fallback for that server, which we
      can roll over to in the case that the upstream doesn't work.

      So this is a quick fix / bandaid for a specific pain point that a
      given organization experiences while using BuildStream; this
      allows us to have a tarball server for unreliable tarballs, or
      roll over to a GitHub mirror for an unreliable upstream GitHub.

  B.) For the same reasons, an organization may just never want to
      experience unreliable access to source code ever again.

      Hosting all sources which their BuildStream projects require on
      a single server, or even mirrored in some strategically placed
      locations (so that one can choose a mirror that is
      geographically closer when building), comes at a relatively low
      cost.

      Instead of many points of failure on various servers scattered
      across the globe, a single point of failure *that is under the
      control of the organization in question*, is much more desirable.


While a solution along the lines of (A) can improve things in the short
term, I feel this is just a bandaid and overall, we're living with the
same problem - e.g. a known fallback mirror for an upstream may one day
also prove to be unreliable. Instead, an organization which wants
reliability will eventually move towards (B), and decide to host a
centralized mirror themselves.


I have never really given much thought to the (A) use case, and I
clearly prefer a solution along the lines of (B).

While (A) and (B) are not entirely mutually exclusive (i.e., one could
achieve something like (B) using a solution designed for (A)), I worry
that (A) adds unnecessary complexity, when the goal should ultimately
be (B).

Hold that thought on being able to achieve B when providing A.
 
Unnecessary Complexity
~~~~~~~~~~~~~~~~~~~~~~
The unnecessary complexity I'm talking about is specifically:

  * We need to try multiple servers in a single session, in some way
    or another, this could be:

    - Teaching Sources to do it themselves, as Jonathan proposes

    - Having the core reconstruct and re-instantiate Source objects
      for each alias that they use, when one fails

    - Having the core contact multiple servers at startup time in
      order to choose which mirror is preferable

    Frankly, any of the above is quite undesirably complex.

  * The configuration API is complex and burdensome to the user. If we
    essentially want to achieve (B) *anyway*, why do I have to list
    fallback mirrors for each and every source alias separately ?



Counter Proposal
~~~~~~~~~~~~~~~~
I have not been clear on the list about what my vision for this is, so
let me lay out this counter proposal which I think is both easier to
implement, and also a more robust solution along the lines of (B) as
expressed above.


   New Source.mirror() API
   ~~~~~~~~~~~~~~~~~~~~~~~
   For most Source implementations, this is exactly the same as what
   they are doing already in Source.track() or Source.fetch(), but with
   some different guarantees:

     - Guarantee that *everything* is mirrored for the given source,
       regardless of tracking branch or ref.

       This means shortcuts like shallow clones and such are just
       not allowed, and every time Source.mirror() is called, it should
       attempt to get the latest of everything.

     - The local source cache is built in such a way that it is
       reliable for downloading from another location.

       This means that we need an alternative code path for tarball and
       zip (internally `_downloadablefilesource.py`), such that the
       original filename is retained, and the file is not locally
       renamed to be a sha256sum filename instead.
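
The second guarantee can be illustrated with a minimal sketch in Python.
The function names below are hypothetical and are not taken from
`_downloadablefilesource.py`; the content-addressed naming is only assumed
to resemble the sha256sum renaming mentioned above.

```python
import hashlib
import posixpath
from urllib.parse import urlparse


def cache_filename(data: bytes) -> str:
    """Content-addressed name (assumed scheme): the sha256 hex digest."""
    return hashlib.sha256(data).hexdigest()


def mirror_filename(url: str) -> str:
    """Keep the upstream basename so a mirror can serve the file
    under the same name that the original URL used."""
    name = posixpath.basename(urlparse(url).path)
    if not name:
        raise ValueError(f"URL has no file component: {url}")
    return name
```

With the first naming, a plain HTTP server exposing the cache could not
answer a request for the original filename; with the second it could.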

Is this a required method for every Source to implement?  If not, what happens if the Source does not implement it?

   New `bst mirror` command
   ~~~~~~~~~~~~~~~~~~~~~~~~
   This works much like `bst fetch` or `bst track`, but calls the new
   Source.mirror() method instead.

   One exception is that in this mode there should not be a TARGET
   argument, instead all bst files in the project should be loaded.


   Single mirroring configuration API
   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   In project.conf we just provide a URL to the mirror, which will be
   used *instead* of the upstreams listed in project data.

   If we support multiple mirror URLs in project.conf, then a session
   can scan them one time and choose the optimal mirror.

   If we support user configuration overrides, then we expect the
   project maintainers to communicate the available mirrors to their
   developers or whomever builds that project, such that the user can
   just choose the mirror closest to them.
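
Scanning several mirror URLs once per session could look roughly like the
Python sketch below. It assumes "optimal" means "lowest response time",
which the proposal does not specify, and `choose_mirror` and `probe` are
hypothetical names, not BuildStream API.

```python
from typing import Callable, Iterable, Optional


def choose_mirror(mirrors: Iterable[str],
                  probe: Callable[[str], Optional[float]]) -> Optional[str]:
    """Return the mirror with the lowest probe time, or None if none respond.

    `probe` returns a response time in seconds, or None when the
    mirror is unreachable.
    """
    best_url, best_time = None, None
    for url in mirrors:
        elapsed = probe(url)
        if elapsed is None:
            continue  # unreachable mirror, skip it
        if best_time is None or elapsed < best_time:
            best_url, best_time = url, elapsed
    return best_url
```

Passing the probe in as a function keeps the selection logic separate from
however the session would actually contact the servers.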

How do you treat partial mirrors?  If I am dealing with multiple projects whose sources are potentially subsets of one another, what happens?
 
   Alternative implementation of Source.translate_url()
   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   Here the core currently simply expands an alias.

   In the case that we are building/fetching things, and there is a
   configured mirror, we have Source.translate_url() point to the
   mirror instead.

   This part might be a *little* tricky, but is certainly
   straightforward, given that:

     - BuildStream has knowledge of its own source cache layout,
       i.e.: ${XDG_CACHE_HOME}/buildstream/sources/${source_kind}

     - Sources themselves have knowledge of how things are cached
       inside their dedicated cache directories.

   Resolving the correct URL here is easy.
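
As a rough sketch of the idea, in Python: without a mirror the alias is
simply expanded, and with a mirror the URL is pointed at the mirror
instead. The standalone function, its parameters, and especially the
`<mirror>/<source_kind>/<alias>/<path>` layout are assumptions made for
illustration; the real layout would follow from how each Source caches
things.

```python
from typing import Mapping, Optional


def translate_url(url: str, aliases: Mapping[str, str],
                  source_kind: str, mirror: Optional[str] = None) -> str:
    """Expand an aliased URL, or redirect it to a configured mirror."""
    alias, sep, rest = url.partition(":")
    if not sep or alias not in aliases:
        return url  # not an aliased URL, leave it unchanged
    if mirror is not None:
        # Hypothetical mirror layout: <mirror>/<source_kind>/<alias>/<path>
        return f"{mirror.rstrip('/')}/{source_kind}/{alias}/{rest.lstrip('/')}"
    return aliases[alias] + rest
```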


   Setting up a mirror server
   ~~~~~~~~~~~~~~~~~~~~~~~~~~
   To set up a mirror server, one needs to have some knowledge of
   what things they are hosting. The process for setting up a mirror
   runs mostly like this:

     o Configure BuildStream to have its source cache in a location
       on the server for hosting.

     o Configure access to the ${source_kind} specific subdirectories
       for the URI schemes which need to be supported.

       I.e. for tarball and zip, just an HTTP(S) server is enough.

       For git, you might support only HTTP(S) access, but you
       may also want to have support for "git://..." URI schemes.

     o Configure your mirror server to do the following:

       - Periodically call `bst mirror` for the latest version of your
         project.

       - For more robust mirroring, you may want to go so far as to
         have a mirror session "triggered" by a commit to the git repo
         which is hosting your BuildStream project. This is just to
         ensure that you *never* miss a beat.

I still feel we are overstepping scope.  The issue of source persistence is not unique to BuildStream.  Organizations may already have solutions for this in place, which they would like to continue to leverage.  What type of solution is in place depends on the Source types; git and subversion are different beasts than, say, a package repository.  They may exhibit different scalability characteristics as well.
Requiring that mirrors be created and managed by BuildStream dismisses those solutions, or at least complicates their use.
 
Properties of the counter proposal
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
My counter proposal, while being a bit of code that needs writing, is
less complex: it does not imply the "Unnecessary Complexity" drawbacks
which I have highlighted above.

The code involved in this solution perhaps involves some boilerplate,
but is actually *easier to write* and very straightforward.

By forcing a mirror to be a single location, there are fewer points of
failure, and the one point of failure is under the control of the
people who maintain the said BuildStream project.

What this proposal does *not* do, however, is add any possibility for
the bandaids described as (A) above. It does provide a practical
solution for (B) - users who want (A) should be satisfied by (B) as
well - but the opposite is not entirely true.


I would very much like to hear feedback on this, particularly I would
like to know if I've missed something about the (A) approach which is
absolutely needed even in the presence of a (B) solution, and/or if it
is more desirable/necessary to have sessions try multiple URLs for the
same source in the same session - or, anything else I may have missed.

I think the project focus of the mirror is going to make the setup more complicated for multiple projects, as you now need to start creating a composed single project to ensure your mirroring needs are covered.

I further think that not being able to reuse existing mirror[ing solutions] is a negative.

Cheers,

Sander



Regards,
    -Tristan


