Re: Discussion on source mirroring (with counter proposal)



Hi Sander,

First I have responded to your replies; below that I have some
extensions to this proposal.

On Mon, 2018-03-19 at 14:03 +0000, Sander Striker wrote:
Hi Tristan,

On Mon, Mar 19, 2018 at 6:47 AM Tristan Van Berkom <tristan vanberkom codethink co uk> wrote:

[...]
What are we trying to achieve ?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Before talking about how to achieve this, I want to pause and think
about what exactly we want to achieve - I feel that we are not all on
the same page about this.

I can see two separate ideas of what "mirror" means here:

  A.) Because a specific third party server proves to be unreliable,
      we want to be able to have a fallback for that server, which we
      can rollover to in the case that the upstream doesn't work.

      So this is a quick fix / bandaid for a specific pain point that a
      given organization experiences while using BuildStream; this
      allows us to have a tarball server for unreliable tarballs, or
      rollover to a github mirror for an unreliable upstream github.

  B.) For the same reasons, an organization may just never want to
      experience unreliable access to source code ever again.

      The cost of hosting all sources which their BuildStream projects
      require on a single server; or even mirrored in some
      strategically placed locations (so that one can choose a mirror
      that is geographically closer when building), is a relatively low
      cost.

      Instead of many points of failure on various servers scattered
      across the globe, a single point of failure *that is under the
      control of the organization in question*, is much more desirable.


While a solution along the lines of (A) can improve things in the short
term, I feel this is just a bandaid and overall, we're living with the
same problem - e.g. a known fallback mirror for an upstream may one day
also prove to be unreliable. Instead, an organization which wants
reliability will eventually move towards (B), and decide to host a
centralized mirror themselves.


I have never really given much thought to the (A) use case, and I
clearly prefer a solution along the lines of (B).

While (A) and (B) are not entirely mutually exclusive (i.e., one could
achieve something like (B) using a solution designed for (A)), I worry
that (A) adds unnecessary complexity, when the goal should ultimately
be (B).

Hold that thought on being able to achieve B when providing A.

Yes, while also bearing in mind that the goal is ultimately B.

 
Unnecessary Complexity
~~~~~~~~~~~~~~~~~~~~~~
The unnecessary complexity I'm talking about is specifically:

  * We need to try multiple servers in a single session, in some way
    or another, this could be:

    - Teaching Sources to do it themselves, as Jonathan proposes

    - Having the core reconstruct and re-instantiate Source objects
      for each alias that they use, when one fails

    - Having the core contact multiple servers at startup time in
      order to choose which mirror is preferable

    Frankly, any of the above is quite undesirably complex.

  * Configuration API is complex and burdensome to the user, if we
    essentially want to achieve (B) *anyway*, why do I have to list
    fallback mirrors for each and every source alias separately ?



Counter Proposal
~~~~~~~~~~~~~~~~
I have not been clear on the list about what my vision for this is, so
let me lay out this counter proposal which I think is both easier to
implement, and also a more robust solution along the lines of the above
expressed (B).


   New Source.mirror() API
   ~~~~~~~~~~~~~~~~~~~~~~~
   For most Source implementations, this is exactly the same as what
   they are doing already in Source.track() or Source.fetch(), but with
   some different guarantees:

     - Guarantee that *everything* is mirrored for the given source,
       regardless of tracking branch or ref.

       This means shortcuts like shallow clones and such are just
       not allowed, and every time Source.mirror() is called, it should
       attempt to get the latest of everything.

     - The local source cache is built in such a way that it is
       reliable for downloading from another location.

       This means that we need an alternative code path for tarball and
       zip (internally `_downloadablefilesource.py`), such that the
       original filename is retained, and the file is not locally
       renamed to be a sha256sum filename instead.
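
As a rough sketch of that second guarantee (this is not actual
BuildStream code; the helper names and the host/path layout under the
cache directory are assumptions made purely for illustration), the
difference between the two code paths could look like this:

```python
import hashlib
import posixpath
from urllib.parse import urlparse


def fetch_cache_path(cache_dir, data):
    # Behaviour described above for fetch: the downloaded file is
    # stored under a name derived from its sha256sum.
    return posixpath.join(cache_dir, hashlib.sha256(data).hexdigest())


def mirror_cache_path(cache_dir, url):
    # Hypothetical mirror code path: keep the upstream host, path and
    # original filename, so the cache directory can be served as-is.
    parsed = urlparse(url)
    return posixpath.join(cache_dir, parsed.netloc, parsed.path.lstrip("/"))
```

With this layout a plain HTTP(s) server pointed at the cache directory
would expose each file under its original name.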

Is this a required method for every Source to implement?  If not, what happens if the Source does not 
implement it?

We have strategies for extending plugin API gracefully; what we do with
plugins which lack support for a given feature must be handled on a
case by case basis.

For mirrors, we can make this a warning which would appear in the
session logs when `bst mirror` is run.

That said, the effort is really not that much for the sources which
exist, and newly written sources should really just implement it.
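
To illustrate the warning approach, here is a minimal sketch. The
classes below are stand-ins rather than the real plugin base class,
and mirror_all() and its warn callback are hypothetical names; the
idea is only that the core detects a plugin which does not override
mirror() and logs a warning instead of failing the session:

```python
class Source:
    """Stand-in for the plugin base class (not the real BuildStream API)."""
    kind = "base"

    def mirror(self):
        raise NotImplementedError("mirroring is not supported by this source")


class TarSource(Source):
    kind = "tar"

    def mirror(self):
        return "mirrored"


class LegacySource(Source):
    # A plugin written before mirror() existed: it does not override it.
    kind = "legacy"


def mirror_all(sources, warn):
    """Mirror every source that supports it, warning about the rest."""
    mirrored = []
    for source in sources:
        if type(source).mirror is Source.mirror:
            # The plugin inherits the default, so it cannot be mirrored.
            warn("{}: source kind does not support mirroring".format(source.kind))
            continue
        source.mirror()
        mirrored.append(source.kind)
    return mirrored
```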

   New `bst mirror` command
   ~~~~~~~~~~~~~~~~~~~~~~~~
   This works much like `bst fetch` or `bst track`, but calls the new
   Source.mirror() method instead.

   One exception is that in this mode there should not be a TARGET
   argument, instead all bst files in the project should be loaded.


   Single mirroring configuration API
   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   In project.conf we just provide a URL to the mirror, which will be
   used *instead* of the upstreams listed in project data.

   If we support multiple mirror URLs in project.conf, then a session
   can scan them one time and choose the most optimal mirror.

   If we support user configuration overrides, then we expect the
   project maintainers to communicate the available mirrors to their
   developers or whomever builds that project, such that the user can
   just choose the mirror closest to them.

How do you treat partial mirrors?  If I am dealing with multiple projects with them potentially having a 
subset of each other, what happens?

By default, the expectation should be that project related resources
remain self contained. Similar to artifact caches, cross junction
sources are looked up based on the configuration of their respective
projects.

Also similar to artifact caches, it can make sense for a project to
make a recursive decision which is considered when that project is
toplevel.

If the projects in question are maintained by the same organization,
this recursive override is rather unneeded, because both projects can
be setup with the same mirrors - but the override is nice to have when
one organization is using junctions with projects maintained by
another.

In any of these cases, if the same source is referred to in multiple
places (e.g. linux kernel builds vs the api headers), these will
naturally be mirrored to the same location.


   Alternative implementation of Source.translate_url()
   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   Here the core currently simply expands an alias.

   In the case that we are building/fetching things, and there is a
   configured mirror, we have Source.translate_url() point to the
   mirror instead.

   This part might be a *little* tricky, but is certainly
   straightforward, given that:

     - BuildStream has knowledge of its own source cache layout,
       i.e.: ${XDG_CACHE_HOME}/buildstream/sources/${source_kind}

     - Sources themselves have knowledge of how things are cached
       inside their dedicated cache directories.

   Resolving the correct URL here is easy.
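
A minimal sketch of what that resolution might look like follows. The
function signature and the <mirror>/<source_kind>/<alias>/<rest> layout
are assumptions for illustration only; in practice each Source kind
would decide how its own cache directory maps to a URL:

```python
def translate_url(url, aliases, source_kind, mirror=None):
    """Expand an aliased URL, pointing at the mirror when one is configured."""
    alias, sep, rest = url.partition(":")
    if not sep or alias not in aliases:
        # Not an aliased URL (e.g. already fully qualified): leave it alone.
        return url
    if mirror is not None:
        # Assumed mirror layout, mirroring the per-kind source cache
        # directories described above.
        return "{}/{}/{}/{}".format(mirror.rstrip("/"), source_kind, alias, rest)
    # No mirror configured: plain alias expansion, as the core does today.
    return aliases[alias] + rest
```

For example, with an alias "gnome" and a mirror at
https://mirror.example.com/sources, "gnome:glib.git" for a git source
would resolve to https://mirror.example.com/sources/git/gnome/glib.git
under this assumed layout.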


   Setting up a mirror server
   ~~~~~~~~~~~~~~~~~~~~~~~~~~
   To setup a mirror server, one needs to have some knowledge of
   what things they are hosting, the process for setting up a mirror
   runs mostly like this:

     o Configure BuildStream to have its source cache in a location
       on the server for hosting.

     o Configure access to the ${source_kind} specific subdirectories
       for the URI schemes which need to be supported.

       I.e. for tarball and zip, just an HTTP(s) server is enough.

       For git, you might only support HTTP(s) access, but you
       may also want to have support for "git://..." URI schemes.

     o Configure your mirror server to do the following:

       - Periodically call `bst mirror` for the latest version of your
         project.

       - For more robust mirroring, you may want to go so far as to
         have a mirror session "triggered" by a commit to the git repo
         which is hosting your BuildStream project. This is just to
         ensure that you *never* miss a beat.

I still feel we are overstepping scope.

If this was going to be difficult to achieve, I would agree.

However given that:

  o We already have *most* of the code in place to achieve this
    with existing Source plugins, and the remaining gaps are quite
    easy to fill.

  o The solution we can offer for mirroring, is *much* easier to
    setup, it does not require any complex configuration.

  o The overall experience we have the opportunity to offer is
    more practical than configuring mirroring solutions separately,
    there is less pain involved with this approach on an ongoing basis.

    This is because the workload of the mirror server is directly
    driven by the project data which depends on it, so you don't need
    to ever remember to start or stop mirroring some source code
    separately, when your project starts depending on a new source,
    or stops using an old source.

It seems to me that this does not qualify as scope creep, rather: we
are best situated to provide a solution to this, and at a low cost.


Note that the last point listed above is actually a very desirable
property, looking back at trove / baserock:

  o One had to first get approval to land a patch to the mirroring
    service before one could even consider upstreaming a patch to the
    build metadata. In some cases, you needed the mirror to first
    do its thing before you could even test your build metadata
    patches locally.
    
  o The mirroring service grew and grew and grew, because people
    cannot and should not be relied upon to separately *stop*
    mirroring things when those were no longer in use.

    This became a pain point for us over time, not so much in disk
    space, but in processing time for mirroring runs.


With a project driven mirroring solution, you never delete stuff that
is no longer in use, which means you can still reliably reproduce
builds throughout your project's history. But you stop continuously
mirroring old things which are no longer in use.


The issue of source persistence is not unique to BuildStream.
Organizations may already have solutions for this in place, which
they would like to continue to leverage.  What type of solution is in
place is dependent on Source types; git and subversion are different
beasts than say a package repository.  They may exhibit different
scalability characteristics as well.
Requiring that mirrors be BuildStream created/managed mirrors
dismisses those solutions.  Or at least complicates their use.

I think the scope creep arguments and interoperability arguments are
separate topics.

I have expressed strong opinions regarding the former, the latter
however is not falling on deaf ears... more below on this topic.

 
Properties of the counter proposal
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
My counter proposal, while being a bit of code that needs writing, is
less complex; it does not imply the "Unnecessary Complexity" drawbacks
which I have highlighted above.

The code involved in this solution perhaps involves some boilerplate,
but is actually *easier to write* and very straightforward.

By forcing a mirror to be a single location, there are fewer points of
failure, and the one point of failure is under the control of the
people who maintain the said BuildStream project.

What this proposal does *not* do, however, is add any possibility for
the bandaids described as (A) above; it does provide a
practical solution for (B) - users who want (A) should be satisfied by
(B) as well - but the opposite is not entirely true.


I would very much like to hear feedback on this, particularly I would
like to know if I've missed something about the (A) approach which is
absolutely needed even in the presence of a (B) solution, and/or if it
is more desirable/necessary to have sessions try multiple URLs for the
same source in the same session - or, anything else I may have missed.

I think the project focus of the mirror is going to make the setup
more complicated for multiple projects, as you now need to start
creating a composed single project to ensure your mirroring needs are
covered.

This is untrue, see my response above to your question about partial
mirrors.


I further think that not being able to reuse existing mirror[ing solutions] is a negative.

Right, this is a negative.

On the one hand, I see a low hanging fruit opportunity to provide a
solution that has no complex configuration, "just works out of the box"
and requires very low, almost no continuous maintenance effort. On the
other hand, I do not want to close the door on interoperability with
third party solutions, should people desire this.

If one desires to use another solution for mirroring however, they will
own the additional pain of configuring that mirroring service
separately from project data (or they can develop some scripts to use
some `bst show` features to inform their external mirroring service
automatically, if they want project driven mirroring) - and will own
the overhead of requiring some more complex configuration data to
inform BuildStream how to resolve URLs.

For completeness, I'll extend my proposal above with an outline here of
how I think interoperability should work:

  o In project configuration, one can define one or more "mirrors".

  o The first defined "mirror" is the default mirror, however a user
    can choose to use a specific project defined mirror, allowing the
    user to select a closer / faster mirror.

  o Only one mirror is ever used in a given session, regardless of
    session type; including `bst mirror` sessions.

  o A mirror defines a mapping of how aliased URLs used in project
    data get resolved to fully qualified URLs.

    - If possible, generalizations can be made here such that
      a mapping can be defined for a given "source kind" (plugin).
    - Otherwise, a simple alternative mapping of aliases can be used.

  o There is no restriction that a given alias / source kind be
    provided from the same domain.

    This is important to note, because it is a non-intuitive
    usage of the word "mirror".

  o A mirror may be allowed to have "gaps" and be incomplete, in which
    case project default aliases are used.

  o A mirror may optionally override how sources are obtained in
    subprojects - providing an "opt in" possibility of overriding
    a subproject's mirrors.

  o A mirror can be defined as a simple base URL, or a simplified
    dictionary.

    Following a repeating pattern in BuildStream YAML, some entities
    can be defined as a simple string, or as a complex dictionary,
    this is a pattern we use when the complex dictionary can have
    implied defaults.

    In the case of a mirror defined as a simple string, it is assumed
    to be a URL of a BuildStream constructed mirror, so all aliases
    can be resolved using BuildStream's knowledge of how a source cache
    is constructed on a mirror, based on that URL.
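
To make the outline above a bit more concrete, here is a small sketch
of how such a configuration might be interpreted. None of this is an
existing API: the "base-url" and "aliases" keys, the function names and
the <base-url>/<alias>/ layout are all assumptions for illustration.

```python
def normalize_mirror(node):
    """Accept a bare URL string or a dictionary with explicit mappings."""
    if isinstance(node, str):
        # Simple string: assumed to be a BuildStream constructed mirror.
        return {"base-url": node, "aliases": {}}
    return {"base-url": node.get("base-url"),
            "aliases": dict(node.get("aliases", {}))}


def choose_mirror(mirrors, user_choice=None):
    """mirrors is an ordered list of (name, node); the first is the default."""
    if not mirrors:
        return None
    if user_choice is None:
        return normalize_mirror(mirrors[0][1])
    for name, node in mirrors:
        if name == user_choice:
            return normalize_mirror(node)
    raise KeyError("unknown mirror: {}".format(user_choice))


def resolve_alias(alias, mirror, project_aliases):
    """Resolve an alias through the single mirror used in this session."""
    if mirror is not None:
        if alias in mirror["aliases"]:
            return mirror["aliases"][alias]
        if mirror["base-url"]:
            # Assumed layout of a BuildStream constructed mirror.
            return "{}/{}/".format(mirror["base-url"].rstrip("/"), alias)
    # A "gap" in the mirror: fall back to the project default alias.
    return project_aliases[alias]
```

Only one mirror is ever selected here, which matches the rule that a
session never consults more than one.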

This approach is to allow interoperability, without setting the bar so
high as to *require* it, still allowing for an easier to use turn-key
solution as the recommended default.

Also, this approach is very strict about addressing the (B) use case.
When we define a mirror; we really expect that the mirror is complete
for all sources, and only one mirror needs to be used.

NOTE: While ultimately I would like to have a turn-key solution
      understood by BuildStream core and easier to use, I am not
      opposed to implementing only the client side of things as a
      first step; with the above outlined style of configuration.

Finally; (and rather unfortunately) there is still one case where a
fallback / roll-over is actually required - although it may be
interesting to have an option to disallow this roll-over explicitly in
production builds:

   When developing a project, a developer needs to add a new Source
   to the project, and possibly build and test a firmware before
   upstreaming any changes to the project.

   In this case, it is important that BuildStream is able to fallback
   to the true upstream URL, as we cannot expect that the source has
   ever been mirrored *yet*.

   As a matter of control and trust; it is desirable to optionally
   disable this rollover, such that a build is guaranteed to have
   only ever obtained sources from a specific mirror.


Circling back to Jonathan Maw's proposal about how rollovers can
happen, I want to stress that the onus should always be on the core
wherever possible, making life as easy as possible for plugins.

As such, I think that any rollover should be implemented with Source
object re-instantiation, where the core has the opportunity to report a
different fully qualified URL via Source.translate_url() in a second
round instantiation - I very much dislike handing out URL lists to the
Source plugins, this gives too much business logic to the Source which
is really not their responsibility.
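
A rough sketch of what I mean, with the core owning the rollover and
the plugin only ever seeing one URL. The names here (FetchError,
candidate_urls, fetch_with_rollover, the allow_upstream switch) are
hypothetical, and the Source class is only a stand-in:

```python
class FetchError(Exception):
    pass


class Source:
    """Stand-in plugin: it knows exactly one URL and nothing about mirrors."""

    def __init__(self, url, fetcher):
        self.url = url
        self._fetcher = fetcher

    def fetch(self):
        return self._fetcher(self.url)


def candidate_urls(mirror_url, upstream_url, allow_upstream=True):
    """Core-side policy: mirror first, upstream only if rollover is allowed."""
    if mirror_url is None:
        return [upstream_url]
    return [mirror_url, upstream_url] if allow_upstream else [mirror_url]


def fetch_with_rollover(make_source, urls):
    """Re-instantiate the Source for each URL until one fetch succeeds."""
    last_error = FetchError("no candidate URLs")
    for url in urls:
        source = make_source(url)  # fresh instance for every attempt
        try:
            return source.fetch()
        except FetchError as error:
            last_error = error
    raise last_error
```

With allow_upstream set to False, a source which has not been mirrored
yet simply fails, which is the guarantee one would want for production
builds.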

Cheers,
    -Tristan


