Re: [BuildStream] A BuildStream fetcher service

From: Tristan Van Berkom <tristan vanberkom codethink co uk>
To: Raoul Hidalgo Charman <raoul hidalgocharman codethink co uk>, BuildStream <buildstream-list gnome org>
Subject: Re: [BuildStream] A BuildStream fetcher service
Date: Fri, 07 Jun 2019 17:29:16 +0900

Hi Raoul,

Upon initially seeing this message, I thought this was a lot of words
for what appears to be a pretty simple thing - but I can see that you
have put a lot of thought into the requirements, great work :)

I do have some ideas here.

Keep in mind that my comments here have the following objectives:

  * Implement simple features in BuildStream which can potentially
    be useful for multiple use cases

  * Avoid scope creep by implementing simple things and focusing on
    the scriptability of the command line tool


On Thu, 2019-06-06 at 16:38 +0100, Raoul Hidalgo Charman via
buildstream-list wrote:

Hi everyone,

Now that we're starting to look at remote execution more, it's been
noticeable that when doing remote builds, uploading and downloading
sources can take a significant portion of time. Ideally we want to be
able to have a remote service that can fetch these for us and ensure
they are in remote caches, bypassing the need to download a source and
then upload it to the remote server. This email intends to start
discussion on such a 'fetcher' service, considering a few different
approaches I've thought about, rather than offering a concrete plan.
There are several details that would need to be scoped out more before
implementation can begin on this.

Approaches
----------

Depending on the goals of this service we might want to go about it
differently, with all the following approaches involving having a
BuildStream project running as a service, configured with remote source
caches to push to, and containing the elements whose sources we want
to fetch.

The simplest approach would just be an instance of buildstream which
periodically tracks, fetches and pushes a project's elements with no
interaction from users of the remote cache. This would just need to
implement a `bst source push` method, which should be implemented
anyway.


I agree with `bst source push` being an important command, but I
disagree that this is a requisite to implementing something useful
here.

This setup assumes that the fetching occurs on a different machine
from the one hosting the source cache. I think this is only one of the
setups a user might want to use.

Consider, for instance, this long-standing enhancement request:

    Feature request: sharing built artifacts between bst instances
    https://gitlab.com/BuildStream/buildstream/issues/415

I think it really makes sense to be allowed to have an artifact server
operate on a buildstream local cache, such that one should be able to
build something and share the result with others (whether the builds
are automated or not in this context).

My initial thought when looking at the subject line was that this was
going to be a script which runs `bst track` and `bst fetch`
periodically on a machine, and just shares its artifacts without
actively pushing them.

However, it has some downsides: it doesn't allow clients to
expand or change sources required by an element, or add elements, and if
CAS expiry causes a source to no longer be present in the cache, clients
will end up downloading and uploading the source anyway. Furthermore
tracking of sources involves downloading the source for many of the
plugins, so without some way of retrieving this information, any client
that wants to track a source will have to download it.


Downside A: "it doesn't allow clients to expand or change sources
             required by an element"
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
So my take on this is that this script makes some further assumptions
about how the user is managing their project, i.e. the script is not
itself a part of BuildStream core and can take some liberties; it
basically just calls bst commands from a cron job.

As such: The script can assume that it is fetching on behalf of a
project that is *revisioned in git*.

So the script naturally needs to be configured with:

  * The upstream git URL of the BuildStream project it is fetching on
    behalf of
  * The branch (or list of branches) in the upstream BuildStream
    project which it should be fetching on behalf of
  * The target(s) to fetch/track for (or not, if you are relying on
    the new "default" target(s) feature)
  * The remote CAS to push to, if any (assuming you might just be
    hosting from the local CAS on that machine)

When the script runs, triggered by a cron job... it does the following
things:

  (A) Self updates from git

      This allows one to fix the script remotely when an error is
      observed, usually by tweaking some of its configuration, so
      that it is successful on the next cron job.

  (B) For every branch of the BuildStream project it is intended
      to fetch/track for, do the following:

      - git fetch the latest of the configured branch tip and
        get yourself a clean working tree/checkout of that branch.
      - Run bst track on the specified targets
      - Run bst fetch on the specified targets
      - Run bst source push, if configured to do that
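
The per-branch steps in (B) could be sketched roughly as follows. This
is only an illustration in Python, not an existing BuildStream script:
`FetcherConfig`, `plan_branch` and `run_once` are hypothetical names,
the working directory is assumed to be an existing clone of the
upstream project with an `origin` remote, and `bst source push` is the
proposed command discussed in this thread rather than one that is
assumed to exist.

```python
import subprocess
from dataclasses import dataclass, field


@dataclass
class FetcherConfig:
    """Hypothetical settings mirroring the configuration list above."""
    branches: list                               # branches to fetch on behalf of
    targets: list = field(default_factory=list)  # empty: rely on default targets
    push: bool = False                           # push to a remote CAS afterwards


def plan_branch(config, branch):
    """Return the commands for one branch, in order, as argv lists."""
    plan = [
        ["git", "fetch", "origin", branch],
        # Detach onto the fetched tip and drop leftovers from the last run.
        ["git", "checkout", "--force", "--detach", "FETCH_HEAD"],
        ["git", "clean", "-fdx"],
        ["bst", "track", *config.targets],
        ["bst", "fetch", *config.targets],
    ]
    if config.push:
        # Assumption: the proposed `bst source push` command is available.
        plan.append(["bst", "source", "push", *config.targets])
    return plan


def run_once(config, workdir, runner=subprocess.run):
    """Run the plan for every configured branch inside an existing clone."""
    for branch in config.branches:
        for argv in plan_branch(config, branch):
            # check=True makes the first failing command abort the run.
            runner(argv, cwd=workdir, check=True)
```

A cron job would then call something like
`run_once(FetcherConfig(branches=["master"]), "/path/to/clone")`.
Keeping the plan as plain data makes it easy to print the commands for
a dry run before letting the service execute them.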


This means that one can almost set up this service today if #415 were
to be implemented (which doesn't seem all that challenging to implement
on its own; only cache cleanup needs to be decided, i.e. whether it
should be the server or the `bst` command which is responsible for
purging expired objects from the cache).



Downside B: Furthermore tracking of sources involves downloading the
source for many of the plugins, so without some way of retrieving this
information, any client that wants to track a source will have to
download it.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
This is another unfortunate downside of our mistake in having allowed
system level installation of plugin files with the `pip` origin in the
first place.

Plugins are either distributed with BuildStream core, or are project
bound. BuildStream should really be taking care of ensuring that we
obtain all of the plugins at startup, or at the very least, without
automation, the plugins a project uses beyond the upstream core set
should be 'local' plugins and provided with tools like git submodules.


In *any* scenario (even if you have installation bound plugins), you
are going to have to babysit this service when new plugins appear with
host tool requirements that are not satisfied by the host running the
service (your life will be a bit easier though when your custom plugins
are project bound and you access them with git submodules).

My suggestion here would be simply to have it fail loudly: ensure that
the scripting around BuildStream does things like send an email or an
IRC notification when the process fails, and post the error log
somewhere. Here is an example of a simple IRC notification script:

    https://github.com/flatpak/flatpak-build-scripts/blob/master/extra/irc-notify.py

This way people who are running the service can observe the error
message from a Plugin.preflight() method which says:

   "I am missing this host tool or host library ! please install it !"

In practice, this case should not happen all that often, so I think the
maintenance involved in keeping up the host running a script like this
is fairly minimal (once off the ground, I wouldn't expect to have to
add a new host dependency to the fetching host more than 2 or 3 times
a year for a given project, once the process has run successfully at
least once with a lot of plugins already available).

If you have installation bound plugins, this will happen a lot more
often and you will need to periodically log into that fetcher host and
update the plugins for the service (i.e. if your plugins ever change or
add new APIs that the project needs, they need to be kept in sync
somehow).
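
As a rough illustration of the "fail loudly" suggestion above, a
wrapper along these lines could run each command, keep the error log,
and hand a message to whatever notification channel is in use. This is
a sketch only: `run_loudly` and the `notify` hook are hypothetical
names, and the hook is assumed to be any callable that sends an email,
posts to IRC, or similar.

```python
import subprocess


def run_loudly(argv, notify, log_path):
    """Run argv; on failure save its output to log_path and call notify()."""
    result = subprocess.run(
        argv, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    if result.returncode != 0:
        # Keep the full output so the failure can be inspected later.
        with open(log_path, "w") as log:
            log.write(result.stdout)
        # notify is a hypothetical hook supplied by the caller.
        notify(
            f"{' '.join(argv)} failed with exit code {result.returncode}; "
            f"log saved to {log_path}"
        )
    return result.returncode == 0
```

For example, `run_loudly(["bst", "fetch", "app.bst"], print,
"fetcher-error.log")` would print a one-line failure notice and leave
the plugin's preflight error in the log file (the element name here is
made up).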

[...]

Let me know if you disagree with any of these approaches or if you think
there's something else that needs to be considered.


In summary, I don't like the idea of scope creeping BuildStream and
adding more services here; I would very much like to keep it a simple
command line tool and limit the API surface to such.

I think that by implementing either #415, or `bst source push`, or
ideally both, we can provide all the mechanics which are needed in
BuildStream for a user to achieve what they want.

This looks like perfect material for a well documented shell script to
live in the contrib/ directory of BuildStream, allowing people to use
and copy and modify however they like.

Cheers,
    -Tristan
