Re: [BuildStream] A BuildStream fetcher service



Hi,

TL;DR I think we're looking at the wrong problem here.

On Thu, Jun 6, 2019 at 5:39 PM Raoul Hidalgo Charman via buildstream-list <buildstream-list gnome org> wrote:
Hi everyone,

Now that we're starting to look at remote execution more, it's been
noticeable that when doing remote builds, uploading and downloading
sources can take a significant portion of time. Ideally we want to be
able to have a remote service that can fetch these for us and ensure
they are in remote caches, bypassing the need to download a source and
then upload it to the remote server.

I think this is drawing the wrong conclusion.  We don't need a remote service.  The situation you describe needs a BuildStream instance to fill the SourceCache and it requires the lifetime of objects in the SourceCache to be long enough so that you don't need to go back to the origin.

What to track when is highly specific to the project.  I can imagine leveraging source control system hooks to trigger targeted tracking and fetching.  I don't see us requiring infrastructure for this in BuildStream core.
 
If we have configurable endpoints for the SourceCache and the SourceCache's CAS, we should be covered.

For remote execution in the CI case I can see some other optimizations that we previously deferred [1].  We can also be smarter about batching the upload of Command and Action messages, defer checking for missing blobs from the input tree, proactively Execute, and handle FAILED_PRECONDITION.
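
To make that last point concrete, here is a rough sketch of the
optimistic Execute flow.  I'm assuming generated REAPI stubs are
importable as below, and that an upload_missing_inputs() helper
exists; both are assumptions on my part.  Servers signal missing
inputs with the FAILED_PRECONDITION status:

    import grpc
    from build.bazel.remote.execution.v2 import (
        remote_execution_pb2,
        remote_execution_pb2_grpc,
    )

    def execute_optimistically(channel, instance_name, action_digest):
        stub = remote_execution_pb2_grpc.ExecutionStub(channel)
        request = remote_execution_pb2.ExecuteRequest(
            instance_name=instance_name,
            action_digest=action_digest,
        )

        def run():
            # Execute returns a stream of Operations; the last one
            # carries the ExecuteResponse.
            last = None
            for operation in stub.Execute(request):
                last = operation
            return last

        try:
            # No FindMissingBlobs up front: just try to execute.
            return run()
        except grpc.RpcError as error:
            if error.code() != grpc.StatusCode.FAILED_PRECONDITION:
                raise
            # Server reported missing blobs: upload them, then retry.
            upload_missing_inputs(channel, instance_name)  # hypothetical
            return run()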

Cheers,

Sander

[1] https://mail.gnome.org/archives/buildstream-list/2019-April/msg00059.html

This email intends to start
discussion on such a 'fetcher' service, considering a few different
approaches I've thought about, rather than offering a concrete plan.
There are several details that would need to be scoped out further
before implementation can begin.

Approaches
----------

Depending on the goals of this service we might want to go about it
differently, but all of the following approaches involve running a
BuildStream project as a service, configured with remote source caches
to push to and with the elements whose sources we want to fetch.

The simplest approach would just be an instance of BuildStream which
periodically tracks, fetches and pushes a project's elements, with no
interaction from users of the remote cache. This would just need a
`bst source push` command, which should be implemented anyway.
However, it has some downsides: it doesn't allow clients to expand or
change the sources required by an element, or to add elements, and if
CAS expiry causes a source to no longer be present in the cache,
clients will end up downloading and uploading the source anyway.
Furthermore, tracking sources involves downloading the source for many
of the plugins, so without some way of retrieving this information,
any client that wants to track a source will have to download it.
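
For concreteness, a minimal sketch of this periodic instance as a
loop around the CLI (the element names and interval are illustrative,
and `bst source push` is hypothetical, as noted above):

    import subprocess
    import time

    TARGETS = ["core.bst"]   # elements whose sources we keep warm
    INTERVAL = 3600          # seconds between refresh runs

    def refresh_sources():
        # Update refs, download the sources, then push them to the
        # configured remote source cache.
        subprocess.run(["bst", "source", "track", *TARGETS], check=True)
        subprocess.run(["bst", "source", "fetch", *TARGETS], check=True)
        subprocess.run(["bst", "source", "push", *TARGETS], check=True)

    while True:
        refresh_sources()
        time.sleep(INTERVAL)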

Alternatively we could go for a long-running service that clients can
interact with. It would then make sense to implement this as a gRPC
service, with one method to track sources and another to fetch and
push them, both based on a given element's configuration. When
tracking, the service should return the new refs, so that clients can
update their source refs without needing to download and track the
sources themselves.
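
To sketch the shape of such a service (every name below is
hypothetical; the real messages and service would be defined in a
.proto file and generated with grpcio-tools):

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class TrackRequest:
        element_config: str      # the element's source configuration

    @dataclass
    class TrackResponse:
        refs: List[str] = field(default_factory=list)  # new source refs

    class SourceFetcherServicer:
        def Track(self, request, context):
            # Track the element's sources and hand the new refs back,
            # so the client can update its project without downloading
            # anything itself.
            refs = self._track(request.element_config)
            return TrackResponse(refs=refs)

        def Fetch(self, request, context):
            # Fetch the element's sources and push them to the
            # configured remote source cache.
            self._fetch_and_push(request.element_config)

        def _track(self, element_config):
            return []            # stub

        def _fetch_and_push(self, element_config):
            pass                 # stub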

A simple implementation might then just start the appropriate
BuildStream commands in a subprocess, though I'm unsure how this would
scale when dealing with multiple requests at the same time, as running
multiple separate BuildStream instances may cause issues as well as
adding overhead.

Another approach, requiring more effort, would be to add methods to
BuildStream that allow us to have a long-running instance that we can
asynchronously add elements to. This should use the existing scheduler
queues to track, fetch and push sources using the element methods. The
current methods used in BuildStream commands aren't quite appropriate,
as they load up a pipeline of elements and run them through all the
queues until everything has been processed. We want to be able to deal
with an element's sources as they are requested, which will require
extending the scheduler so that it can run indefinitely, as well as
adding methods to put elements on the queues asynchronously. A new
method in Stream could then start the scheduler in this way after
setting up the appropriate queues, allowing us to implement this
similarly to existing BuildStream commands. We can then have a gRPC
service as in the previous approach, which on a track or fetch request
would add the elements to the track and fetch queues respectively, to
be processed by the running scheduler.
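
As a sketch of what that long-running loop might look like, using
plain asyncio as a stand-in for BuildStream's scheduler (all names
here are illustrative, not the real internals):

    import asyncio

    async def run_queue(queue, worker):
        # Unlike the current pipeline, this loop never drains and
        # exits: it waits indefinitely for new elements.
        while True:
            element = await queue.get()
            await worker(element)
            queue.task_done()

    async def track(element):
        print(f"tracking {element}")          # stand-in for a track job

    async def fetch(element):
        print(f"fetching/pushing {element}")  # stand-in for fetch+push

    async def main():
        track_queue = asyncio.Queue()
        fetch_queue = asyncio.Queue()
        asyncio.create_task(run_queue(track_queue, track))
        asyncio.create_task(run_queue(fetch_queue, fetch))
        # A gRPC request handler would enqueue elements like this:
        await track_queue.put("core.bst")
        await fetch_queue.put("core.bst")
        await asyncio.sleep(1)  # give the demo workers time to run

    asyncio.run(main())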

In the cases where we have a service, when requesting sources the
BuildStream client should first query the remote source cache, then
the fetcher service, and only if neither works resort to fetching from
the original URL. If the service still has the source, it can push it
to the source cache when requested, without having to fetch the source
again.
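
In code, the client-side lookup order would be a simple fallback
chain (all three helpers here are hypothetical stubs):

    def query_source_cache(element):
        return None       # ask the configured remote source cache

    def query_fetcher_service(element):
        return None       # ask the fetcher service to push the source

    def fetch_from_origin(element):
        return "sources"  # fall back to the original upstream URL

    def obtain_source(element):
        # Cheapest option first, origin last.
        for provider in (query_source_cache,
                         query_fetcher_service,
                         fetch_from_origin):
            source = provider(element)
            if source is not None:
                return source
        raise RuntimeError("could not obtain sources for " + element)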

Other features and considerations
---------------------------------

A further feature that would be useful, and would require a service,
is allowing clients to access the original unstaged sources. For some
plugins (namely git), this is required in order to open a workspace,
so having the unstaged sources mirrored on a service would avoid
having to fetch the original source again.

We will also probably need some type of expiry for sources: if clients
are allowed to request new sources, or we're tracking and updating
sources, the service will slowly take up more and more space. All
current BuildStream expiry is just for CAS objects, so we'd also need
a method for expiring the source folders, a simple LRU probably being
the best approach.
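
A minimal sketch of what LRU expiry over the unstaged source folders
could look like, using last access time as the recency measure (the
quota value and layout are illustrative):

    import os
    import shutil

    QUOTA_BYTES = 50 * 1024 ** 3  # e.g. keep at most 50 GiB of sources

    def dir_size(path):
        return sum(os.path.getsize(os.path.join(root, name))
                   for root, _, files in os.walk(path)
                   for name in files)

    def expire_sources(source_root):
        dirs = [os.path.join(source_root, d)
                for d in os.listdir(source_root)
                if os.path.isdir(os.path.join(source_root, d))]
        # Least recently used (oldest access time) first.
        dirs.sort(key=lambda d: os.stat(d).st_atime)
        total = sum(dir_size(d) for d in dirs)
        for d in dirs:
            if total <= QUOTA_BYTES:
                break
            total -= dir_size(d)
            shutil.rmtree(d)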

Trying to ensure that sources aren't downloaded and uploaded to remote
servers where possible will also require quite a bit of reworking of
how Element works. All fetch and track methods will have to be updated
to deal with this service when configured with remote execution, and
perhaps some of the build stage should be possible in this fetcher
service too. The build stage currently stages sources and dependency
artifacts and runs integration commands, producing the input directory
for the build to occur in, so this would also need to be done
server-side to avoid downloading and uploading sources.



Let me know if you disagree with any of these approaches or if you think
there's something else that needs to be considered.

Cheers,
Raoul