Re: [BuildStream] bst artifact show - PROPOSAL



Hi James,

On Tue, 2019-08-06 at 16:50 +0100, James Ennis wrote:
Hi Tristan,
[...]
So most of my inspiration for this came from the fact that we should try
to be consistent with our current artifact subcommands. However, our
current subcommands are not even consistent with one another! The
subcommands I was trying to be consistent with were `bst artifact
log/delete`, which do not support a --deps option. So, we never actually
load a dependency graph, regardless of the arguments.

In hindsight, I agree that a --deps option is important, so we should
ensure this.

As things currently stand, given an artifact ref (or a list of refs), we
do not have support for loading a dependency graph but, given an Element
name, or a list of Element names, we do.

The above is the exact reason why `bst artifact push/pull/checkout` do
not support taking artifact refs as arguments.

Way back when Richard was working on this (!1008 [0]) there was some
(ugly) logic in the frontend which bailed out of `bst artifact checkout`
if `--deps` was not None AND a ref was present [1].

Yes, this is precisely why I am raising the detail, so that the
non-trivial work involved in loading the dependency graph via artifacts
(without loading project data) does not get overlooked, and we don't
face the same limitations again.

When Richard originally landed that work to accept artifact refs on the
command line, it was clear that we were still missing this ability to
load build graphs recursively. The work was clearly unfinished, and I
hope that we don't leave it that way.

So we could implement something like this for `bst artifact
show` and that should (partially) address your first two points. But it
seems to me that until we have support for loading a dependency graph
from an artifact ref, there isn't a "cleaner" way around this (unless
I'm missing something!).

Exactly, the code to load the build graph from artifacts is presently
missing, and I wouldn't personally go ahead with more artifact
subcommands without implementing this bit first.

This is also a very valuable bit, as it lets us reconstruct a build
graph in the absence of the precise project state which created the
artifacts (I can tell someone to just recursively check something out
from an artifact server with a specific cache key, and do my deployment
without needing the project at hand at all - I can also better answer
important questions about the provenance of an artifact by
reconstructing the metadata of its dependencies).
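
To make that concrete, here is a minimal sketch (in Python) of what
recursive graph reconstruction from artifact metadata could look like.
Everything here is hypothetical - fetch_metadata() and dependency_refs
are stand-ins for whatever the artifact proto actually exposes, not
existing BuildStream API:

    def reconstruct_graph(cache, ref, graph=None):
        # Walk the dependency refs recorded in each artifact's
        # metadata, performing only shallow (metadata-only) fetches
        # at every step - no project data is ever needed.
        if graph is None:
            graph = {}
        if ref in graph:
            return graph
        metadata = cache.fetch_metadata(ref)  # hypothetical shallow fetch
        graph[ref] = list(metadata.dependency_refs)
        for dep_ref in metadata.dependency_refs:
            reconstruct_graph(cache, dep_ref, graph)
        return graph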

  * Re-raising that there was discussion around it being
    desirable to also address artifacts only by cache key (without
    needing to use the full ref, making the full ref name mostly
    just a user convenience).

Ah, yes. I vaguely remember reading this. I agree with what you say
about leaving this out as a separate topic. I'll look for/file an issue
and mark it as a follow-up to this work once this discussion has
settled.

It's also not entirely clear to me how you intend to implement querying
of remotes; for instance, is there a shallow "pull queue" which ensures
we download just the shallow artifacts (metadata only) and runs before
actually showing anything?

Apologies, I should have covered this in the first email.

In short, no, I was not thinking of using anything like a shallow pull
queue. Instead, I was thinking along the lines of splitting up our
current ArtifactCache._pull_artifact() [2] method into two methods - as
the availability check for the artifact is *kind of* performed there.
These methods would be _check_available() and _pull_artifact(), where
the second is called only if the first returns True. This would be one
of the prior commits.
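
To illustrate the split - the method bodies and the remote API below
are placeholders, not the real ArtifactCache internals:

    class ArtifactCache:

        def _check_available(self, ref, remote):
            # Ask the remote for the artifact proto only, without
            # downloading any blobs; True means the ref is present.
            return remote.get_artifact_proto(ref) is not None

        def _pull_artifact(self, ref, remote):
            # Called only once _check_available() has returned True;
            # performs the actual download of the artifact contents.
            ...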

It would probably then make sense to add an Element.check_if_in_remote()
(or similar) method which calls ArtifactCache._check_available() (or we
could make the artifact cache method public?). A similar method would
exist for ArtifactElements.
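
Something like this, where the attribute and accessor names are made
up purely for the sake of the sketch:

    class Element:

        def check_if_in_remote(self, remote):
            # Thin wrapper delegating to the artifact cache; an
            # equivalent method would live on ArtifactElement too.
            return self._artifacts._check_available(
                self._get_artifact_name(), remote
            )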

I would then imagine Stream.artifact_show() to do something like this:
1. Load project (including the initialisation of remotes)
2. Load the targets to obtain a list of Elements and/or ArtifactElements
3. Loop over the list of remotes (or simply take the specified remote),
    then for each remote, call Element/ArtifactElement.check_if_in_remote()
    on each loaded Element/ArtifactElement
4. Each time this returns True, we append to a list of tuples containing
    the artifact ref and the remote URL.
5. We then output this list to the user in a pretty way (described in
    the last email).

Note that 1. and 2. are already handled by Stream._load().
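
A rough sketch of steps 3 to 5, assuming check_if_in_remote() exists as
proposed above (the remotes attribute, the url field and the _load()
return value are assumptions, simplified for illustration):

    class Stream:

        def artifact_show(self, targets):
            # Steps 1 and 2: Stream._load() already loads the project
            # and resolves targets into Elements/ArtifactElements.
            target_objects = self._load(targets)

            # Steps 3 and 4: for each remote, record which refs it
            # holds, by asking each loaded object in turn.
            found = []
            for remote in self._artifacts.remotes:
                for obj in target_objects:
                    if obj.check_if_in_remote(remote):
                        found.append((obj.get_artifact_name(), remote.url))

            # Step 5: hand the (ref, url) tuples to the frontend for
            # pretty-printing (as described in the last email).
            return found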

Also note that we can obtain a sorted list (which represents the
dependency graph) in the Stream, so I don't think this changes much when
we consider the --deps option (apart from the fact that the list is much
larger).

I believe this should achieve what I've previously described.

This is a lot more implementation detail than I meant to provoke, sorry
for that haha, still I appreciate the detail :)

On this I think that:

  * We already have this concept of "partial artifacts": artifacts
    which don't have build trees, artifacts which lack things.

    This seems like a natural extension: just treat an artifact
    with only metadata as a shallow "partial artifact" and blend
    in with that API.

  * We've made mistakes in the past where we've just called
    out to network resources from the main process, ignoring whatever
    the networking related user preferences say and without providing
    the right feedback about what is going on (implicit fetches of
    subprojects without any UI feedback or queueing, for example).

    I think we ought not to repeat that here: if we're going to fetch
    the artifacts, we should do it through the regular scheduler
    mechanics.
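
Purely to illustrate what "through the regular scheduler mechanics"
could mean here - the Queue stand-in and _pull_metadata() below are
invented for the sketch, not the real buildstream._scheduler API:

    class Queue:
        # Stand-in for the real scheduler queue base class.
        pass

    class ShallowPullQueue(Queue):
        # Runs like any other queue, so it shows up in the UI and
        # respects the network-related user preferences.
        action_name = "Query"
        complete_name = "Queried"

        def process(self, element):
            # Download only the artifact proto (metadata), never the
            # blobs, leaving a shallow "partial artifact" locally.
            element._pull_metadata()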

Is the downloaded artifact metadata persisted in the local artifact
cache so that a subsequent show command need not re-download the shallow
artifacts?

I like this idea, but what if, much later, the artifact in the remote
has expired? A subsequent show command might show us false information.

Aside from the finer points of what could be done with knowledge about
artifacts which existed at some point in the past... caching the
metadata still saves us download time.

If the server does indeed have the artifact, then we need not download
its metadata if we already have it.
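
Concretely, reusing the hypothetical helpers from the sketches above
(has_metadata() and load_metadata() are likewise assumptions), a
subsequent show could re-verify presence on the server - so expiry is
caught - while skipping the metadata download when we already hold it:

    def show_ref(cache, ref, remote):
        # Always re-check the remote, so an expired artifact is
        # reported as missing rather than shown from a stale cache.
        if not cache._check_available(ref, remote):
            return None
        # Only download the metadata if we don't already have it.
        if not cache.has_metadata(ref):
            cache.fetch_metadata(ref)
        return cache.load_metadata(ref)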

Cheers,
    -Tristan



