Re: [BuildStream] Proposal: bst artifact subcommand group



Hi,

On Sun, Sep 16, 2018 at 9:59 AM Tristan Van Berkom <tristan vanberkom codethink co uk> wrote:
Hi Sander,

This one seems to keep falling through the cracks, picking it back up ;-)

It is one of those threads :)
 
On Wed, 2018-09-12 at 15:57 +0100, Sander Striker wrote:
> Hi,
>
> On Tue, Sep 11, 2018 at 12:26 PM Tristan Van Berkom <tristan vanberkom codethink co uk> wrote:
> > On Mon, 2018-09-10 at 17:01 +0200, Sander Striker wrote:
>
> [...] 
> > > I am assuming the $(cache_key) is actually unique enough on its own?  
> > > And this name is purely symbolic, contextualized for a project? 
> > > Nothing would require this under the hood?  As in, I wouldn't expect
> > > a remote artifact cache to carry the $(project-name)/$(element-name)
> > > as part of its keys(?).
> >
> > Since the beginning, the refs we use to store / retrieve an artifact
> > have always been namespaced as:
> >
> >   ${project-name}/${element-name}/${cache-key}
> >
> > Whether to change that is an interesting discussion indeed.
>
> I'm merely thinking about potential reuse; and whether the
> namespacing is causing an artificial reason why cache results are not
> shareable.

Not an artificial reason really no.

Consider two elements from different projects which have the same
inputs: these will produce the same `${cache-key}`, but we cannot go
so far as to say that the artifacts produced from those inputs will be
the same.

That would be unfortunate.  So if you include e.g. WebKit in two projects that share an ArtifactCache, you'll be building it twice?
 
That would only be a good assumption for the `files/` subdirectory of
the artifact, i.e. the output that a build actually generated.
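
To make that concrete, the artifact under discussion is shaped roughly
like this (illustrative layout only):

    ${project-name}/${element-name}/${cache-key}/
        files/   # the build output -- plausibly shareable across projects
        meta/    # provenance metadata -- project specific
        logs/    # build logs -- may embed project context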

Right...  I think we're getting to the core of the issue :).  This part of the artifact is the expensive part to produce.  And any opportunity we have not to redo work here, we should take IMO.
 
Depending on what kind of provenance information we encode in the
`meta/` directory, and what kind of information is included in the
build logs, this assumption will be incorrect.

Sure.
 
If, for instance, we want to encode in the artifact's metadata which
project it came from, then reusing an artifact generated by one
project in another project will certainly be incorrect, even if the
`files/` output stored in that artifact is the same.

It makes me think that we may need an indirect mapping here.  One where we can share the binaries between the artifacts, and perhaps fill out the "which project the artifact came from" metadata when we build in the context of said project.
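
One possible shape for that indirection (purely illustrative, with
made-up project names):

    project-a/WebKit/<cache-key>
        meta/  -> project-a specific provenance
        files  -> digest of a shared, content-addressed file tree

    project-b/WebKit/<cache-key>
        meta/  -> project-b specific provenance
        files  -> the same digest; the expensive content is stored once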

I'm expecting that, when using remote execution, we will actually reuse much of the builds, because Actions care about the source input tree, and not about where that tree came from.  Could that concept perhaps be extended to an implementation for local builds as well?
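
For reference, remote execution keys its caching roughly like this
(simplified; the real definitions live in the REAPI protos):

    Action      = { command_digest, input_root_digest, ... }
    ActionCache : hash(Action) -> ActionResult

The input root is a Merkle tree digest, so cache hits depend only on
the content of the staged inputs, never on which project staged them.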

There is another angle to think about here too, which is the aliases
which belong to the project.conf which produced an element: if the
artifact is not produced within at least its project's namespace, then
we have no assurance at all that the aliases used to produce the final
URLs used in Sources are in fact the URLs we expect.

Looking at https://gitlab.com/BuildStream/buildstream/blob/master/buildstream/element.py#L2015 there is a whole lot that goes into the cache key.  Definitely more than I expected.  If the ArtifactCache was meant as an interface with different implementations, why is the name of the implementation included in the cache key?
 
Interestingly, this again touches on issue 569, which I raised but
which is not really related: by not binding the outputs of a project
into a namespace, we weaken the statement that `refs` can in fact be
unique for the URLs we declare them for (because now the URLs fall out
of the equation, as they are only defined by the project.conf).

I see your point.

So if we can have the cache key map to a source key (the root of a Merkle tree representing the staged sources), then you can create a cache key function that substitutes all of the sources with the source key, and it will be correct without the element and project names.
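
A minimal sketch of what I mean (a hypothetical function, not
BuildStream's actual key algorithm):

    import hashlib
    import json

    def cache_key(source_key, element_config, dep_keys):
        # source_key: the root digest of the merkle tree of staged
        # sources; element and project names are deliberately left
        # out, so two elements with identical inputs share a key.
        payload = json.dumps({
            'sources': source_key,
            'config': element_config,
            'deps': sorted(dep_keys),
        }, sort_keys=True)
        return hashlib.sha256(payload.encode('utf-8')).hexdigest()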
 
> > First, I should say that it is entirely possible, likely even, that two
> > differing artifacts carry the same key as things currently stand.
> >
> > This is because we never considered the element name in the key, so two
> > elements with the same inputs from different projects for instance, can
> >  produce the same cache key, without this namespace we risk mixing
> > those different artifacts (can have bad consequences, the logs might be
> > wrong, or artifact metadata might yield incorrect things).
>
> You're thinking of both artifacts already being produced
> individually, and now being joined in the same namespace?
>
> I was thinking from the perspective of in new builds the build for
> the second element would see a cache hit, given the output is going
> to be equivalent?

Right, as stated above, only the output would be reasonably equivalent,
not the rest of the artifact; and that would be wrong.

> > That said, at this point I don't really feel strongly about this design
> > point, except that I find it impractical from a UI perspective to use
> > cache key only instead of full artifact names.
>
> The other use case to consider is restructuring your element tree,
> either by renaming or moving the .bst file.  If the inputs remain the
> same, there would be no need to rebuild.  The artifact name now
> doesn't bear a relation to the actual element.

I didn't bring this up but indeed, this was my initial reasoning behind
not considering the element names in artifact cache keys.

I think however that this flew out the window the moment that we
started considering the "shape of the dependency tree" in the weak
artifact cache keys (for non-strict build mode).

Ugh.
 
Now we have a weird situation where, when weak keys are employed, the
names and paths of elements are important.
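
Roughly, the difference is this (a sketch with a hypothetical
hash_of(), not the real implementation):

    # strict mode: dependencies contribute their full cache keys
    strict_key = hash_of(config, sources, [dep.cache_key for dep in deps])

    # non-strict mode: the weak key folds in dependency *names*, so
    # renaming core-deps/WebKit.bst changes the weak key of every
    # reverse dependency, even though no content changed
    weak_key = hash_of(config, sources, [dep.name for dep in deps])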

For non-strict builds we have the current state of the project to work with, right?
 
I'm not sure if it is still possible to assert that, if I create a
dependency tree, and then rename a .bst file and references to it, it
will not trigger a rebuild.

We should probably write out what happens step by step here...
 
This would indeed be worth an added regression test if we want to
assert that this should be possible.
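
Something along these lines, perhaps (a sketch; update_reference() is
a hypothetical helper, the rest mirrors idioms from our test suite):

    import os

    def test_rename_does_not_trigger_rebuild(cli, datafiles):
        project = str(datafiles)

        # Build once so the artifact is cached
        result = cli.run(project=project, args=['build', 'target.bst'])
        result.assert_success()

        # Rename an element, and rewrite the dependency in target.bst
        # with the hypothetical update_reference() helper
        os.rename(os.path.join(project, 'elements', 'dep.bst'),
                  os.path.join(project, 'elements', 'renamed.bst'))
        update_reference(project, 'target.bst', 'dep.bst', 'renamed.bst')

        # The renamed element should still be cached, not need a rebuild
        assert cli.get_element_state(project, 'renamed.bst') == 'cached'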

 
> It appears we have two needs to satisfy here, and they may not have to be solved the same way:
> 1 - we need a way for our system to be efficient and effective in reusing artifacts that have the same inputs.
> 2 - we need a way for humans to make sense of artifacts outside of the context of a project/elements at a specific version.
>
> For 1, the use of naked cache_keys seems the most obvious way to maximize this property.
> For 2, having a UI that shows artifacts by project/element name, potentially ordered by time.
>
> Note that this may mean that we have a need for multiple mappings, to get both.

For reasons outlined in the beginning of my reply here, the project
and element path should be considered an important part of the
artifact key, because they should be assumed to affect the output.

While this is not true for the `files/` subdirectory, it will be true
for the artifact overall, i.e. its metadata.

I'm coming from the angle that we should never need to repeat the production of "files/" for something we have already built.
 
> > Since we have not introduced this lookup key in any public API yet, now
> > is a good time to have this discussion - changing this detail mostly
> > means that we need to ensure it's changed in all the right places and
> > bump the base artifact version for it.
> >
> > Maybe there is an opportunity to include the element name in the cache
> > key algo, and *also* support lookup by full artifact name.
> >
> > I have a few considerations here:
> >
> >   * Tab completion
> >
> >     I would expect a user to not have too many artifacts for a given
> >     element, and I would expect the CLI to autocomplete artifact names
> >     while typing.
> >
> >     With YBD I recall, since we were storing artifacts as tarballs,
> >     we had the tarballs as ${cache-key}.tgz in subdirectories named
> >     after the things they built.
>
> Why?  If you know the cache_key you don't need autocompletion, unless you are typing it instead of copy-'n-paste. 

Because you don't know the cache key. Similarly, you don't know the
URL of any Google document, because those URLs are impossible for
humans to remember.

The use case I was thinking of was: I have $cachekey_1 in production, I have $cachekey_2 running on a canary...
In other words, I do know the cache_key.
 
>  If you don't know the cache_key, then you probably want the artifact associated with your current project state.

I want to `ls` my artifacts and see which ones I have around, and "do
stuff with them". I might even want to set some aside after a build,
by extracting them somehow and preventing them from being deleted by
LRU expiry, so that in the future I can compare them or inspect their
logs for whatever reason.

I am of the mind that you wouldn't do this on arbitrary artifacts.  You would do it on the ones that you either know are causing an issue (there are reported issues against them), or that are available in the context of your project because you just/recently built them.
 
All of this is very straightforward and user friendly when your
artifacts are tarballs; I can definitely say that as a user of YBD it
was very useful to `ls -l` all the artifacts of a given name, see the
mtimes, and even test older generated VM images and compare those
images to the ones I just built.

Which ones though?  The latest?  A well-known set?
 
>   If not, then it really comes down to "just give me any artifact" for this project/element.  I would imagine that "latest" would be a useful qualifier here.
> Are we only talking about the local artifact cache, or does this extend to the full shared artifact cache?
> Let us step back on what the user is actually trying to accomplish here, and verify whether tab completion based on $project/$element/$cache_key is valid/useful.
>  
> >   * The `bst artifact list` command
> >
> >     Similarly to the above, I would very much like a UX where I do:
> >
> >     - bst artifact list gno<TAB>
> >     - bst artifact list gnome-build-meta/core-<TAB>
> >     - bst artifact
> > list gnome-build-meta/core-deps/*
> >
> >       View all the artifacts of all the elements in the core-deps/
> >       subdir of the gnome-build-meta project.
> >
> >       Showing the sizes of the artifacts, their creation dates,
> >       ordered by creation date
>
> What do you then use this information for?
>  
> >     - bst artifact list gnome-build-meta/core-deps/WebKit/*
> >
> >       View all of my WebKit artifacts, quickly see which is the
> >       last, second last WebKit artifact I built, compare those, etc.
>
> Ok, there's a use case.  You want to compare sizes/content of the latest and second to latest?

Yes, I want my artifacts at my finger tips, so I can extract and boot
them in a VM (if they are system images), or install them to a system
(if they are packages), I want to bisect my history of artifacts, save
them, compare them, their logs, etc.

But there's a catch there.  That bisection would be based either on time, or on project state.  In the former case there are no guarantees about the ordering of different versions of the project/element.  That is, the latest artifact in the cache might be a build of a year-old version.
In the case of bisecting on project state, you're looking up by element, and the system knows the $cachekey.
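
For the project-state case I would expect to drive the bisection from
version control rather than from the cache, e.g. (hypothetical usage,
with my-test.sh standing in for whatever check you are running):

    git bisect start <bad-commit> <good-commit>
    git bisect run sh -c 'bst build core-deps/WebKit.bst && ./my-test.sh'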

> >   * Uniqueness... Is it really a concern? I don't feel well enough
> >     advised to say if it is.
> >
> >     If we want to say that every artifact *can* be addressed by
> >     only its cache key, then we are effectively saying that every
> >     build input combination can be safely identified as unique. I
> >     don't feel qualified to say if this is good enough and welcome
> >     input from others.
>
> Same here.

Right, keeping in context here that I still think the project name and
element name need to be considered in the generation of the cache key
if we're going to stop namespacing; because even if the outputs are
identical, the artifact itself should not be assumed to be.

That said, if we're going to start considering project name and element
name in the cache key, it's probably better to just keep storing the
artifacts under their namespaced keys as usual.

> >     Note that I have made comments to the contrary here:
> >
> >         https://gitlab.com/BuildStream/buildstream/issues/569
> >
> >     Traditionally, people use an sha256sum to validate that the
> >     tarball downloaded from a given URL is in fact what we expect
> >     it to be.
> >
> >     Saying that "An sha256sum is good enough to uniquely identify
> >     any blob which could potentially be downloaded from the entire
> >     internet" I feel very strongly is breaking that model.
>
> The scope is limited to the artifact cache that you are using.  And
> whether this artifact cache is shared between projects or not is up
> to the projects.
>  
> >     I would say that if we did that for tarballs we would have to
> >     do it for git commit shas as well; and while I can believe that
> >     a commit sha is enough to identify uniquely every commit in the
> >     Linux kernel; carrying that over to say it can uniquely identify
> >     every commit in every git repository in the history of git, is
> >     another question entirely.
> >
> >     It's a separate conversation I admit, but I feel this is quite
> >     related.
>
> I believe this is indeed separate.  I think what you are referring to
> is whether configuration of sources can be considered the same when
> just the refs match?
>
> I would argue that if the content of the staged sources and
> dependencies, and the configuration of the element are the same, then
> the output should be reusable.

Right, and I am not qualified to say that:

   "The sha256sum checksums we use are enough to uniquely identify a
    tarball or any blob of data which could ever possibly be downloaded
    from any random location on the internet"

In practice we normally say that:

   "The sha256sum checksum is enough to validate that a tarball
    downloaded from this exact URL is the exact tarball we're looking
    for"

We already weaken this statement with the aliases just a bit, but
keeping the project itself in context at least ensures that the
checksums we use are used to check data downloaded from locations
which the project authors intend.

I do believe we should be able to say that a staged source, as identified by its Merkle tree (the source key), is unique.  In remote execution we do rely on inputs not colliding.  The digests there are effectively a combination of hash and file size, which reduces the chance of collision.

> > > >   o Add the `bst artifact` subgroup with the following commands:
> > > > 
> > > >     o bst artifact list <artifact name glob pattern>
> > > > 
> > > >       By default listing only the artifacts which match the glob
> > > >       pattern.
> > > > 
> > > >       An additional `--long/-l` argument can get us more human
> > > >       readable information, similar to what was outlined here[2]
> > > > 
> > > >       List the artifacts in order of build date, displaying the build
> > > >       date in local time along with the size of the files portion of
> > > >       the artifact, and the active workspace if it was created locally
> > > >       with a workspace.
> > > 
> > > Same as above.
> > > 
> > > >     o bst artifact log <artifact name or pattern>
> > > > 
> > > >       Display the artifact build log in the $PAGER, similar to how we
> > > >       implement the "log" choice on the prompt of a failing build.
> > > > 
> > > >       If a pattern is specified then the log of each will be sent to
> > > >       the system pager in series (matching to the behavior of
> > > >       specifying wildcards to programs like "less").
> > > 
> > > I would say that $(cache_key) should be valid here too.  Separately
> > > the artifacts should not need to be local to run this operation.
> > >  
> > > >     o bst artifact delete <artifact name or glob pattern>
> > > > 
> > > >       Delete artifacts from the local cache which match the pattern
> > > > 
> > > >     o bst artifact list-content <artifact name or glob pattern>
> > > 
> > > I would expect $(cache_key) to work here as well.
> >
> > Right, for all of these points, we should decide whether we really want
> > to weaken this; I don't mind if we do.
> >
> > However, I feel pretty strongly that from a UX perspective, we should
> > be able to use "artifact names" as described in my proposal as well.
>
> Let's make sure that we consider operating against a remote artifact cache as well as a local one.

Good point, I wasn't considering that people might want to operate
directly on remote artifacts, list them and such, but I am sure they
will want to do it.

Assuming that `bst artifact pull` is relatively expensive, I expect that querying metadata on a remote artifact cache is indeed going to be a common pattern.

I don't think this changes anything though, as an artifact is the same
whether it is local or remote.

Probably we want remote-capable artifact commands to be implemented
separately; it may affect the Client <--> Server protocols too.

I would suggest that:

  * Any of the `bst artifact` subgroup commands operate on the local
    cache by default.

I'm not sure about that.  Do we consider the project and everything we know about it (including the shared artifact cache), or do we only consider local state?  I wonder if we're introducing asymmetry by defaulting to local when a remote artifact cache is configured.
 
  * Any of the `bst artifact` subgroup commands operate in exactly the
    same way on any explicitly specified remote, using a `--remote`
    argument to specify its URL.

That is fair, so effectively being able to query a remote artifact cache without having any project context even.
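
So e.g. (hypothetical invocations, assuming the proposed `--remote`
option and a made-up cache URL):

    bst artifact list 'gnome-build-meta/core-deps/*'
    bst artifact list --remote https://cache.example.com \
        'gnome-build-meta/core-deps/*'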
 
Where the `--remote` thing can be implemented as a separate activity;
having it implemented locally first means that BuildStream on the
server side is already prepared to handle a request for a remote
client.

I'm missing something here I think... how do you figure "BuildStream on the server side is already prepared..."?  To know whether we can actually support our UI, I believe we need to investigate what ArtifactCache service API is feasible that would support it.

Cheers,
    -Tristan

Cheers,

Sander

