Re: [BuildStream] Proposal: bst artifact subcommand group

From: Sander Striker <s striker striker nl>
To: Tristan Van Berkom <tristan vanberkom codethink co uk>
Cc: BuildStream <buildstream-list gnome org>
Subject: Re: [BuildStream] Proposal: bst artifact subcommand group
Date: Wed, 12 Sep 2018 15:57:59 +0100

Hi,

On Tue, Sep 11, 2018 at 12:26 PM Tristan Van Berkom <tristan vanberkom codethink co uk> wrote:

On Mon, 2018-09-10 at 17:01 +0200, Sander Striker wrote:

[...]

> I am assuming the $(cache_key) is actually unique enough on its own?

> And this name is purely symbolic, contextualized for a project?
> Nothing would require this under the hood? As in, I wouldn't expect
> a remote artifact cache to carry the $(project-name)/$(element-name)
> as part of its keys(?).

Since the beginning, the refs we use to store / retrieve an artifact
have always been namespaced as:

${project-name}/${element-name}/${cache-key}

Whether to change that is an interesting discussion indeed.
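
As an illustration of the namespacing described above, here is a minimal sketch of composing and splitting such a ref. The names `ArtifactRef` and `parse_ref` are hypothetical and not part of BuildStream, and any normalization the real implementation applies to element names is ignored.

```python
from typing import NamedTuple


class ArtifactRef(NamedTuple):
    """Hypothetical holder for the three parts of a namespaced ref."""
    project: str
    element: str
    cache_key: str

    def __str__(self) -> str:
        # ${project-name}/${element-name}/${cache-key}
        return f"{self.project}/{self.element}/{self.cache_key}"


def parse_ref(ref: str) -> ArtifactRef:
    # Element names may themselves contain '/', so take the project
    # from the front and the cache key from the back.
    project, rest = ref.split("/", 1)
    element, cache_key = rest.rsplit("/", 1)
    return ArtifactRef(project, element, cache_key)
```

Splitting from both ends keeps element names in subdirectories (for example `core-deps/WebKit.bst`) intact.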

I'm merely thinking about potential reuse, and whether the namespacing is an artificial reason why cache results are not shareable.

First, I should say that it is entirely possible, likely even, that two
differing artifacts carry the same key as things currently stand.

This is because we never considered the element name in the key, so two
elements with the same inputs, from different projects for instance, can
produce the same cache key. Without this namespace we risk mixing
those different artifacts (which can have bad consequences: the logs might
be wrong, or artifact metadata might yield incorrect things).
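
To make the collision concrete, here is a rough sketch. It assumes a cache key is simply a SHA-256 over a canonical serialization of the build inputs; the `cache_key` function and the input fields are invented for illustration and are not BuildStream's actual key algorithm.

```python
import hashlib
import json


def cache_key(inputs: dict) -> str:
    # Hypothetical key: hash of a canonical JSON dump of the inputs.
    canonical = json.dumps(inputs, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Two elements in different projects with identical inputs (made-up values).
inputs = {"sources": ["example-source-ref"], "config": {"prefix": "/usr"}, "deps": []}
key_a = cache_key(inputs)  # e.g. project-a/hello.bst
key_b = cache_key(inputs)  # e.g. project-b/greeter.bst

# Folding the element name into the key separates them again.
named_a = cache_key({**inputs, "element-name": "hello.bst"})
named_b = cache_key({**inputs, "element-name": "greeter.bst"})
```

Here `key_a` and `key_b` are equal, so only the namespace keeps the two artifacts apart, while `named_a` and `named_b` differ.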

You're thinking of both artifacts already being produced individually, and now being joined in the same namespace?

I was thinking from the perspective of new builds: the build for the second element would see a cache hit, given the output is going to be equivalent?

That said, at this point I don't really feel strongly about this design
point, except that I find it impractical from a UI perspective to use
cache key only instead of full artifact names.

The other use case to consider is restructuring your element tree, either by renaming or moving the .bst file. If the inputs remain the same, there would be no need to rebuild. The artifact name now doesn't bear a relation to the actual element.

It appears we have two needs to satisfy here, and they may not have to be solved the same way:

1 - we need a way for our system to be efficient and effective in reusing artifacts that have the same inputs.

2 - we need a way for humans to make sense of artifacts outside of the context of a project/elements at a specific version.

For 1, the use of naked cache_keys seems the most obvious way to maximize this property.

For 2, having a UI that shows artifacts by project/element name, potentially ordered by time.

Note that this may mean that we have a need for multiple mappings, to get both.
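
A minimal sketch of what such a pair of mappings could look like, assuming an in-memory store; the class `ArtifactIndex` and its methods are hypothetical and only illustrate the idea of keeping reuse (by cache key) separate from human-facing naming.

```python
from collections import defaultdict


class ArtifactIndex:
    """Hypothetical index: one mapping for reuse, one for humans."""

    def __init__(self):
        self.by_key = {}                  # cache_key -> artifact payload
        self.by_name = defaultdict(list)  # "project/element" -> [(built_at, cache_key)]

    def add(self, project, element, key, built_at, payload):
        # Need 1: store the payload once, under the naked cache key.
        self.by_key.setdefault(key, payload)
        # Need 2: remember which project/element produced or used it, and when.
        entry = (built_at, key)
        if entry not in self.by_name[f"{project}/{element}"]:
            self.by_name[f"{project}/{element}"].append(entry)

    def lookup(self, key):
        return self.by_key.get(key)

    def latest(self, project, element):
        entries = self.by_name.get(f"{project}/{element}")
        if not entries:
            return None
        return max(entries)[1]  # cache key of the most recent entry
```

With this shape, renaming or moving an element only adds a new name entry; the payload stored under the cache key is reused as-is.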

Since we have not introduced this lookup key in any public API yet, now
is a good time to have this discussion - changing this detail mostly
means that we need to ensure it's changed in all the right places and
bump the base artifact version for it.

Maybe there is an opportunity to include the element name in the cache
key algo, and *also* support lookup by full artifact name.

I have a few considerations here:

* Tab completion

I would expect a user to not have too many artifacts for a given
element, and I would expect the CLI to autocomplete artifact names
while typing.

With YBD I recall, since we were storing artifacts as tarballs,
we had the tarballs as ${cache-key}.tgz in subdirectories named
after the things they built.

Why? If you know the cache_key you don't need autocompletion, unless you are typing it instead of copy-'n-paste. If you don't know the cache_key, then you probably want the artifact associated with your current project state. If not, then it really comes down to "just give me any artifact" for this project/element. I would imagine that "latest" would be a useful qualifier here.

Are we only talking about the local artifact cache, or does this extend to the full shared artifact cache?

Let us step back on what the user is actually trying to accomplish here, and verify whether tab completion based on $project/$element/$cache_key is valid/useful.

* The `bst artifact list` command

Similarly to the above, I would very much like a UX where I do:

- bst artifact list gno<TAB>
- bst artifact list gnome-build-meta/core-<TAB>
- bst artifact list gnome-build-meta/core-deps/*

View all the artifacts of all the elements in the core-deps/
subdir of the gnome-build-meta project.

Showing the sizes of the artifacts, their creation dates,
ordered by creation date

What do you then use this information for?

- bst artifact list gnome-build-meta/core-deps/WebKit/*

View all of my WebKit artifacts, quickly see which is the
last, second last WebKit artifact I built, compare those, etc.

Ok, there's a use case. You want to compare sizes/content of the latest and second to latest?
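
For reference, the listing behaviour described above (glob match on the artifact name, ordered by creation date) can be sketched as follows. The record fields and the `list_artifacts` helper are assumptions for illustration; this is not how `bst` is implemented.

```python
from fnmatch import fnmatchcase


def list_artifacts(records, pattern):
    # Note: fnmatch's '*' also matches '/', so a trailing '/*' covers
    # everything below that prefix, including nested element names.
    matches = [r for r in records if fnmatchcase(r["name"], pattern)]
    return sorted(matches, key=lambda r: r["created"])


# Made-up records: name, size in bytes, creation time as a Unix timestamp.
records = [
    {"name": "gnome-build-meta/core-deps/WebKit/aaa", "size": 100, "created": 20},
    {"name": "gnome-build-meta/core-deps/WebKit/bbb", "size": 120, "created": 10},
    {"name": "gnome-build-meta/core/gtk/ccc", "size": 50, "created": 5},
]

webkit = list_artifacts(records, "gnome-build-meta/core-deps/WebKit/*")
```

The last two entries of `webkit` are then the second-to-latest and latest WebKit artifacts, which is what the comparison use case needs.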

* Uniqueness... Is it really a concern? I feel not well enough
advised to say if it is.

If we want to say that every artifact *can* be addressed by
only its cache key, then we are effectively saying that every
build input combination can be safely identified as unique, I
don't feel qualified to say if this is good enough and welcome
input from others.

Same here.

Note that I have made comments to the contrary here:

https://gitlab.com/BuildStream/buildstream/issues/569

Traditionally, people use an sha256sum to validate that the
tarball downloaded from a given URL is in fact what we expect
it to be.

Saying that "An sha256sum is good enough to uniquely identify
any blob which could potentially be downloaded from the entire
internet" I feel very strongly is breaking that model.

The scope is limited to the artifact cache that you are using. And whether this artifact cache is shared between projects or not is up to the projects.

I would say that if we did that for tarballs we would have to
do it for git commit shas as well; and while I can believe that
a commit sha is enough to identify uniquely every commit in the
Linux kernel; carrying that over to say it can uniquely identify
every commit in every git repository in the history of git, is
another question entirely.

It's a separate conversation I admit, but I feel this is quite
related.

I believe this is indeed separate. I think what you are referring to is whether configuration of sources can be considered the same when just the refs match?

I would argue that if the content of the staged sources and dependencies, and the configuration of the element are the same, then the output should be reusable.

> > o Add the `bst artifact` subgroup with the following commands:
> >
> > o bst artifact list <artifact name glob pattern>
> >
> > By default listing only the artifacts which match the glob
> > pattern.
> >
> > An additional `--long/-l` argument can get us more human
> > readable information, similar to what was outlined here[2]
> >
> > List the artifacts in order of build date, displaying the build
> > date in local time along with the size of the files portion of
> > the artifact, and the active workspace if it was created locally
> > with a workspace.
>
> Same as above.
>
> > o bst artifact log <artifact name or pattern>
> >
> > Display the artifact build log in the $PAGER, similar to how we
> > implement the "log" choice on the prompt of a failing build.
> >
> > If a pattern is specified then the log of each will be sent to
> > the system pager in series (matching to the behavior of
> > specifying wildcards to programs like "less").
>
> I would say that $(cache_key) should be valid here too. Separately
> the artifacts should not need to be local to run this operation.
>
> > o bst artifact delete <artifact name or glob pattern>
> >
> > Delete artifacts from the local cache which match the pattern
> >
> > o bst artifact list-content <artifact name or glob pattern>
>
> I would expect $(cache_key) to work here as well.

Right, for all of these points, we should decide whether we really want
to weaken this; I don't mind if we do.

However, I feel pretty strongly that from a UX perspective, we should
be able to use "artifact names" as described in my proposal as well.

Let's make sure that we consider operating against a remote artifact cache as well as a local one.

> > This can be useful for scriptability purposes, where one wants to
> > generate a manifest of what an artifact contains, or simply for
> > a curious user to see what files an artifact contains.
> >
> > Similarly, this should have `--long/-l` options to show detailed
> > information about the files in the artifact, such as
> > user/group/everyone permission bits, ownership bits, file size,
> > etc (I think `tar -t` offers this with a `-vv` switch or such).
> >
> > o bst artifact diff <artifact name> <artifact name>
> >
> > Show differing added/removed and differing files in two artifacts
> >
> > Beyond this:
> >
> > o It will be interesting to allow artifact operations to specify
> > elements instead of artifact names, for the cases where you just
> > want to use the artifact whose cache key corresponds to the project
> > state.
>
> +1. That seems like a useful convenience indeed.
>
> > o It would also be interesting to move `checkout`, `push` and `pull`
> > commands under the new `artifact` group (deprecating the existing
> > commands).
> >
> > This will open up the door to performing checkouts etc at the
> > artifact level instead of only supporting the artifacts whose
> > cache keys correspond to the project state.
>
> +1. In addition, it should allow for commands against the remote, beyond push and pull?
>
> > If nobody opposes the proposal, I will go ahead and roadmap this by
> > creating a flagship issue on gitlab with a task list, each task
> > pointing to individual separate issues for the individually proposed
> > commands.
>
> The bit that is missing is exposing the provenance data. A command such as
>
> bst artifact origins $(cache_key)
>
> I believe the minimum information that should be captured is:
> - what sources, and other artifacts went into the build of this artifact
> - whether the artifact was derived from a remote execution action, and if so, what the action_key (or url?) was
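
The minimum provenance information listed above could be captured in a record along these lines. This is only a sketch: the `Provenance` class and its field names are hypothetical, and nothing here reflects an existing artifact metadata format.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Provenance:
    """Hypothetical provenance record for one artifact."""
    cache_key: str
    # Sources that went into the build (refs or URLs, format left open).
    sources: List[str] = field(default_factory=list)
    # Cache keys of the other artifacts that went into the build.
    dependency_keys: List[str] = field(default_factory=list)
    # Action key if the artifact came from a remote execution action.
    remote_action_key: Optional[str] = None

    @property
    def built_remotely(self) -> bool:
        return self.remote_action_key is not None
```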

I think that these fall into two categories, and the `bst artifact`
command subgroup might address either one or both categories, in a way.
If we add a way to extract information about the Sources used in an
artifact, we also want one to extract information about the Sources
used in a given project state.

I.e. anything to do with how the artifact was built, any metadata that
we could ever potentially want to encode into artifact metadata, should
be reachable with a `bst artifact show` command which:

* Has an extensible `--format` option for new fields.
* Is pretty symmetrical with `bst show` for elements.

That sounds sensible. I think this brings back the --yaml discussion from the other thread :).
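
To illustrate what an extensible format option buys us, here is a small sketch of `%{field}` style substitution over an artifact record. The placeholder syntax is modelled on the style `bst show --format` uses; the helper `format_record` and the field names are assumptions made up for this example.

```python
import re

_FIELD = re.compile(r"%\{([a-z-]+)\}")


def format_record(fmt: str, record: dict) -> str:
    def substitute(match):
        name = match.group(1)
        # Unknown fields are left untouched, so adding new fields later
        # does not break older format strings.
        return str(record.get(name, match.group(0)))

    return _FIELD.sub(substitute, fmt)


line = format_record(
    "%{name} %{key} %{size}",
    {"name": "core-deps/WebKit.bst", "key": "0123abcd", "size": 1024},
)
```

New metadata then only needs a new key in the record; the command-line surface stays the same.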

For the sources, things are a bit more complicated, this needs thought.

* Ideally we find a way to extend `bst show` to show relevant
information about the sources in your project state.

And we carry this on symmetrically for `bst artifact show`

* Less than ideally, we have a separate command for showing sources.

The problem with sources is that they are 0-N for a given element or
artifact - whereas the `bst show` semantics are amenable to retrieving
fields of a given record (be it an element or an artifact).

Do you have some ideas how we could practically extend these commands
to retrieve source-related information?

Note that extracting information about Sources might be tricky and
limited to extending the Source API to allow the core to extract
further generic information about a Source, this is true whether we are
displaying it for project state, or encoding it in metadata for later
retrieval from an artifact.

However I consider this an implementation detail - we'll need to figure
it out.

+1.

Cheers,
-Tristan

Cheers,

Sander

