Re: [BuildStream] Proposal: bst artifact subcommand group



On Mon, 2018-09-10 at 17:01 +0200, Sander Striker wrote:
Hi,

Apologies for the overdue response to this thread.


I have also fallen behind on speccing this out in issues; it keeps
getting pushed down my TODO list.

On Tue, Jul 24, 2018 at 12:48 PM Tristan Van Berkom via Buildstream-list <buildstream-list gnome org> wrote:
Hi all,

There are a few functionalities which have been requested for some
time, such as viewing the build log of an artifact[0], or deleting an
existing artifact from the local cache.

It was recently raised that we cannot determine the build date of an
existing artifact either[1], as it's not a part of artifact metadata.
This discussion evolved into proposing that we have a subgroup of CLI
commands for dealing with specific artifacts.

As such I'd like to propose the following, probably uncontroversial
enhancements:

  o Add the build date to the artifact metadata, as a timezone neutral
    unix timestamp (seconds since epoch in UTC).

  o Standardize on the name of an artifact. This is not really exposed
    in the UX, except that the user may see "artifact names" at times.

    An artifact name is essentially:

       ${project-name}/${element-name}/${cache-key}

    Exposing this to the user essentially means that we have a handy
    syntax for the user to express an artifact on the command line.
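
To make the naming scheme concrete, here is a minimal sketch of how such a name could be split back into its parts. The function and type names are hypothetical, and it assumes that element names may themselves contain '/' (as the gnome-build-meta/core-deps/WebKit example in this thread suggests), so only the first and last components are treated as fixed.

```python
from typing import NamedTuple


class ArtifactName(NamedTuple):
    # Hypothetical container for the three parts of an artifact name
    project: str
    element: str
    cache_key: str


def parse_artifact_name(name: str) -> ArtifactName:
    """Split '<project-name>/<element-name>/<cache-key>' into its parts."""
    parts = name.split("/")
    if len(parts) < 3 or not all(parts):
        raise ValueError(f"not a full artifact name: {name!r}")
    # The first component is the project and the last is the cache key;
    # everything in between is the element name, which may contain '/'.
    return ArtifactName(parts[0], "/".join(parts[1:-1]), parts[-1])
```

For example, parse_artifact_name("gnome-build-meta/core-deps/WebKit/abc123") yields the project "gnome-build-meta", the element "core-deps/WebKit" and the key "abc123".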

I am assuming the $(cache_key) is actually unique enough on its own? 
And this name is purely symbolic, contextualized for a project? 
Nothing would require this under the hood?  As in, I wouldn't expect
a remote artifact cache to carry the $(project-name)/$(element-name)
as part of its keys(?).

Since the beginning, the refs we use to store / retrieve an artifact
have always been namespaced as:

  ${project-name}/${element-name}/${cache-key}

Whether to change that is an interesting discussion indeed.

First, I should say that it is entirely possible, likely even, that two
differing artifacts carry the same key as things currently stand.

This is because we never considered the element name in the key, so two
elements with the same inputs, from different projects for instance, can
produce the same cache key. Without this namespace we risk mixing up
those different artifacts, which can have bad consequences: the logs
might be wrong, or artifact metadata might yield incorrect things.

That said, at this point I don't really feel strongly about this design
point, except that I find it impractical from a UI perspective to use
cache key only instead of full artifact names.

Since we have not introduced this lookup key in any public API yet, now
is a good time to have this discussion - changing this detail mostly
means that we need to ensure it's changed in all the right places and
bump the base artifact version for it.

Maybe there is an opportunity to include the element name in the cache
key algo, and *also* support lookup by full artifact name.
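
As a rough illustration of the collision and of the possible fix, consider the sketch below. This is not BuildStream's actual cache key algorithm; the cache_key() helper and the shape of the inputs dictionary are assumptions made purely for the example.

```python
import hashlib
import json
from typing import Optional


def cache_key(inputs: dict, element_name: Optional[str] = None) -> str:
    """Hash the build inputs, optionally mixing in the element name."""
    payload = dict(inputs)
    if element_name is not None:
        # Hypothetical extra field: makes the key depend on the element name
        payload["element-name"] = element_name
    # Sort the keys so that the same inputs always serialize identically
    encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


inputs = {"sources": ["deadbeef"], "commands": ["make", "make install"]}

# Without the element name, two differently named elements collide
colliding = cache_key(inputs) == cache_key(inputs)

# With the element name mixed in, the keys differ
distinct = cache_key(inputs, "foo.bst") != cache_key(inputs, "bar.bst")
```

Note that mixing in only the element name would still let identically named elements from different projects collide, so the project name would presumably need the same treatment.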

I have a few considerations here:

  * Tab completion

    I would expect a user to not have too many artifacts for a given
    element, and I would expect the CLI to autocomplete artifact names
    while typing.

    With YBD I recall, since we were storing artifacts as tarballs,
    we had the tarballs as ${cache-key}.tgz in subdirectories named
    after the things they built.

  * The `bst artifact list` command

    Similarly to the above, I would very much like a UX where I do:

    - bst artifact list gno<TAB>
    - bst artifact list gnome-build-meta/core-<TAB>
    - bst artifact list gnome-build-meta/core-deps/*

      View all the artifacts of all the elements in the core-deps/
      subdir of the gnome-build-meta project.

      Showing the sizes of the artifacts, their creation dates,
      ordered by creation date

    - bst artifact list gnome-build-meta/core-deps/WebKit/*

      View all of my WebKit artifacts, quickly see which is the
      last, second last WebKit artifact I built, compare those, etc.

  * Uniqueness... Is it really a concern? I do not feel well enough
    advised to say if it is.

    If we want to say that every artifact *can* be addressed by
    only its cache key, then we are effectively saying that every
    build input combination can be safely identified as unique. I
    don't feel qualified to say if this is good enough, and welcome
    input from others.

    Note that I have made comments to the contrary here:

        https://gitlab.com/BuildStream/buildstream/issues/569

    Traditionally, people use an sha256sum to validate that the
    tarball downloaded from a given URL is in fact what we expect
    it to be.

    Saying that "An sha256sum is good enough to uniquely identify
    any blob which could potentially be downloaded from the entire
    internet" is, I feel very strongly, breaking that model.

    I would say that if we did that for tarballs we would have to
    do it for git commit shas as well; and while I can believe that
    a commit sha is enough to identify uniquely every commit in the
    Linux kernel; carrying that over to say it can uniquely identify
    every commit in every git repository in the history of git, is
    another question entirely.

    It's a separate conversation I admit, but I feel this is quite
    related.

  o Add the `bst artifact` subgroup with the following commands:

    o bst artifact list <artifact name glob pattern>

      By default listing only the artifacts which match the glob
      pattern.

      An additional `--long/-l` argument can get us more human
      readable information, similar to what was outlined here[2]

      List the artifacts in order of build date, displaying the build
      date in local time along with the size of the files portion of
      the artifact, and the active workspace if it was created locally
      with a workspace.

Same as above.
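
The listing behaviour described for `bst artifact list` could be sketched roughly as follows. The record layout (name, build_date, size) and both function names are assumptions for illustration only; the build date is taken to be the UTC unix timestamp proposed earlier, converted to local time only for display.

```python
import fnmatch
from datetime import datetime, timezone


def list_artifacts(records, pattern):
    """Return the records whose name matches the glob, oldest build first."""
    # fnmatchcase() lets '*' match '/' as well, so 'project/core-deps/*'
    # also matches artifacts of elements in nested subdirectories.
    matches = [r for r in records if fnmatch.fnmatchcase(r["name"], pattern)]
    return sorted(matches, key=lambda r: r["build_date"])


def format_long(record):
    """Render one line of hypothetical --long output."""
    # Interpret the stored timestamp as UTC, then convert to local time
    utc = datetime.fromtimestamp(record["build_date"], tz=timezone.utc)
    local = utc.astimezone()
    return f"{local:%Y-%m-%d %H:%M:%S %z}  {record['size']:>10}  {record['name']}"


records = [
    {"name": "gnome-build-meta/core-deps/WebKit/bbb", "build_date": 1536591660, "size": 2048},
    {"name": "gnome-build-meta/core-deps/WebKit/aaa", "build_date": 1532429280, "size": 1024},
    {"name": "other-project/base/ccc", "build_date": 1500000000, "size": 512},
]

for record in list_artifacts(records, "gnome-build-meta/core-deps/*"):
    print(format_long(record))
```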

    o bst artifact log <artifact name or pattern>

      Display the artifact build log in the $PAGER, similar to how we
      implement the "log" choice on the prompt of a failing build.

      If a pattern is specified then the log of each will be sent to
      the system pager in series (matching the behavior of
      specifying wildcards to programs like "less").

I would say that $(cache_key) should be valid here too.  Separately
the artifacts should not need to be local to run this operation.
 
    o bst artifact delete <artifact name or glob pattern>

      Delete artifacts from the local cache which match the pattern

    o bst artifact list-content <artifact name or glob pattern>

I would expect $(cache_key) to work here as well.

Right, for all of these points, we should decide whether we really want
to weaken this; I don't mind if we do.

However, I feel pretty strongly that from a UX perspective, we should
be able to use "artifact names" as described in my proposal as well.

 
      This can be useful for scriptability purposes, where one wants to
      generate a manifest of what an artifact contains, or simply for
      a curious user to see what files an artifact contains.

      Similarly, this should have `--long/-l` options to show detailed
      information about the files in the artifact, such as
      user/group/everyone permission bits, ownership bits, file size,
      etc (I think `tar -t` offers this with a `-vv` switch or such).

    o bst artifact diff <artifact name> <artifact name>

      Show differing added/removed and differing files in two artifacts
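
A minimal sketch of what such a diff could compute, assuming each artifact can be reduced to a mapping of file path to content digest. That representation and the diff_artifacts() name are assumptions for the example, not an existing API.

```python
def diff_artifacts(old: dict, new: dict) -> dict:
    """Compare two {path: digest} mappings."""
    old_paths, new_paths = set(old), set(new)
    return {
        # Paths present only in the new artifact
        "added": sorted(new_paths - old_paths),
        # Paths present only in the old artifact
        "removed": sorted(old_paths - new_paths),
        # Paths present in both, but with differing content digests
        "modified": sorted(p for p in old_paths & new_paths if old[p] != new[p]),
    }


old = {"usr/bin/foo": "d1", "usr/lib/libfoo.so": "d2", "usr/share/doc/README": "d3"}
new = {"usr/bin/foo": "d9", "usr/lib/libfoo.so": "d2", "usr/bin/bar": "d4"}
result = diff_artifacts(old, new)
```

Here result reports usr/bin/bar as added, usr/share/doc/README as removed and usr/bin/foo as modified.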

Beyond this:

  o It will be interesting to allow artifact operations to specify
    elements instead of artifact names, for the cases where you just
    want to use the artifact whose cache key corresponds to the project
    state.

+1.  That seems like a useful convenience indeed.

  o It would also be interesting to move `checkout`, `push` and `pull`
    commands under the new `artifact` group (deprecating the existing
    commands).

    This will open up the door to performing checkouts etc at the
    artifact level instead of only supporting the artifacts whose
    cache keys correspond to the project state.

+1.  In addition, it should allow for commands against the remote, beyond push and pull?
 
If nobody opposes the proposal, I will go ahead and roadmap this by
creating a flagship issue on gitlab with a task list, each task
pointing to individual separate issues for the individually proposed
commands.

The bit that is missing is exposing the provenance data, with a command such as:

  bst artifact origins $(cache_key)

I believe the minimum information that should be captured is:
- what sources, and other artifacts went into the build of this artifact
- whether the artifact was derived from a remote execution action, and if so, what the action_key (or url?) was

I think that these fall into two categories, and the `bst artifact`
command subgroup might address either one or both categories, in a way.
If we add a way to extract information about the Sources used in an
artifact, we also want one to extract information about the Sources
used in a given project state.

I.e. anything to do with how the artifact was built, any metadata that
we could ever potentially want to encode into artifact metadata, should
be reachable with a `bst artifact show` command which:

 * Has an extensible `--format` option for new fields.
 * Is pretty symmetrical with `bst show` for elements.
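
To illustrate what an extensible `--format` could look like, here is a small sketch that substitutes %{field} tokens from an artifact's metadata. The %{...} syntax is assumed to mirror the style of `bst show --format`; the field names and the render() helper are hypothetical.

```python
import re


def render(fmt: str, metadata: dict) -> str:
    """Replace each %{field} token in fmt with the matching metadata value."""

    def substitute(match):
        field = match.group(1)
        if field not in metadata:
            raise KeyError(f"unknown field: {field}")
        return str(metadata[field])

    # Field names are assumed to be lowercase words separated by dashes
    return re.sub(r"%\{([a-z-]+)\}", substitute, fmt)


# Hypothetical metadata record for one artifact
metadata = {"name": "core-deps/WebKit", "key": "abc123", "build-date": 1536591660}
line = render("%{name} %{key} %{build-date}", metadata)
```

New fields then only require a new entry in the metadata, and the format string handling stays untouched.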

For the sources, things are a bit more complicated; this needs thought.

 * Ideally we find a way to extend `bst show` to show relevant
   information about the sources in your project state.

   And we carry this on symmetrically for `bst artifact show`

 * Less than ideally, we have a separate command for showing sources.

The problem with sources is that they are 0-N for a given element or
artifact - whereas the `bst show` semantics are amenable to retrieving
fields of a given record (be it an element or an artifact).

Do you have any ideas on how we could practically extend these commands
to retrieve source-related information?

Note that extracting information about Sources might be tricky, and
may be limited to what we can get by extending the Source API to allow
the core to extract further generic information about a Source. This is
true whether we are displaying it for project state, or encoding it in
metadata for later retrieval from an artifact.

However I consider this an implementation detail - we'll need to figure
it out.

Cheers,
    -Tristan


