Re: [BuildStream] Proposal: bst source delete command



Hi Tristan, Jürg,

Thanks for the prompt replies. Some comments inline.

On Sat, Feb 22, 2020 at 9:29 PM Jürg Billeter <j bitron ch> wrote:

On Sat, 2020-02-22 at 10:02 +0900, Tristan Van Berkom via buildstream-
list wrote:
Hi Chandan,

On Feb 22, 2020, at 6:49 AM, Chandan Singh <chandan chandansingh net> wrote:

Hi all,

BuildStream currently offers no way of systematically deleting locally cached
sources. I think it should however do so - not only to maintain symmetry with
the `artifact delete` command but also because sources can be substantial in
size in many cases.

In this message, I propose to add a new `source delete` command.

Not really against this, but I’d like to point out that this is not a
usage pattern I’d want to require of normal users. The artifact
delete command was created mostly for the convenience of developers,
e.g to be able to easily rerun tasks in their plugin while developing
it.

The user on the other hand should simply be given the convenience of
BuildStream respecting the users decided cache quotas, and never be
pestered to manually type a command or asked to know which things
they’d like to delete.

From this perspective, I’d hope to see this as a power user feature
(which is fine to have I think).

I generally agree. However, one issue is that we only have a cache
quota for CAS. There is no expiry mechanism for the plugin-controlled
cache directories. To some extent this is intentional as, depending on
the source and the internet connection, fetching may be slow (and
sometimes not reliable).

`bst source delete` would indeed not be a solution to this, though. For
users that mostly fetch sources from a CAS-based remote source cache
(or Remote Asset server in the future), this may also not be a
significant issue. However, we probably still need a solution for this
at some point.

I was going to say the same thing that the plugin cache directories
are entirely controlled by the plugins themselves. So long as
BuildStream provides plugins a mechanism to have a non-CAS cache, we
need some way of cleaning it up.

`source delete` is not going to solve all of it, but it would surely
be better than `rm ~/.cache/buildstream/git/url_of_git_repo` that I
have to do currently in order to get rid of a large repository.

Unlike its `artifact delete` counterpart, this is slightly more nuanced as we
need to delete the sources from two places:

1. From the local CAS cache
2. From the plugin's cache directories

(1) is simple, but (2) would require cooperation from source plugins. To that
end, I propose to add a new abstract method for source plugins, like
`Source.delete()`.

I’m not convinced about this part.

This will require some API change support from the plugin, but
instead of adding more to plugins, I feel like we might get away with
being more strict with the plugin’s cache directories.

If we instead decide that plugins need to ask for a directory while
specifying the name of the source which can later be given to `bst
source delete`, then BuildStream retains ownership of the directory
and do what it pleases with it later, without asking the plugin to do
anything.

This sounds sensible. We already have `get_mirror_directory()` but we
currently leave the structure within that directory completely up to
the plugin. In most cases the plugin uses a subdirectory of the form
`utils.url_directory_name(url)`. One option could be to add an abstract
method `Source.get_url()` and then let BuildStream core construct the
full mirror path. This additional method may also be needed for the
second part of the Remote Asset API support. Details TBD.

I like the idea of being stricter with handing out cache directories
to plugins. But, I am not entirely convinced about adding something
like `Source.get_url()`. This is primarily because URLs aren't a thing
in our source API at present, other than the `translate_url()` helper.
Also, we don't restrict sources to have only one URL so it can get a
bit hairy in that case.

Alternatively, what about including the project and element name in
the path of the mirror directory? This would make the mirror directory
unique for each source type for a given element. This would have a
downside that sources will not be able to share the same cache
directory across element and project boundaries. On the upside, it
would make cleanup simpler.

I am not sure if having the same source in different projects/elements
is that common of a use case. My instinct says no, but don't have the
data to back it up myself. If this is acceptable, it would simplify a
few things. I'd be curious to hear what others think.

Open questions
--------------

1. Since source plugins do not know about how many times a a given source is
  used, they can't do reference counting of sources. So, imagine a scenario
  where elements A and B both use a git repository G. Using this approach,
  `bst source delete A` would end up deleting it, as we have no way of knowing
  that B is also using it. This may not be too bad but the user may have
  expected the source to persist if another element is still using it.

This is a good question, what exactly is `A` in `bst source delete A`
?

I wasn’t expecting that we’d be passing element names here, but how
can we address sources directly ?

I think for the power user feature it's fine to pass element names here
and delete the sources regardless of whether they are used by other
elements as well (they may also be used by other projects). As I see
it, the main purpose is to make it possible to force a refetch (for
debugging, development) or manually free up some space.

Sorry if I didn't make this bit clearer. A and B are both elements,
that may or may not be from the same project. I am not proposing to
allow referencing individual sources. `bst source delete A` would
delete all sources for element A. This is similar to how other `bst
source` subcommands work, i.e. they only operate at the element level.
For example, `source fetch` fetches all of them and `source track`
tracks all of them.

2. What to do with unreachable sources? How do we delete them? It is not
  inconceivable to imagine that certain URLs are no longer referenced. But,
  again, we don't have any way of ascertaining that.

This is related to my comment that we'll need a solution for the
missing expiry mechanism of the plugin-controlled cache. I don't think
`bst source delete` can solve that issue.

We could theoretically add a command to delete all unreferenced
sources. However, I can't think of a sensible interface for this when
considering that many users will work with multiple projects. Requiring
users to specify all project directories whose sources they wish to
keep doesn't seem ideal. That said, it may make sense to add this as
another power user feature.

Yeah, the fact that they can be referenced by multiple projects is
what makes this quite tricky. The alternative mirror directory
structure I proposed above would simplify this as there would be no
sharing across project boundaries. So, it might become easier to write
some sort of prune command.

Thanks!
Chandan


[Date Prev][Date Next]   [Thread Prev][Thread Next]   [Thread Index] [Date Index] [Author Index]