Re: [BuildStream] Protect against plugin modifications of artifacts



Thanks for chiming in Ben,

Some replies inline.

On Fri, 2020-06-26 at 08:52 +0000, Benjamin Schubert wrote:
Hey all,

This is a lot of mails to respond inline, so I will rather answer as a single
email. Please excuse the length of it.

TLDR
====

I think we should be closing the API to prevent writing in the sandbox, but I
think we need other construct beforehand, and to rethink how impacted plugins
actually should work in BuildStream.
A potential way forward I can see is having a 'artifact' source kind, more
in details in the mail.


I'm not fond of this idea at all, or the implications it might lead to.

First of all, treating artifacts as Sources might be backdoored as an
excuse to modify otherwise safely constructed output in artifacts,
using host python tooling.

Sources have a heavy responsibility to stage *exactly bit for bit* what
is represented by the ref, the same ref for a given source should never
ever result in anything different being staged.

So when we talk about sources like `tar`, we are extracting content and
the output is guaranteed to be the same, even if the tar files were
constructed non-deterministically at their origins.


Scope
=====

First of all, in order to ensure I correctly understand the premises of this
thread:

The aim is to more specifically remove the `Directory.open_files()` call for
writing files in the directory. Is that accurate? Or is there more things that
would disappear?


What would it affect?
=====================

A quick look around the plugins shows that the following plugins would be
affected by such a change:

bst_plugins_experimental/elements/bazelize.py
bst_plugins_experimental/elements/collect_integration.py
bst_plugins_experimental/elements/collect_manifest.py
bst_plugins_experimental/elements/dpkg_build.py
bst_plugins_experimental/elements/dpkg_deploy.py
bst_plugins_experimental/elements/flatpak_image.py
bst_plugins_experimental/elements/oci.py
bst_plugins_experimental/elements/tar_element.py
bst_plugins_containers/elements/docker_container.py

So this seems to be more than "a few" plugins and most of those seem to have
a thing in common: their dependencies are actually part of their output,
in a transformed way. I'll come back to that later.

For the sake of concentrating more on the actual problems, let's cross
at least one of these off the list:

collect_manifest cannot be done
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The `collect_manifest` cannot possibly be implemented under current
BuildStream API, *at all*, there is absolutely no valid way of
supporting this plugin, so it's best to not even discuss this.

This is explained at:

    https://gitlab.com/BuildStream/bst-plugins-experimental/-/issues/2

I still feel this kind of manifest construction is better suited to a
script which calls `bst artifact show --format "..."` for various
artifacts, extracting and parsing relevant data to construct a
manifest.

Whatever the approach could possibly be for this brand of plugin, it
cannot possibly be implemented at all without first augmenting the
Source API contract in some way to allow Source objects to report
something standardized about their `url` and `ref`.


What should we trust in the sandbox and plugins?
================================================

It is mentionned in this thread that we do not trust python's APIs to run
repeatably for plugins. However we do rely on it for BuildStream itself, and
there is little we can do around that area of trust.

I do agree that we should do our best to prevent plugin authors from shooting
themselves in the foot and to help them achieve the most repeatable build
possible. However I don't think there is a difference between trusting Python
for us, and trusting python for the plugins.

I'm not sure I understand the difference of "for us and for plugins".

From my interpretation, I think you mean:

   * In the core we have python code which writes to the sandbox, for
     instance Element.stage().

   * Would we trust a plugin with this power ?

Of course not, over our dead bodies.

Every bit of trust that we relinquish to a Plugin is forcefully pried
from our cold dead fingers. Ideally, we do not trust plugins at all for
anything: Plugins only exist to chose how the core is to operate on the
given inputs, absolutely nothing more.

Why ?

Because in this way, we can provide real guarantees, if things are
buggy, we can fix them in the core.

Plugins are supposed to have little to no power to get anything wrong,
they can be whipped up on the fly for a given use case, and live in
someones repository for years, miles away from our reach and ability to
fix.


Asides from the above, I should also stress, we do not trust
*ourselves* to write data into artifacts in the core either. Giving
plugins opportunity to compose output in the sandbox (by any means) is
giving plugins a level of trust that we do not give ourselves.

Every manipulation of the sandbox we do in the core, is to preserve the
data as it was created by the sandbox, and ensure the determinism of
everything (e.g. setting deterministic mtimes, ensuring umask
consistency when staging sources, etc).


What way forward?
=================

I agree that having plugins being able to write any files in a sandbox is not
the best and prevents us (BuildStream core) from making any assumption.

There seems to be a need for not executing code in the sandbox but rather
outside of it though. Based on the plugins above, I will try to go deeper and
explain what I see as common problems they have and alternative ways to allow
them to continue doing their work.

All the plugins above have a common point: they treat their dependencies as
their input, as other elements would treat their sources. As a brief summary,
here is the list of operations they are doing:

All element treat their dependencies as their input.

If elements want to process other dependencies independently, that's
exactly why they are allowed to call Element.stage_artifact(), and use
Element.search() to obtain specific elements to stage in the sandboxes
so that they can be independently processed.

I don't see why we would need any additional API for any of this,
although Element.search() needs to be fixed for cross-junction lookups,
that's already on my immediate todo list.

- Archiving / checksuming files and writing that in the sandbox
- Writing a manifest of some kind

Having them run script inside the sandbox means they would have a harder time
knowing what is their input and what is there to allow doing their job, and we
currently don't have any construct in BuildStream for such things.

Again, with Element.search() and Element.stage_artifact() should be
sufficient to stage elements at locations if they need to be
independently processed, using other tooling staged at "/" to process
this and run scripts.

It can be that we need a core facility to expose artifact public data
deterministically in some way(s) for the sake of processing with
scripts inside a sandbox.

This seems to show to me that there is a need for an `artifact` source kind,
which would be the output of a previous element. I will start another
thread to keep things separate as to how this would work, but I believe this
would solve the first part of our problem, that is, be able to separate build
dependencies from "sources".

The second part would then be, for plugins that are 'packaging', like docker or
oci, to rewrite the element to work on the `artifact` source instead. And for
other elements, like `bazelize`, to actually be a `SourceTransform` based on
the previous elements.

A brief example of how I could see that working (More in the next ML thread
I'll start after this, let's not discuss the details here too much unless
relevent)

For an element, we could have as sources:

```
kind: oci_image

sources:
- kind: artifact
  elements: my_stack_for_which_to_create_a_oci_image
- kind: manifesto  # Write a manifesto in the sources based on previous sources

build-depends:
- tar (or buildah)
```

I believe that such a new construct would allow BuildStream plugins to achieve
all they achieve today without writing to the sandbox directly for elements.

I also think that we should potentially try this first before closing the API
alltogether on a few plugins.

Coming back to the beginning of my reply, I feel very strongly against
this if I'm interpreting this correctly. This implies host side
processing and manipulation of data which will make it's way to an
artifact, which is equally as incorrect as writing directly into the
sandbox.

Importing sources is the weakest link, it needs to remain as brief as
possible, and do the best job of simply ensuring bit-for-bit equal
staging for any two times it bares the same ref (it should definitely
not be doing the stuff that is going on in `oci` and `docker` plugins
currently, which I consider rather reckless for BuildStream).

There is no reason I can see why oci and docker plugins cannot be
implemented safely using the sandbox, it's very simple; they simply
need to stage tooling to operate on the data, just like we stage
autotools (except possibly in a different sysroot).


Summary
=======
I think that for the summary of plugins which currently exploit the
directory API to write data to the sandbox, things fall into a few
categories:

  A) Stuff that we can totally do in the sandbox, but people didnt
     do because they were lazy or wanted something a bit more
     convenient to work with than BuildStream.

     E.g. oci, docker, tar - these are all expected to be implemented
     by staging the required tooling into "/" and processing in the
     sandbox.

  B) Stuff that cannot be implemented at all in BuildStream.

     E.g. collect_manifest, this is just a plain hack which cannot be
     supported at all without designing some way for BuildStream to
     deliver the data which `collect_manifest` desires.

  C) Possibly metadata for package managers like rpm or dpkg.

     In this case we might be able to create some helper facility to
     safely create a file based on a %{variable}, in order to avoid
     having to use a shell command like `cat > file <<EOF ...` and
     the escaping which this entails.


Cheers,
    -Tristan




[Date Prev][Date Next]   [Thread Prev][Thread Next]   [Thread Index] [Date Index] [Author Index]