Re: [BuildStream] Protect against plugin modifications of artifacts





On 23/06/2020 14:04, Tristan Van Berkom wrote:
One more round...

On Tue, 2020-06-23 at 11:12 +0100, William Salmon wrote:

[...]
Of course you need to stage the tooling which is used to create output,
that is because BuildStream intentionally enforces that you always use
deterministic input, and the exact build/copy of the binaries used to
create output, is a part of the input.

   > The fact that generating data in python is problematic is not a
reason to avoid fixing illegal writing to the sandbox (which probably
mostly happens *due* to the latter problem); it's a reason to *also*
ensure that doesn't happen.
   >

While the addition may have been an error in your eyes, if writing to
the sandbox were not public API I would be arguing that it should be,
for the reasons above.

It was most definitely an error.

The plugins need to have a way to stage files in locations of their
choosing; before the vdir abstraction was in place, the only way to do
this was by providing a directory argument.

This was abused, and has now led to artifact data which is generated
non-deterministically, without controlling the inputs of this output,
which is at the heart of BuildStream's promise.

I feel you are conflating two things: are plugins `non-deterministic`,
and should we let plugins alter what's in the sandbox. The answer to
"should plugins be deterministic" should clearly be yes, and the answer
to "should plugins affect what happens in the sandbox" seems clearly
yes. If plugins are non-deterministic then we need to fix that; plugins
affect the build in a number of ways, and if we can't trust plugins then
we can't trust anything about bst, as every element is built with a
plugin.

Once plugins are trustable, and they must be or the whole concept of
cache keys falls apart, then your whole argument for why we can't put
things in the sandbox falls apart. Given that we must fix plugins so
they are trustable, I fail to see an issue with plugins putting things
into the sandbox.

There are two things you appear to be conflating: the stability and
correctness of cache keys, and the reproducibility of build artifacts.


This whole discussion is about reproducibility, not about cache key
composition (although it was raised as an orthogonal concern, it is not
central to why we don't use plugins to create output).

For reproducibility, the plugins cannot reasonably be trusted to
compose data reproducibly. The premise of creating reproducible output
in artifacts is to use deterministic inputs, and the host version of
installed python, plus the versions of any python libraries which are
running, are not deterministic.
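
To make concrete what "host environment" covers here, a minimal sketch
follows. The helper name host_python_fingerprint is made up for
illustration and is not part of BuildStream. It only lists a few
host-side values that can differ from one machine to the next, and that
any code running in the host interpreter implicitly depends on:

```python
import platform
import sys


def host_python_fingerprint():
    """Collect a few host interpreter details (hypothetical helper).

    None of these values come from the project being built; they are
    properties of whatever Python happens to be installed on the host.
    """
    return {
        "implementation": platform.python_implementation(),
        "version": platform.python_version(),
        "hexversion": sys.hexversion,
    }


fingerprint = host_python_fingerprint()
print(fingerprint)
```

Two hosts building the same project can print different values here,
and the versions of installed python libraries would add further
variation on top.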

BuildStream promises that output is based on deterministic inputs; do
not conflate this with the calculation of cache keys, which BuildStream
uses to identify the inputs.

I've already demonstrated in my first reply to you an example of how
the output of collect_manifest is already non-reproducible, precisely
*because* the host version of python will have an effect on its
output. I emphasized that this was an *example* (not an invitation to
haggle over whether 'dict' can be trusted in the future; there is no
guarantee that it can). It was a single example to demonstrate that
host python *matters* when constructing output.
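
A minimal sketch of the kind of dependence meant here. This is not the
actual collect_manifest code, which is not shown in this thread;
serialize_manifest and the sample data are hypothetical. It assumes a
Python where dict preserves insertion order (3.7 and later), and shows
that the serialized bytes follow dict iteration order unless the
ordering is pinned explicitly:

```python
import json


def serialize_manifest(entries, sort_keys):
    """Serialize a manifest dict to bytes (hypothetical helper)."""
    return json.dumps(entries, sort_keys=sort_keys, indent=2).encode("utf-8")


# Same content, different insertion order.
manifest_a = {"zlib": "1.2.11", "bash": "5.0"}
manifest_b = {"bash": "5.0", "zlib": "1.2.11"}

# Without pinning the order, the bytes follow dict iteration order.
unsorted_differs = (
    serialize_manifest(manifest_a, sort_keys=False)
    != serialize_manifest(manifest_b, sort_keys=False)
)

# With sorted keys, both serialize to the same bytes.
sorted_same = (
    serialize_manifest(manifest_a, sort_keys=True)
    == serialize_manifest(manifest_b, sort_keys=True)
)

print(unsorted_differs, sorted_same)
```

Pinning the order removes this one source of variation; it does not by
itself remove the dependence on the host interpreter and its libraries.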

Keep the following things in mind:

   * An important set of BuildStream users will be striving to produce
     reproducible output.

     If they achieve reproducibility today, they should be able to take
     the exact same project state, and use the latest version of
     BuildStream 2, 10 years from now... and they should be able
     to produce the exact same output, bit-for-bit.

   * While BuildStream all by itself cannot guarantee reproducible
     output, our role is to guarantee that even after upgrading your
     host BuildStream installation 10 years from now, BuildStream
     will repeat the build in *exactly the same way*.

The goal of deterministic building is entirely central to the
BuildStream mission, and cannot be derailed by a desire to have things
just a little bit more convenient.

I'm sorry if I gave that impression, but at no point have I meant to say that bst should not be deterministic or repeatable. My point about cache keys is that for a given set of inputs we should have a consistent cache key and we should always get the same output, i.e. the build should be repeatable.

If plugins are not repeatable because they are written in python, then surely we should not have our plugins written in python? Plugins affect the repeatability in many ways, and if they themselves are not repeatable then surely this is a different issue and needs addressing urgently!

Once our plugins are repeatable, I don't see an issue with them adding to the sandbox through a strict API, so long as for a given input they create the same cache key and produce the same repeatable output.

In my mind these are two distinct issues and I do not understand why they need to be coupled:

* Plugins must be repeatable in all that they do, or bst cannot claim to be repeatable.

* A repeatable plugin should not have an issue with writing to the sandbox, so long as what is written is completely determined by the yaml in the project (element.bst + project.conf + dep.bst) and is thus repeatable and deterministic.
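
As a rough sketch of what "completely determined by the yaml" could mean in practice: the code below is not BuildStream's key calculation or plugin API. The functions canonical_bytes, cache_key and generate_file, and the sample configuration, are all hypothetical. It only shows a key and a generated file that both depend on nothing except the parsed configuration:

```python
import hashlib
import json


def canonical_bytes(config):
    """Serialize the config in a fixed form (sorted keys, fixed separators)."""
    return json.dumps(config, sort_keys=True, separators=(",", ":")).encode("utf-8")


def cache_key(config):
    """Hypothetical key: a SHA-256 digest of the canonical config bytes."""
    return hashlib.sha256(canonical_bytes(config)).hexdigest()


def generate_file(config):
    """Hypothetical generated content, written in sorted key order."""
    lines = ["{}={}".format(key, config[key]) for key in sorted(config)]
    return ("\n".join(lines) + "\n").encode("utf-8")


# The same configuration, loaded in two different orders.
config_a = {"prefix": "/usr", "strip": True}
config_b = {"strip": True, "prefix": "/usr"}

print(cache_key(config_a) == cache_key(config_b))
print(generate_file(config_a) == generate_file(config_b))
```

Whether code like this can be relied on to behave identically on every host python version, now and in the future, is exactly the point in dispute in this thread.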



So, please keep in mind: plugins are not there to create output. There
is no way we can make plugins trustable for creating output, because we
cannot be in control of the environment in which plugins run.

Cheers,
     -Tristan




