Re: [BuildStream] Protect against plugin modifications of artifacts
- From: Tristan Van Berkom <tristan vanberkom codethink co uk>
- To: William Salmon <will salmon codethink co uk>
- Cc: dev buildstream apache org, buildstream-list gnome org
- Subject: Re: [BuildStream] Protect against plugin modifications of artifacts
- Date: Wed, 24 Jun 2020 17:19:06 +0900
Hi Will,
So I looked at the OCI plugin for the first time today, fwiw [0]. I
think that we can all agree that one cannot possibly have any
expectation for this code to really be reproducible. It uses
various host side constructs directly to create artifact output,
including delegation of complex operations to `tar` and `gzip` on the
host.
Again, this is only an example. Even if it were proven that every
possible implementation of `tar` and `gzip` used by the given Python
libraries produced deterministic results, the policy that
is at the heart of the BuildStream design is to assume that tooling is
not deterministic, and that in order to produce deterministic build
results, one must at least consider the tooling as a part of the input
(i.e. guarantee that the exact binaries and configuration of the
tooling used to produce output are considered in that output's cache
key).
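To illustrate what "tooling as a part of the input" means, here is a
minimal sketch. It is not how BuildStream computes cache keys; the
function name `cache_key` and its three arguments are made up for this
example. The only point is that the digest of the exact tool binaries
is hashed along with the configuration, so a different `tar` yields a
different key.

```python
import hashlib
import json


def cache_key(element_config, tool_digests, dependency_keys):
    """Derive a key from everything that may influence the output.

    All three arguments are hypothetical stand-ins, not BuildStream
    data structures: the element's resolved configuration, a mapping
    of tool name to the content digest of the exact binary available
    in the sandbox, and the keys of the element's dependencies.
    """
    payload = {
        "config": element_config,
        "tools": tool_digests,
        "dependencies": sorted(dependency_keys),
    }
    # sort_keys makes the serialization independent of dict ordering
    encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


config = {"commands": ["tar -cf out.tar ."]}
key_a = cache_key(config, {"tar": "sha256:aaa"}, ["dep1"])
key_b = cache_key(config, {"tar": "sha256:bbb"}, ["dep1"])
print(key_a != key_b)  # a different tar binary gives a different key
```

A plugin that runs `tar` from the host has no such digest to feed into
the key, which is exactly the problem.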
The fact that some plugins compose output and place it in the sandbox
manually, using host tooling, breaks this policy, a policy which is
the foundation on which we have built and earned the trust of our
user base.
I will now reply to your latest comments.
"If plugins are not repeatable because they are write in python then
surely we should not have our plugins written in python? Plugins
affect the repeatability in may ways and if they them selves are not
repeatable then surely this is a different issue and needs
addressing urgently!
Once our plugins are repeatable then I don't see a issue with them
adding to the sandbox through a strict API. So long as for a given
input they create the same cache key and they produce the same
repeatable output."
While the gist of what you say here resonates strongly, the overall
statement is misguided in a variety of ways. I will patiently try to
enumerate these as clearly as I can.
Core data flow design and trust model
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Setting aside for a moment the fact that BuildStream offers a plugin
API at all, or how the overall implementation works, let's look at a
few of the basic core principles in play:
* We take it as a plain fact that host tools are a variable, and as
such we never trust these tools to produce output.
* All data processing is performed within an isolated sandbox
environment where we can guarantee the tooling is a constant.
* In order to obtain external data, such as source code and initial
base runtime libraries, we have no choice but to relinquish
control to host tooling.
Even if we were to sandbox the tooling used to obtain external
data for the build sandbox, we need to have networking and
relinquish some measure of trust to external services.
This is the weak link, and as such we perform as much validation
as we can to ensure correct base inputs (Sources as we now know
them in BuildStream).
Plugin determinism and trustability
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Determinism vs trustability for plugins is not a zero sum game,
inasmuch as having non-deterministic plugins running code on the host
does not detract from their trustability per se.
For instance, the fact that BuildStream might build one element
before another depending on system load, or if one log message
appears before another non-deterministically, does not have any
effect on how artifacts get constructed within the sandbox.
A lot of this depends on what we trust plugins *for*, i.e. what kind
of workloads we delegate to them.
For BuildStream Elements, a simple summary of what tasks we delegate
is:
* Call BuildStream APIs to stage Sources and Artifacts to locations
of the Element's choosing
* Parse configuration
* Run commands within the Sandbox
* Generate a unique key describing its configuration, which must
capture everything about how it will behave, for every way it
can behave differently.
NOTE: We delegate this responsibility to plugins _only_ because
we are not using a declarative plugin description language.
With a Turing complete language like Python, it is not in
our power to derive such a unique key without actually
running the code. Ideally this would not be so (more in the
next section).
To summarize this point: the fact that plugins are not deterministic
in the way they execute code does not in itself mean that the outputs
of the pipeline are non-deterministic.
Why use Python for plugins ?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
As I've outlined in the previous point: Not because it cannot be
trusted to generate deterministic output on its own.
However, I'm happy you raised this because it is an interesting
point.
Ideally, we should be using a completely declarative description of
how a plugin works, rather than a Turing complete programming
language which implements things in terms of running procedural code.
If we had a declarative plugin description language instead:
* We would have 100% control of the implementations of everything a
plugin does. The chances of a plugin being buggy in such a
scenario are slim to none (either the declaration of the plugin
is loaded successfully, or an error message is issued at load
time).
* We could indeed allow for constructs where for instance, static
content, or dynamic controlled content (like a %{variable}) could
be placed into the sandbox as a given file.
* Our capacity for long term support would be greatly increased
(in the current climate, we may be forced to distribute a python3
interpreter in the future, in order to support exactly the same
plugins written today, 10, 20 or 30 years from now).
* As stated in the previous section, we would not have to trust
plugins to implement `Plugin.get_unique_key()`, since parsing the
plugin declaration would give the core all the knowledge it would
need to compose such a key.
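As a rough sketch of that last point: no such declarative language
exists in BuildStream, so the `DECLARATION` format and the `resolve()`
and `key_from_declaration()` helpers below are invented purely for
illustration. The core reads the declaration, resolves the
%{variable} references itself, and derives the key from the result,
without running any plugin code.

```python
import hashlib
import json
import re

# Hypothetical declarative plugin description (not a real format)
DECLARATION = {
    "kind": "example",
    "commands": ["make PREFIX=%{prefix}", "make install"],
}


def resolve(text, variables):
    """Substitute every %{name} in text; raises KeyError if undefined."""
    return re.sub(r"%\{([^}]+)\}", lambda m: variables[m.group(1)], text)


def key_from_declaration(declaration, variables):
    """Compose a unique key from the declaration alone.

    Everything the plugin can do is visible in the declaration, so
    the core does not need to trust the plugin to describe itself.
    """
    resolved = [resolve(cmd, variables) for cmd in declaration["commands"]]
    payload = {"kind": declaration["kind"], "commands": resolved}
    encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


key_usr = key_from_declaration(DECLARATION, {"prefix": "/usr"})
key_opt = key_from_declaration(DECLARATION, {"prefix": "/opt"})
print(key_usr != key_opt)  # changing a variable changes the key
```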
Of course, this would still preclude Turing complete code using host
libraries and dynamically generating content to be placed in the
sandbox (like the collect_manifest and oci plugins do).
For a straight answer to the question "Why Python" in the first
place: this was basically a matter of cost. The design and
implementation of a declarative language would have taken a lot more
time, so we chose Python for its ease of use and ease of integration
in order to meet immediate targets (something that works well enough
in the first 6 months of development).
If there is ever a BuildStream 3, its main driver might be to switch
the plugin implementations into something declarative which is
entirely under BuildStream core control.
With all of that out of the way, I'd like to separately reply to this:
"A repeatable plugin should not have a issue with writing to the
sandbox. So long as what is written is completely determined by the
yaml in the project (element.bst + project.conf + dep.bst) and is
thus repeatable and deterministic."
So, to reuse a term which I came up with in the above text,
"dynamic controlled content" for lack of a better term, could perhaps
simply be a resolved %{variable} (which could contain a lot of text and
act like a template, with conditionally resolved variables substituted
within).
If this is such an interesting feature to have, I don't see much reason
why we could not implement such a feature in BuildStream, even using
Python plugins. This could be an Element API which takes a Sandbox, a
"%{variable}" name and an absolute path, which could stage a resolved
variable as the content of a file in the sandbox safely.
It would be interesting to see a proposal for such a feature. I would
probably argue that the permissions used to stage such a file should
be very limited, or unspecified (something matching the hard coded
permissions used to stage files from Sources into the sandbox).
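To make the idea a bit more concrete, here is a minimal sketch of what
such a helper might look like. This is not an existing BuildStream
API: `stage_variable()`, the plain directory standing in for the
sandbox, and the 0o644 mode are all assumptions made for this example.

```python
import os

# Assumption: one hard coded mode, the caller cannot choose permissions
FIXED_MODE = 0o644


def stage_variable(sandbox_root, variables, name, path):
    """Write the resolved value of variable `name` to `path` in the sandbox.

    `sandbox_root` is a plain directory standing in for a sandbox, and
    `variables` is a dict of already resolved variables. The content
    written is fully determined by the variable's value.
    """
    if not os.path.isabs(path):
        raise ValueError("path must be absolute: {}".format(path))
    root = os.path.realpath(sandbox_root)
    target = os.path.realpath(os.path.join(root, path.lstrip("/")))
    # Refuse paths which would land outside of the sandbox root
    if os.path.commonpath([root, target]) != root:
        raise ValueError("path escapes the sandbox: {}".format(path))
    os.makedirs(os.path.dirname(target), exist_ok=True)
    with open(target, "w", encoding="utf-8") as f:
        f.write(variables[name])
    os.chmod(target, FIXED_MODE)
    return target
```

The plugin only gets to name a variable and a destination; it never
hands arbitrary bytes to the sandbox, so everything which ends up in
the file is already covered by the element's configuration.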
That said, a controlled feature like this would be an extremely far cry
from allowing Python code to simply write whatever it wants into the
sandbox, and would not allow for the non-deterministic things which
existing plugins are currently doing by exploiting this weakness.
This took me a while to write today, I hope it has achieved its goal
of being as informative as I hoped it would be.
Best Regards,
-Tristan
[0]:
https://gitlab.com/BuildStream/bst-plugins-experimental/-/blob/master/src/bst_plugins_experimental/elements/oci.py