Re: [BuildStream] Protect against plugin modifications of artifacts

From: Tristan Van Berkom <tristan vanberkom codethink co uk>
To: William Salmon <will salmon codethink co uk>
Cc: dev buildstream apache org, buildstream-list gnome org
Subject: Re: [BuildStream] Protect against plugin modifications of artifacts
Date: Wed, 24 Jun 2020 17:19:06 +0900

Hi Will,

   So I looked at the OCI plugin for the first time today, fwiw[0]. I
think we can all agree that one cannot possibly expect this code to be
reproducible. It uses various host side constructs directly to create
artifact output, including delegation of complex operations to `tar`
and `gzip` on the host.
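
As a small illustration of why output composed with host tooling is hard
to trust (this is not taken from the OCI plugin): the gzip format stores
a modification time in its header, so Python's standard `gzip` module
can emit different bytes for identical input unless the timestamp is
pinned. A minimal sketch:

```python
import gzip

data = b"identical payload"

# Same input, different header timestamps: the compressed bytes differ.
first = gzip.compress(data, mtime=1)
second = gzip.compress(data, mtime=2)

# Pinning the timestamp makes two runs byte-identical on one host.
pinned_a = gzip.compress(data, mtime=0)
pinned_b = gzip.compress(data, mtime=0)
```

Pinning the timestamp only removes one variable; the compression library
on the host remains an input that nothing in the cache key accounts for.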

Again, this is only an example. Even if it were proven that every
possible implementation of `tar` and `gzip` used by the Python
libraries in question produced deterministic results, the policy at
the heart of the BuildStream design is to assume that tooling is not
deterministic, and that in order to produce deterministic build
results, one must at least consider the tooling as a part of the input
(i.e. guarantee that the exact binaries and configuration of the
tooling used to produce output are considered in that output's cache
key).
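
To make the "tooling as input" idea concrete, here is a minimal Python
sketch. It is not BuildStream's actual cache key code; the names
`cache_key` and `tool_digests` are invented for illustration. The only
point is that a changed tool binary changes the key even when the
element configuration and sources stay the same.

```python
import hashlib
import json


def digest(data: bytes) -> str:
    """Content hash standing in for the identity of a tool binary."""
    return hashlib.sha256(data).hexdigest()


def cache_key(element_config: dict, source_refs: list, tool_digests: dict) -> str:
    """Hypothetical key: hashes configuration, sources AND tooling together."""
    payload = {
        "config": element_config,
        "sources": source_refs,
        # The tools that will produce the output are inputs too.
        "tools": tool_digests,
    }
    # sort_keys gives a stable serialization regardless of dict ordering.
    encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


config = {"commands": ["tar -cf out.tar ."]}
sources = ["sha256:aaaa"]

key_a = cache_key(config, sources, {"tar": digest(b"tar build 1")})
key_b = cache_key(config, sources, {"tar": digest(b"tar build 2")})
key_c = cache_key(config, sources, {"tar": digest(b"tar build 1")})
```

Here `key_a` and `key_c` match while `key_b` differs, which is the
property a host-side `tar` cannot give us: its identity never enters
the key.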

Some plugins break this policy by composing output and placing it in
the sandbox manually, using host tooling. This policy is the
foundation on which we have built and earned the trust of our user
base.


I will now reply to your latest comments.

  "If plugins are not repeatable because they are write in python then 
   surely we should not have our plugins written in python? Plugins
   affect the repeatability in may ways and if they them selves are not
   repeatable then surely this is a different issue and needs
   addressing urgently!

   Once our plugins are repeatable then I don't see a issue with them 
   adding to the sandbox through a strict API. So long as for a given
   input they create the same cache key and they produce the same
   repeatable output."

While the gist of what you say here resonates strongly, the overall
statement is misguided in a variety of ways. I will patiently try to
enumerate these as clearly as I can.


  Core data flow design and trust model
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  Setting aside for a moment the fact that BuildStream offers a plugin
  API at all, or how the overall implementation works, let's look at a
  few of the basic core principles in play:

    * We take it as a plain fact that host tools are a variable, and as
      such we never trust these tools to produce output.

    * All data processing is performed within an isolated sandbox
      environment where we can guarantee the tooling is a constant.

    * In order to obtain external data, such as source code and initial
      base runtime libraries, we have no choice but to relinquish 
      control to host tooling.

      Even if we were to sandbox the tooling used to obtain external
      data for the build sandbox, we need to have networking and
      relinquish some measure of trust to external services.

      This is the weak link, and as such we perform as much validation
      as we can to ensure correct base inputs (Sources as we now know
      them in BuildStream).


  Plugin determinism and trustability
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  Determinism vs trustability for plugins is not a zero sum game,
  inasmuch as having non-deterministic plugins running code on the host
  does not detract from their trustability per se.

  For instance, the fact that BuildStream might build one element
  before another depending on system load, or if one log message
  appears before another non-deterministically, does not have any
  effect on how artifacts get constructed within the sandbox.

  A lot of this depends on what we trust plugins *for*, i.e. what kind
  of workloads we delegate to them.

  For BuildStream Elements, a simple summary of what tasks we delegate
  is:

    * Call BuildStream APIs to stage Sources and Artifacts to locations
      of the Element's choosing

    * Parse configuration

    * Run commands within the Sandbox

    * Generate a unique key describing its configuration, which must
      capture everything about how it will behave, for every way it
      can behave differently.

      NOTE: We delegate this responsibility to plugins _only_ because
            we are not using a declarative plugin description language.
            With a Turing complete language like Python, it is not in
            our power to derive such a unique key without actually
            running the code. Ideally this would not be so (more in the
            next section).

  To summarize this point: plugins not being deterministic in the way
  they execute code does not in itself mean that the outputs of the
  pipeline are non-deterministic.
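
  The delegated tasks listed above can be sketched as a toy class. This
  does not import BuildStream: `FakeSandbox` and `ToyElement` are
  stand-ins, and the method signatures are simplified assumptions that
  only loosely follow the real Element API.

```python
class FakeSandbox:
    """Stand-in for a sandbox: it only records the commands it is asked to run."""

    def __init__(self):
        self.commands = []

    def run(self, argv):
        self.commands.append(list(argv))
        return 0


class ToyElement:
    """Illustrative element: parse config, report a key, run sandbox commands."""

    def configure(self, node: dict) -> None:
        # Parse configuration; everything that alters behaviour is stored here.
        self.commands = list(node.get("commands", []))
        self.run_tests = bool(node.get("run-tests", False))

    def get_unique_key(self) -> dict:
        # Must cover every way this element can behave differently.
        return {"commands": self.commands, "run-tests": self.run_tests}

    def assemble(self, sandbox: FakeSandbox) -> None:
        # Output is produced only by commands run inside the sandbox,
        # never by host code writing files directly.
        for command in self.commands:
            sandbox.run(["sh", "-c", command])
        if self.run_tests:
            sandbox.run(["sh", "-c", "make check"])


element = ToyElement()
element.configure({"commands": ["make", "make install"], "run-tests": True})
sandbox = FakeSandbox()
element.assemble(sandbox)
```

  The order in which the host schedules such elements may vary, but the
  sandbox only ever sees the commands that the key already describes.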


  Why use Python for plugins ?
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  As I've outlined in the previous point: Not because it cannot be
  trusted to generate deterministic output on its own.

  However, I'm happy you raised this because it is an interesting
  point.

  Ideally, we should be using a completely declarative description of
  how a plugin works, rather than a Turing complete programming
  language which implements things in terms of running procedural code.

  If we had a declarative plugin description language instead:

    * We would have 100% control of the implementations of everything a
      plugin does; the chances of a plugin being buggy in such a
      scenario are slim to none (either the declaration of the plugin
      is loaded successfully, or an error message is issued at load
      time).

    * We could indeed allow for constructs where, for instance, static
      content or dynamic controlled content (like a %{variable}) could
      be placed into the sandbox as a given file.

    * Our capacity for long term support would be greatly increased
      (in the current climate, we may be forced to distribute a python3
      interpreter in the future, in order to support exactly the same
      plugins written today 10, 20 or 30 years from now).

    * As stated in the previous section, we would not have to trust
      plugins to implement `Plugin.get_unique_key()`, since parsing the
      plugin declaration would give the core all the knowledge it would
      need to compose such a key.
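
  A minimal sketch of that last point, assuming a made-up declaration
  schema (the fields `kind`, `config-keys` and `steps` are invented
  here, as are both function names): the core validates the
  declaration at load time and composes the key itself, without
  running any plugin code.

```python
import hashlib
import json

# Hypothetical declarative plugin description; the schema is invented here.
DECLARATION = {
    "kind": "toy-build",
    "config-keys": ["commands", "prefix"],
    "steps": [{"run": "%{commands}"}],
}


def load_declaration(declaration: dict) -> dict:
    """Either the declaration loads, or an error is raised at load time."""
    for field in ("kind", "config-keys", "steps"):
        if field not in declaration:
            raise ValueError(f"plugin declaration is missing '{field}'")
    return declaration


def derive_unique_key(declaration: dict, element_config: dict) -> str:
    """The core composes the key itself; no plugin code runs."""
    # Only configuration the declaration says it consumes affects the key.
    relevant = {k: element_config.get(k) for k in declaration["config-keys"]}
    payload = {"declaration": declaration, "config": relevant}
    encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


plugin = load_declaration(DECLARATION)
key_usr = derive_unique_key(plugin, {"commands": "make", "prefix": "/usr"})
key_opt = derive_unique_key(plugin, {"commands": "make", "prefix": "/opt"})
```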

  Of course, this would still preclude Turing complete code using host
  libraries and dynamically generating content to be placed in the
  sandbox (like the collect_manifest and oci plugins do).

  For a straight answer to the question "Why Python" in the first
  place: this was basically a matter of cost. Designing and
  implementing a declarative plugin language would have taken a lot
  more time, so we chose Python for its ease of use and ease of
  integration in order to meet immediate targets (something that works
  well enough in the first 6 months of development).

  If there is ever a BuildStream 3, its main driver might be to switch
  the plugin implementations into something declarative which is
  entirely under BuildStream core control.


With all of that out of the way, I'd like to separately reply to this:

  "A repeatable plugin should not have a issue with writing to the 
   sandbox. So long as what is written is completely determined by the
   yaml in the project (element.bst + project.conf + dep.bst) and is
   thus repeatable and deterministic."

So, to reuse a term which I came up with in the above text, for lack
of a better one: "dynamic controlled content" could perhaps simply be
a resolved %{variable} (which could contain a lot of text and act like
a template, with conditionally resolved variables substituted within).

If this is such an interesting feature to have, I don't see much reason
why we could not implement such a feature in BuildStream, even using
Python plugins. This could be an Element API which takes a Sandbox, a
"%{variable}" name and an absolute path, which could stage a resolved
variable as the content of a file in the sandbox safely.

It would be interesting to see a proposal for such a feature. I would
probably argue that the permissions used to stage such a file should
be very limited, or unspecified (something matching the hard coded
permissions used to stage files from Sources into the sandbox).
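
As a starting point for discussion only, such an API might look roughly
like the sketch below. Nothing here exists in BuildStream: the function
`stage_variable`, its arguments, the use of a plain directory as the
sandbox root, and the 0o644 mode are all assumptions made for the
example.

```python
import os
import re
import tempfile

_VARIABLE = re.compile(r"%\{([A-Za-z0-9_-]+)\}")


def resolve(text: str, variables: dict) -> str:
    """Substitute %{name} references in a single pass.

    Values are assumed to be fully resolved already; an unknown name
    raises KeyError.
    """
    return _VARIABLE.sub(lambda match: variables[match.group(1)], text)


def stage_variable(sandbox_root: str, variables: dict, name: str, path: str) -> str:
    """Hypothetical API: write one resolved variable to an absolute sandbox path."""
    if not os.path.isabs(path):
        raise ValueError("path must be absolute within the sandbox")
    root = os.path.realpath(sandbox_root)
    target = os.path.normpath(os.path.join(root, path.lstrip("/")))
    # Refuse paths that escape the sandbox root (e.g. via "..").
    if os.path.commonpath([root, target]) != root:
        raise ValueError("path escapes the sandbox")
    os.makedirs(os.path.dirname(target), exist_ok=True)
    content = resolve(variables[name], variables)
    with open(target, "w", encoding="utf-8") as handle:
        handle.write(content)
    # Fixed mode chosen by the core, not the plugin (0o644 is a placeholder).
    os.chmod(target, 0o644)
    return target


variables = {"prefix": "/usr", "greeting": "installed under %{prefix}\n"}
with tempfile.TemporaryDirectory() as root:
    staged = stage_variable(root, variables, "greeting", "/etc/motd")
    with open(staged, encoding="utf-8") as handle:
        staged_content = handle.read()
```

The plugin only names a variable and a destination; the content comes
entirely from project data that is already part of the cache key.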

That said, a controlled feature like this would be an extremely far cry
from allowing Python code to simply write whatever it wants into the
sandbox, and would not allow for the non-deterministic things which
existing plugins are currently doing by exploiting this weakness.

This took me a while to write today. I hope it has achieved its goal
of being as informative as I hoped it would be.

Best Regards,
    -Tristan

[0]: https://gitlab.com/BuildStream/bst-plugins-experimental/-/blob/master/src/bst_plugins_experimental/elements/oci.py
