Re: [BuildStream] [Summary] Plugin fragmentation / Treating Plugins as Sources



Hi Ben,

On Thu, 2019-04-25 at 17:45 +0000, Benjamin Schubert wrote:
Hey everyone,

Thanks for the summary Tristan!

Given the length and number of responses, here are a few snippets of answers.

But first TLDR:

Thank you also for taking the time and for summarizing your reply!


I will start this off by also summarizing my reply:

 - I mostly agree with Ben's TLDR, and in the places where I do not
   agree, I think that disagreement doesn't need to cause conflict,
   and everyone can have what they want in some way.

 - I am not against keeping the existing `pip` origin around.

   There is a use case for system-wide plugin collection installations.

 - I disagree that plugins should have external dependencies of any
   kind for anything other than downloading/tracking sources.

   While basically everyone in the room strongly disagrees with me on
   this point, it need not be a point of contention (i.e. we can still
   keep the `pip` origin around anyway, provided we are honest about
   any caveats); still, it is worth pointing out and is detailed
   further in this message.

 - I am searching for a way to fragment our core upstream plugins in
   such a way that BuildStream conveniently downloads them, removing
   any need to argue about whether, for example, the important ostree
   plugin can live in a single repo with the rest of the blessed core
   upstream plugins or not.

   While I think there are good benefits to having projects declare
   their own plugins at specific versions, and this solves a lot of
   other problems, the above is really the main driver of the
   proposal.

 - This email also includes a possible alternative to `git` origin
   and `venv` origin, which is somewhere in between the two.

   In this scenario, we could keep many of the advantages highlighted
   about the `venv` proposal, without actually doing any `venv`
   management or automating the installation of any third party python
   dependencies.



---- Inline replies ----

- I don't disagree with adding git sources for plugins.
  I don't however see any value in it; what's the difference between
  git submodules and the local origin?

The advantages are that:

  * People loathe git submodules with such a passion that I'm unable to
    get people to use them. Having BuildStream fetch the modules for
    you avoids any hassle with initializing submodules.

  * Having the git source in project.conf couples the precise plugins
    you are going to use to the project data, rather than to whichever
    VCS you happen to use.

    Without this, we cannot easily consider the exact plugin version
    in cache keys. That is useful for plugins which do not themselves
    provide a stable API, a freedom I think is nice to offer users who
    casually develop their own plugins, and which was requested here:

        https://gitlab.com/BuildStream/buildstream/issues/953

  * Similar to the above, we have use cases where people do not
    revision the `project.refs` file but only revision the
    `junction.refs` file, and do continuous integration builds
    with `bst build --track-all <targets>`.

    When we hit a release, it is typical for us to export a tarball
    of the repo. In this case we don't include the git metadata, but
    we still want to export the exact project state for which the
    release was made (which can include the plugins which were in use).
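
As an aside, I imagine a `git` origin declaration in project.conf
could look something like the following (the syntax, repository URL
and ref here are entirely hypothetical, nothing has been decided):

    plugins:
    - origin: git
      url: https://gitlab.com/BuildStream/bst-plugins-ostree.git
      ref: 1.2.0
      sources:
        ostree: 0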


- I don't disagree with trying a venv implementation for isolating plugins.
  I'm however very worried about it, and think it should be avoided, as the
  result would be worse to maintain than having compatibility layers in some
  plugins.

I am of a similar mind, yes.

- I disagree with removing the pip source. Software has to be packageable and
  platform integrators should be able to expect software not to connect randomly
  to the internet to fetch things at runtime.

I'm not really opposed to keeping the pip source around; if a set of
plugins has an unstable API, it can still be used safely in controlled
environments where you manage all projects in the supply chain.

It is really fine even for general public usage, so long as those
plugins have a stable API and depend on API-stable libraries.

While there is no real reason to go back and remove the `pip` origin,
what we've seen so far is that people use it for plugins with unstable
APIs in interdependent projects managed by different groups, which
sort of defeats the purpose and adds friction on junction ref updates
where there really should be no friction.

[...]
- I disagree that plugin incompatibilities with different third party libraries
  are a problem. Most Linux distributions include a _lot_ of python software and
  the tools _do_ work all together with minimal friction. Yes, it might mean a
  bit of compatibility layering, but it's far easier than hacking something
  around venvs, or preventing third party libraries.

You clearly have more knowledge of the python library situation than I
do; from my perspective I am getting mixed messages.

On the one hand, I have people saying one should always use a venv, and
on the other hand I have people saying that there is a lot of API
stable python software we can rely on.

In the end, as long as I can still have my pip source and _prevent_ BuildStream
from looking for plugins by itself, I'm ok.

I think that this does not need any special configuration or switch;
the project.conf is already responsible for declaring what plugins it
wants to load and from where. If one does not want to have BuildStream
download its plugins, then one simply need not declare plugin origins
other than `pip` in their project.conf.
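
For reference, such a `pip`-only declaration looks roughly like this
today (the package and plugin names here are only examples):

    plugins:
    - origin: pip
      package-name: BuildStream-external
      sources:
        docker: 0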


I am however curious about the aversion to allowing BuildStream to
fetch the plugins for you in your scenario - do you see any potential
security risk here?

I would assume that if BuildStream downloaded the plugins for you, they
would come from the same location as you download other source code
from, so I'm not sure how this would break builds in an isolated and
controlled build network?


===== Other responses =====


Thu, 18 Apr 2019 20:05:37 +0900, Tristan Van Berkom wrote: [0]

  * Avoids any need for distro packaging of plugin files
  * Avoids project users needing to know what plugins a project
    requires and needing to explicitly install those plugins
    in any way


Avoid? I can live with that.
However, it should still be possible to configure BuildStream to NEVER
try to fetch plugins by itself; in some constrained environments this
would not be welcome. We should therefore make sure that we can still
prevent BuildStream from fetching any of its dependencies by itself.

As mentioned above, everything is always explicit, I would also be
against BuildStream doing things implicitly which I did not tell it to
do.

Moreover, python packages and distribution packages are two different things.
There is no inherent need to package all python packages in a distro.
As far as I know, most distributions only package a tiny subset of all
python packages.

Right, but keep in mind that the goal here is to allow maximal
fragmentation of our plugins in order to avoid friction - our distro
maintainer contributors have complained that fragmenting our upstream
set of plugins into a handful of separate repositories unnecessarily
increases their workload, and they also remind us that this
fragmentation makes BuildStream less useful "out of the box", because
users would then need to know what packages to install for whichever
projects they work on.

In this light, as you are against using a `git` origin in your own
projects, I worry that you will also be against fragmenting the
upstream plugins into separate git repositories and maintaining our
core upstream plugins as such, which kind of invalidates this as a
solution for distributing our blessed plugins.

So I wonder, is there some middle ground we might have here?

There are two approaches I can think of:

* Maintain our upstream plugin packages in such a way that they *can*
  be installed with pip, but still allow projects to access these
  very same plugins with `git` origin as well.

  At the same time, we would discourage any distro-level packaging
  of any plugins; so if you *want* to use these without the git source,
  you are rather forced to `pip install` them in your python
  environment.
 
* Similar but different to the `venv` counter proposal, we could
  have BuildStream download packages but *not* install them into
  venvs.

  We could extract the plugins from the "wheel" and load them as we do
  for the regular "local" origin, but we could also extract the
  requirements from the downloaded packages and automatically assert
  whether those requirements are present, erroring out with a helpful
  message instructing the user to `pip install ...` the list of
  requirements extracted from the downloaded packages.

  This would also allow usage of the same plugins via both origins,
  but would not require that plugins `except ImportError` at
  Plugin.preflight() time.

  This would also have the advantage of allowing people to distribute
  their plugin with any pip supported URL (PyPI/git/bzr...), not to
  mention it would definitely be safe, and a lot less work compared
  to the venv proposal.
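
To sketch the "assert, but do not install" step of this second
approach, it could look something like the following (the function is
mine and purely illustrative, and I assume the Requires-Dist entries
have already been parsed out of the wheel's METADATA):

    import pkg_resources

    def assert_plugin_requirements(requirements):
        # Check each requirement string (e.g. "requests >= 1.2")
        # against the host environment, without installing anything
        missing = []
        for req in requirements:
            try:
                pkg_resources.require(req)
            except (pkg_resources.DistributionNotFound,
                    pkg_resources.VersionConflict):
                missing.append(req)
        if missing:
            # A real implementation would raise a proper BuildStream
            # error; this is only to show the intent
            raise RuntimeError(
                "Missing plugin dependencies, please run:\n"
                "    pip install " +
                " ".join("'{}'".format(r) for r in missing))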

As Sam Thursfield pointed out[6], the dependency on the `requests`
library could instead be satisfied by BuildStream itself providing a
more efficient API for fetching files, so BuildStream would instead
depend on the `requests` library, and have the freedom to redefine how
this is implemented in the future (I quite like this idea, in fact).

I do not like this idea; that would mean adding cruft to the
BuildStream core just because one plugin might need it. If we push that
further, we would end up with git, svn, bazaar, requests, and many
other things either exposed or reimplemented here.

Moreover, if we wanted to do that correctly, BuildStream would need to
know about anything that a plugin might want to do, and thus would
render the use of plugins pointless, since we could just do it in the
core. Plugins are there to allow adding more capabilities to
BuildStream.

I disagree here.

What plugins do is essentially:

* Provide YAML API for elements (BuildElement/ScriptElement subclasses)
* Access filesystem paths to move files into a sandbox or around inside
  a sandbox (compose elements and similar)
* Instruct the Sandbox to run commands
* Share metadata amongst elements via public data, which might
  inform what commands are run or what files are moved to what
  locations
* Download sources
* Track new versions of sources

If a plugin is tampering with the inputs or outputs in any way other
than just moving them around (like generating reports or PNG files?),
then it violates the rule that all filesystem data permutations happen
in a controlled sandboxed environment, and we leave ourselves open to
host contamination.

Within the strict constraints of things that plugins should be doing,
there is still no way we can predict exactly what a plugin will decide
to do, hence plugins are really useful in this way.

Of course this leaves downloading/tracking of sources, but this is
really the only thing I can imagine, and we can use host tools for this
(and we can even make that more robust by sandboxing those tools).
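
To illustrate, a source plugin in this spirit needs no third party
imports at all. Something like the following sketch (loosely modeled
on the current Source plugin API; the class name is mine, the details
are illustrative, and the other mandatory methods are omitted):

    import os
    from buildstream import Source, utils

    class GitHostToolSource(Source):
        """Illustrative source which shells out to the host `git`
           rather than importing a third party python library"""

        def configure(self, node):
            self.url = self.node_get_member(node, str, 'url')
            self.ref = self.node_get_member(node, str, 'ref', None)

        def preflight(self):
            # The only "dependency" is a host tool, asserted up front
            self.host_git = utils.get_host_tool('git')

        def get_unique_key(self):
            return [self.url, self.ref]

        def fetch(self):
            # Downloading is delegated entirely to the host tool (which
            # could even run inside a network enabled sandbox)
            mirror = os.path.join(self.get_mirror_directory(),
                                  utils.url_directory_name(self.url))
            if not os.path.exists(mirror):
                self.call([self.host_git, 'clone', '--mirror', '-n',
                           self.url, mirror],
                          fail="Failed to clone {}".format(self.url))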

I think plugins could even have been implemented as an intentionally
strictly limited custom scripting language, giving us greater assurance
that nothing nondeterministic can occur, and the situation would be
even more robust; we could even safely port to Python 4 without any API
break at all.

Anyway, I think that we have very differing views on this but that we
can safely agree to disagree, as these differing viewpoints don't
necessarily need to cause conflict.

[...]
Except for the libc API (which had some breakages), can you give me a single
piece of open source software that never broke an API? This issue is true for
all software. Many good projects try to keep things compatible. And python is
no stranger to that. If you look at requests, which was mentioned multiple
times here, OpenStack has a dependency on requests >= 1.2.6... more than 20
releases without breaking the API, over many years. The question here is
choosing wisely the packages that do care about that.

I'll refer to my answer to your TLDR here; I am getting mixed messages
about whether or not python libraries intend to be API stable.

but in the current landscape we don't really know this anymore, as everyone
is now buying into the "vendoring" approach

Let's take a few big python packages:
- OpenStack doesn't vendor
- Ansible doesn't vendor (And gosh they have so many plugins and dependencies!)
- SaltStack doesn't vendor (Again so many plugins and python dependencies)
- Django doesn't vendor (1k+ plugins on https://djangopackages.org)

I don't know of many Python packages that vendor. The only one I actually
know of is Requests, which vendors urllib3, for a very specific reason: with
Python releases being 18 months apart, a vulnerability in urllib3 might take
up to 18 months to be patched, and the decision was discussed heavily before
vendoring.

My mistake here was to use the word vendoring; I rather borrowed that
term from the very limited things I learned about building rust
applications (we had to "vendor" the specific versions required by apps
into the build environment in order to build with cargo/rust without
internet access, a process which appears very similar to "venving" to
me).

What I essentially mean is that if everyone is trusting venvs and using
that approach to pin dependency versions (what I meant by "vendoring"),
that leads me to believe there is less intent to maintain stable APIs
in python libraries (and perhaps that intent is evaporating).

[...]
Yes, and plugins that are heavily used will very likely have to have some
compatibility layer. I don't see why this would be different with python
dependencies.
Moreover, there is no inherent reason for a CLI to be more stable than a
Python API, and thus using a CLI will not buy us anything (let's remember
the numerous git bugs found when we added a CentOS runner?).

You appear to have a different perception, and I trust that your
knowledge of the situation is deeper than mine.

Honestly, the reason for this perception is that people insist on using
venvs more and more, which to me sends a message that APIs don't need
to be stable (otherwise why the need for a venv in the first place?).

The fact that BuildStream depends on ruamel, which explicitly intends
to break API without creating a "ruamel2", is also a signal that this
kind of breakage is considered acceptable by a large portion of python
developers.

See the note about pinning ruamel: https://pypi.org/project/ruamel.yaml/

If we had one plugin needing the stable "ruamel" API and another plugin
needing features from the new API of "ruamel", and both import the same
"ruamel" module, then we have a plugin conflict already.

So it seems that we have a situation where one class of libraries is
safe (which includes the "requests" library), and another class of
libraries is not (which includes "ruamel.yaml"), and it is very hard
for me to tell what the overall story is here, or how to determine
which libraries have the intention of being stable.

This approach also allows projects to declare a sysroot which
has the tools *they* need, without imposing one on downstream
junctioning projects.

Allows projects yes, but plugins will still need to be compatible. So that
would not solve any problem, or am I missing something?

Essentially I am arguing that the only reason to import an external
python library in a BuildStream plugin is to download source code, so
*if* we limit ourselves to host tools only for this purpose, we really
have no need at all for external python libraries, and we can more
easily guarantee robust operation by providing those host tools in a
sandbox with networking enabled; plugins could then never conflict at
all.

I don't think it is unreasonable to have some plugins be
incompatible with each other; we should strive to avoid this, but do
we really want to make it our responsibility to make this
impossible?

I disagree. We should make sure that all _blessed_ plugins are compatible.
We should _not_ care about the rest. Does the kernel care about API breakage
for third party modules? Not at all. Why would we here?

I am happy with a statement that we only ensure that blessed plugins
are always compatible, and that there is no way to use a project with
incompatible upstream blessed plugins (this is a bit difficult without
a monorepo, though). We would then only inform external plugin
developers that they should never import external python libraries if
they want absolute certainty that their plugins will always work and
never cause dependent projects to break.

But this robustness for the plugins we do not maintain is absolutely
crucial to the whole plan.

We are building an ecosystem of projects which depend on each other,
are maintained by separate organizations, and are allowed to write and
maintain their own plugins for whatever reason, and we are at least
trying to tell them that it will keep working, hopefully for decades.

I do not expect to maintain everyone's plugins in a monorepo like Linux
basically does.

I expect projects to write their own plugins for their own purposes (as
people have been doing so far, actually), which are fairly project
specific (like freedesktop-sdk's flatpak manifest generator), and I
don't have any interest in maintaining any of that domain specific
stuff upstream; I don't think there is any need for that.

Because I expect plugins to always work without exception, to the best
of our abilities, and for different projects with different sets of
plugins to integrate seamlessly together, where the only API bridge
between projects is the element names, project options, and the binary
data and public data extracted from cross-project element dependencies.

That is why we have a very stable and strictly defined public API
surface for plugins, and that is why we load plugin modules into
separate namespaces on a per project basis, ensuring they never clash.

Now if we add to this only stable python libraries as dependencies,
ensuring that plugins depend only on python libraries with minimal
version bounds (e.g. ">= 1.2"), then this all still works just
fine, and there is no need to worry about conflicting plugin
dependencies.
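
Concretely, such a plugin package would declare something like this in
its setup.py (the package name is hypothetical):

    from setuptools import setup

    setup(
        name='bst-plugins-example',
        packages=['bst_plugins_example'],
        install_requires=[
            'requests >= 1.2',  # API stable, minimal lower bound only
        ],
    )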

If we don't outright ban the imports of external python libraries, the
least we can do for external plugins is to clearly advise about the
risks attached and the possibility of conflicting dependencies with
upstream or downstream projects who also write plugins.

I don't really understand why this should be such a big issue. I
understand the *desire* to have that freedom, but I don't see how
having that freedom buys us much given what plugins actually do,
especially if any reliability and stability needs to be sacrificed on
the altar of developer freedom and convenience (that doesn't add up to
me at all).

Having to reinvent things that exist, are well tested and are relied upon is
more than an inconvenience; it is of questionable value, and likely requires
investment to get to a similar level of quality to what already exists. On
top of this we would also lose an important aspect, which is familiarity with
well known packages and APIs.
Yes, I want to be able to use external tools and libraries when writing
plugins. No, I don't want to have to rewrite git in python.

Right, but for git, as with most things, we use host tools, not python
libraries, even now.

I mean, I agree with not reinventing things that exist, but do we
really have significant examples of plugin activities which truly
require an external python dependency that could not be satisfied with
a host tool (which could eventually be even more reliable, if we
provide a project-declared deterministic sandbox where those host tools
can run)?

If people want to develop their own plugins which do depend on 
unstable third party python libraries, we should certainly advise them
of the risks attached to this.

On Tue, 23 Apr 2019 20:01:34 +0100, Thomas Coldrick wrote: [2]

That said, it looks like this would add a great deal of complexity to
BuildStream internals, and perhaps be built on barely-maintainable
hacks, if it's possible at all.
I think that unless the venv proposal can be done cleanly, then using a
remote-type source is preferable. There are, however, some caveats to
this.

I completely agree.

Maybe if it is not going to be `git`, it should be the alternative I
suggested above in this mail: use packages but do not install them into
any venv, and just error out with a clear message if the dependencies
are missing.

Cheers,
    -Tristan


