Re: [BuildStream] [Proposal] Plugin fragmentation / Treating Plugins as Sources



Hi again Chandan,

I'm going to put my summary at the top instead of the end, there are a
lot of replies inline, not for the faint of heart ;-)

Summary
~~~~~~~
I am not entirely against this venv idea; the way that you initially
presented it, however, seems to ignore the non-trivial detail of how it
might actually work - which remains unclear.

I also think that it is a bit overkill; we would be requiring:

  * Python packaging metadata which we really don't require at
    a plugin level (the `pip` origin requires it, but I think even
    that was done wrong, as explained in my last email).

  * Probably a lot of work to force this venv origin to work at
    all, unless Chandan knows of a magical solution that I am
    unaware of.

I don't really think that all of the extra work forcing this approach
to work is justified in order to satisfy the use case of a few edge
case plugins which happen to import external python libraries. We could
more easily say that plugins should not import external python
libraries, without much negative impact.

That said, I should point out that nobody else is offering to actually
do the work; I am willing to go even as far as implementing this venv
thing myself - but not if it means that I have to invent from scratch
some crazy thing which allows multiple venvs to work in the same
environment (that has the potential to tack an extra week or more onto
my implementation plan without even any guarantee that it is possible
at all, while my proposal will already take me somewhere along the
lines of 1 or 2 weeks).


On Sat, 2019-04-13 at 20:01 +0100, Chandan Singh wrote:
Hi Tristan,

On Fri, Apr 12, 2019 at 8:36 AM Tristan Van Berkom
<tristan vanberkom codethink co uk> wrote:


I think this is probably a stronger statement than we need. I'd have
thought it'd be something like "Plugins don't need to be packaged". Or
are you suggesting to remove the `pip` origin for plugins entirely?

I am considering proposing that, but this is very orthogonal to the
proposal at hand.

What I *am* proposing however, is that the upstream plugins which we
maintain in the BuildStream gitlab group as "blessed", be distributed
via the `git` origin. This is mostly because I would much prefer a
scenario where blessed plugins either all live in the same repository
or they live in very fragmented separated repositories, over any ad hoc
middle ground.

I am afraid I do not agree that we should be using Git as the preferred
distribution mechanism. Or, at least, not the only one. This is mainly because
of operational reasons.

First, Git is an inefficient way to transfer such plugins as we do not need the
history at all. Since we just need the content, transferring the `.git`
directory will be completely superfluous. Any kind of archive format will
be better than Git in terms of performance.

Except that our SourceCache solution already solves this aspect for us.

This is a good argument however for ensuring that junctions and a
potential `git` plugin origin *do* use SourceCache.

Second, using Git as a distribution mechanism also raises scalability concerns
since Git (and Git servers too) is not designed for that use case. To see an
example of this, one does not have to look further than the CocoaPods disaster
of 2016 [1].

With BuildStream, we either download all the sources from their git
servers, or we hit the SourceCache - doing the same for plugins cannot
be more intense than doing it for source code; in any case, plugins are
really just files which we need to obtain in order to build.

Yes, a lot of the world is currently built from tarballs, but there is
an increasing desire and trend from upstreams to eliminate the
generation of tarballs altogether from the process, and have software
buildable directly from git, marking releases only with tags.

Even if we were to settle on Git as the distribution mechanism, I think the
current proposal for `git` origin is basically reinventing Git submodules. I
think it will be superfluous to add such a `git` origin as everything that it
offers can already be done with Git submodules (or their counterparts in other
VCS frameworks) and the `local` origin, whereby one stores external plugins as
submodules. Won't `bst plugin fetch` end up being the same thing as `git
submodule update`?

We recommended that for bst-external, but people just didn't do it
(potentially because we misguided them by providing the `pip` origin at
all in the first place).


Ok so here is my big issue: We're closing in on a time where people
want to yank out plugins from BuildStream.

Since our downstreams have already spoken and accurately pointed out
that fragmenting plugins into separate repositories is only going to
cause pain for all involved, I'm looking for an alternative.

I.e. it means that we cause pain for downstream packager volunteers who
would have to maintain a half dozen packages they didn't sign up for,
and more importantly it means that anyone who wants to build a
BuildStream project would need to themselves install the separate
plugin repos (either via the distro or with pip, either way) which the
projects they need to build require (which means that knowledge of
which plugins to install needs to be communicated to those who run
BuildStream).

So I think either the plugins need to be centralized, or the
BuildStream project.conf needs to record the plugins in such a way that
BuildStream can go and fetch them automatically - this reduces the pain
of requiring a plugin to an edit in the project.conf.

  A) If we ensure that there is only `bst-plugins-good` plus whatever
     local plugins a project uses, then we don't have this problem.

     But then there is not much point in moving the plugins outside
     of BuildStream at all.

     I am really fine with this, but due to the (completely irrational)
     desire I've seen around wanting to strip out the important ostree
     plugin from the core even in advance of other plugins, I worry
     that trying to maintain the core plugins in a single repo is only
     going to cause internal friction and arguments about what should
     be included or not.

  B) If we fragment plugins maximally, as stated above, we need
     BuildStream to automate the operation of obtaining them (a
     sketch of what this could look like follows below).
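
For illustration, option B could amount to something along these lines
in project.conf - a rough sketch only, since the exact shape of a `git`
origin has not been designed yet, and all of the keys below are
hypothetical:

    plugins:
    - origin: git
      url: https://gitlab.com/BuildStream/bst-plugins-good.git
      ref: 1.2.0
      sources:
      - ostree
      elements:
      - flatpak_image

With something along these lines recorded in the project, `bst plugin
fetch` would have everything it needs to obtain the plugins, through
SourceCache like any other source.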


<snip>


There are some problems with this though, let me try to sort these out
with some responses to the statements you have made:

  * Plugins have dependencies.

    This is true, but plugins have always from the very beginning
    only been a python file with an optional matching yaml file
    with the same basename.

    It is the responsibility of Plugin.preflight() to report what
    system dependencies are required by the plugin at startup time,
    and the documenting of dependencies we've been doing is mostly
    a formality.

    Note that BuildStream itself does not hard require *any*
    of the dependencies required by the plugins BuildStream
    installs.

I don't think this is an accurate representation of the current situation.
Take bst-external for example: it depends on the `requests` package and,
as such, defines it as one of its dependencies in its setup.py [2].

BuildStream does not require *at all* that bst-external behave in this
way.

This means
that when one runs `pip install buildstream-external`, they don't have to
install `requests` manually. I think forcing people to do that will be an
inferior user experience compared to what we have currently.

I recognize this, but we have never required that plugin repositories
behave like this.

A more relevant reply is that external python library dependencies are
clearly the edge case, and it is the distro packaging which takes care
of most of the burden of obtaining the majority of dependencies for a
collection like bst-external (so installing it on debian might
automatically install git, bzr, ostree etc).

I.e. since we know for sure that source plugins will always have *some*
need for host dependencies, it doesn't make sense to treat the python
installer as if it were something that solves the problem of installing
dependencies - we have Plugin.preflight() because we know there will be
cases where a plugin file is installed but not yet ready to function.


Note that BuildStream's setup.py does not require any of the
dependencies for any of its plugins; these are all soft dependencies
which can be detected at runtime.

As it is done in BuildStream core, we should be able to achieve the
same with plugins.
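
For example, this is roughly how a source plugin is expected to declare
a host tool requirement - a minimal sketch, the plugin itself is
hypothetical:

    from buildstream import Source
    from buildstream import utils

    class ExampleSource(Source):

        def preflight(self):
            # Resolve the host tool at startup time; this raises
            # ProgramNotFoundError if `git` is missing, reporting
            # the dependency to the user up front
            self.host_git = utils.get_host_tool('git')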


As a side note, the few plugins which blindly assume their external
python dependencies are present by naively importing at the top level
are buggy; they really should be:
  * Importing these dependencies in functions, on demand.
  * Importing in a try/except block in Plugin.preflight(), indicating
    the missing external python dependency at initialization time.

That is how our plugins are supposed to work, most plugins follow the
rule, but the few plugins which import external python libraries break
this rule (we usually don't notice this though, which is probably why
it has not yet been fixed).
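
Concretely, a well behaved plugin with an external python library
dependency should look something like this (a sketch, using `requests`
as the example library):

    from buildstream import Source, SourceError

    class ExampleSource(Source):

        def preflight(self):
            # Surface the missing python library at initialization
            # time with a clear error, rather than crashing with an
            # ImportError at module load time
            try:
                import requests  # noqa
            except ImportError:
                raise SourceError("{}: requests is not installed".format(self))

        def fetch(self):
            # Import on demand, only where it is actually used
            import requests
            ...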

If we split up the current Core plugins into separate repositories, I'd have
expected that each plugin will declare its own dependencies in its setup.py.

Most dependencies cannot be installed via python anyway; requiring a
setup.py and a pythonic package structure for all plugins just for the
sake of a few outliers seems to be pretty overkill.


  * Plugins have external python library dependencies

    It was naive of me to ever think that it could be safe for
    a BuildStream plugin to import anything outside of the standard
    library, or the buildstream module itself.

    While this can work for a time, it is fragile - I don't think we
    should recommend anything which leaves opportunity for breakage.

    Consider that it has already been challenging for us upstream
    to always maintain a vector of python library dependency versions
    which are guaranteed to work together, some libraries (including
    even ruamel.yaml) need to be restricted or pinned, because
    the python ecosystem is not the shining example of responsible
    actors who care about API stability.

If we go with my venv proposal, we will not have to care about API stability.
If we install each plugin into its own virtual environment, it can express
fully resolved versions of its dependencies, and does not have to care about
what other plugins (or any other host package) need.

I don't think your venv proposal is complete.

Honestly, from your writeup you did not seem to be proposing something
where each and every plugin only exists within its own isolated venv
and namespace.

I even wrote in my last mail that I did consider this but thought it
was completely overkill, and the obstacles to making that work (if even
possible at all) are high.

Can you please explain exactly how we can have all of our plugins
living in isolated venvs and still callable in the same interpreter
process?

Honestly, I *like* this idea, but I just don't see how it can be
possible without a really excessive amount of work, which does not seem
to be justified by the edge case of a few plugins which import external
python libraries.

Angelos's email is insightful; maybe there is a well known trick for
doing this with something like pluginbase + venvs that you are aware of
and I am not?

    Now while this more or less works for an "application" like
    BuildStream, who only needs to care about itself working against a
    precisely selected set of library dependency versions; if we were
    to extend this to say that we encourage plugins to start importing
    external python dependencies - it would be the same as saying
    that multiple python applications with potentially differing
    version requirements on the same libraries need to work within
    the same process/interpreter environment; in other words, it
    just very practically cannot work.

    So yes, a *few* (ostree, docker, maybe pip?) import external
    python libraries, ones which we pretty much deem to be stable,
    but I think promoting this is not viable - in fact we should
    really make it a rule, and have source plugins *only* ever
    call out to host tools.

Why is that? I am not sure I understand why using a host tool instead of
a library is automatically going to fix the issues with API instability. If
a tool has both a Python API and a CLI tool (like BuildStream itself), and
is unstable, then swapping API for CLI isn't magically going to make it stable.
In fact, using the Python library is often cleaner in my opinion as one does
not have to worry about subprocesses etc.

I think you are really reading completely past what I wrote, and
replying to something different.

The above explains how Python libraries don't even *intend* to be
stable; it is *common practice* to use the same python library name to
reflect multiple different APIs, and to inform users of such libraries
that such and such version of the library has such and such API - that
is not what API stability *is* (and this is exactly what ruamel.yaml is
doing now).

Further, since that is the way many python library maintainers *intend*
for things to work, we cannot provide any guarantee that two plugins
which require separate specific versions of the same library can work
at all; because the library doesn't change its name, we cannot import
both into the same interpreter environment.


Maybe your venv proposal can solve the above, in which case it will no
longer be an issue, but I don't really think you have proposed it yet;
it looks like something needs to be invented for that to happen, unless
you know of a solution for precisely this and I am not yet aware of its
existence.


  * We don't bless a VCS if we use venvs

    No, but we *do* end up blessing PyPI and python packages;
    this is *at least* as bad as blessing a single VCS
    (and I think the whole world is basically onboard with git
    at this stage really).

I think it is really ironic you say that when one of the Core source plugins
is `bzr`. If someone is using `bzr` to manage their source code, I'd say there
is a good chance that they may want to use it for storing their BuildStream
plugins as well.

We cannot dictate to anyone how to manage their source code, especially
when most of the modules someone wants to build with BuildStream are not
even managed in their own VCS but obtained from a third party.

We can certainly dictate that if you want to maintain a BuildStream
plugin via the `git` origin, that you must host it in a git repository.

I think blessing PyPI is less bad as that is the standard way to distribute
Python packages, which is what BuildStream plugins are at the end of the day.

That is again false.

A BuildStream plugin is a *file*.

By python's definitions, a package can be one of two things:

  * A directory of python files which contains an __init__.py

  * Something which can be distributed via PyPI, complete with
    the package metadata bells and whistles

BuildStream doesn't require either of the above: A BuildStream plugin is
a python *file* which BuildStream carefully loads into an isolated
namespace (such that multiple, differing plugins of the same name can
exist in the same interpreter).
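
To illustrate what I mean by isolated namespaces, the trick is roughly
this (a simplified sketch, not the actual loading code, which builds on
pluginbase; paths here are illustrative):

    import importlib.util

    def load_plugin(filename, unique_name):
        # Register the module under a unique per-origin name, so
        # that two different plugins which are both called e.g.
        # "git.py" can coexist in the same interpreter
        spec = importlib.util.spec_from_file_location(unique_name, filename)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        return module

    foo_git = load_plugin('/path/to/foo/git.py', 'foo-origin-git')
    bar_git = load_plugin('/path/to/bar/git.py', 'bar-origin-git')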


    We also impose that plugin repositories do some python
    package voodoo when, really, BuildStream plugins only
    need to be a file (you admit to this in your writeup as well).

We do not force anyone to use Git to version control their plugins, we do not
even require that plugins be in version control at all. However, we do require
that they be written in Python. Since we already require that the plugins be
written in Python, I don't think it is completely unreasonable to expect plugin
authors to also write a 10-line setup.py, which is the de facto python packaging
system.

Python is a language choice; it is by no means at all a choice to
participate or buy into Python's packaging and distribution frameworks.

If we can avoid requiring that setup.py, since we already treat plugins
as simple python files (not packages), then I think we provide added
value to plugin authors.

  * We don't have a venv to install into, anyway.

    BuildStream runs where you install it, and if you are going to
    install plugins and their dependencies into a venv, BuildStream
    needs to also be installed in that same venv.

Yes, but I am proposing that we create a virtual environment for our plugins.
BuildStream should manage it as an internal implementation detail. For example,
we can use `pip` to download packages along with their dependencies in a
directory that we control and then use variables like `PYTHONPATH` to ensure
that we can find the plugin correctly. There can be other solutions with a
similar approach as well.

I highly doubt that it is as simple as just using PYTHONPATH.

We would need to ensure that plugins import everything in separate
namespaces; probably this extends to overriding python's import
mechanisms for plugins altogether (as pluginbase does), in order to
redirect all imports to the virtual env we've created.
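
To make the problem concrete, the PYTHONPATH approach as I understand
it amounts to something like this (a sketch, the directory names are
hypothetical):

    import subprocess
    import sys

    # Stage one plugin's dependencies into a private directory
    deps = '/cache/plugins/bst-external/deps'
    subprocess.run([sys.executable, '-m', 'pip', 'install',
                    '--target', deps, 'requests==2.21.0'],
                   check=True)
    sys.path.insert(0, deps)

    import requests  # resolves to the staged copy

    # But sys.path and sys.modules are process wide: a second plugin
    # staged with a different requests version will still get the
    # already imported module, unless imports are redirected per
    # plugin - which is exactly the part that needs inventing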

Cheers,
    -Tristan


