Re: Proposal: JSON for artifact metadata instead of YAML
- From: Tristan Van Berkom <tristan vanberkom codethink co uk>
- To: Angelos Evripiotis <angelos evripiotis gmail com>, buildstream-list gnome org
- Subject: Re: Proposal: JSON for artifact metadata instead of YAML
- Date: Wed, 01 Nov 2017 17:04:18 +0900
On Tue, 2017-10-31 at 12:39 +0000, Angelos Evripiotis wrote:
> Hi!
> I'd like to propose that we use the Python library json module for saving and
> loading artifact metadata, instead of BuildStream's _yaml module.
Hi Angelos,
Needless to say, I dislike this very, very much.
Don't consider this a flat-out "No", but I do think it is premature. I
feel that before we make such an undesirable change we should evaluate
it properly - some of this proposal is founded on speculation, and we
should think more before making this trade-off.
In advance of my verbose replies, I'm going to propose an alternative
and, IMO, more productive course of action - you obviously take
performance very seriously, and so do I, so let's do it properly.
To this end, I suggest that you do the following:
  o Collect some use cases where performance is critical, I mean use
    cases that are visible from the BuildStream CLI frontend.

    E.g. `bst build`, `bst checkout`, etc.

    What about cases where `bst build` is using import elements vs
    compose elements, etc. ?

  o Create a benchmarks test suite based on these frontend toplevel
    cases, not based on internal operations.

    I would like to see graphs rendered from collected data points,
    using simulated projects:

    o What is the "per element" build time for a simulated project
      with only import elements ?

      What does the graph look like when simulating one element vs
      500 elements vs 1000 elements vs 10,000 elements, etc. ?

    o How does the above graph compare with an import element which
      stages 1 file, vs 10, 100, 10,000, 100,000 files ?

Further, we should re-run the whole benchmark suite for every publicly
released BuildStream version over time. How does 1.0 compare to 1.1,
1.2, etc. - have we regressed or improved performance ? Let's really
know about it. A rough sketch of such a harness follows below.
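To make this concrete, here is a rough sketch of the kind of harness I
have in mind. The project layout, the element contents and the choice
of `bst show` as the measured command are only assumptions for
illustration - a real suite would also cover `bst build`, compose
elements, varying file counts and so on:

    # Illustrative benchmark harness sketch - not a finished design.
    import os
    import subprocess
    import tempfile
    import time

    def make_project(directory, num_elements):
        # Generate a throwaway project containing N trivial import
        # elements, all staging the same single file.
        os.makedirs(os.path.join(directory, "files"))
        with open(os.path.join(directory, "files", "hello.txt"), "w") as f:
            f.write("hello\n")
        with open(os.path.join(directory, "project.conf"), "w") as f:
            f.write("name: benchmark\nelement-path: elements\n")
        os.makedirs(os.path.join(directory, "elements"))
        for i in range(num_elements):
            element = os.path.join(directory, "elements", "import-%d.bst" % i)
            with open(element, "w") as f:
                f.write("kind: import\nsources:\n"
                        "- kind: local\n  path: files/hello.txt\n")

    def time_command(directory, args):
        # Time one toplevel CLI invocation - what the user actually sees.
        start = time.perf_counter()
        subprocess.run(["bst"] + args, cwd=directory, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        return time.perf_counter() - start

    for count in (1, 500, 1000, 10000):
        with tempfile.TemporaryDirectory() as directory:
            make_project(directory, count)
            targets = ["import-%d.bst" % i for i in range(count)]
            elapsed = time_command(directory, ["show"] + targets)
            print("%6d elements: %8.2fs total, %.4fs per element"
                  % (count, elapsed, elapsed / count))

Rendering those data points as graphs, and re-running the same harness
against each released version, is then mostly a matter of storing the
numbers somewhere.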
Comments in line below...
> I think we should do this mostly for performance and a bit for security.
> Here 'artifact metadata' means the data that is part of a build artifact, and
> stored in the 'meta' directory, e.g. all these files:
> ~/.cache/buildstream/artifacts/extract/project/element/22ba6/meta/*.yaml
> I think _yaml is very good for configuration, i.e. .bst files and
> project.conf, and am not suggesting that we change that.
> As far as I can tell, JSON supports the current and proposed uses for artifact
> metadata. If adopted, I imagine we'd have some period of backwards
> compatibility before dropping support for the old .yaml metadata files (pre
> v1). I also volunteer to undertake the work to do it before v1 (this week?).
Backwards compatibility is not a concern, at all. Artifact format is
not stable for 1.0, nor is there even any public entry point in 1.0 API
for extracting a full artifact and looking at the metadata.
If people go poking into the private BuildStream caches and things
break, that is entirely their own problem.
For the duration of the 1.1 dev cycle at *least*, we are free to change
the artifact format as frequently as we want; this is simply a matter of
bumping the core artifact version, which ensures that new cache keys are
created and BuildStream won't ever load an artifact it doesn't
understand - simple as that.
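To illustrate the mechanism (this is not BuildStream's actual cache key
code, only a sketch): because a core artifact version participates in
every cache key, bumping it means every element gets a new key, and
artifacts produced under the old format are simply never looked up
again.

    # Sketch only - not BuildStream's real implementation.
    import hashlib

    CORE_ARTIFACT_VERSION = 1   # bump whenever the artifact format changes

    def cache_key(element_description):
        # The version constant is hashed together with everything else
        # that describes the element, so changing it changes every key.
        payload = "%d\0%s" % (CORE_ARTIFACT_VERSION, element_description)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    print(cache_key("kind: import, sources: [...]"))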
> Here are the considerations I made.
> Consistency
> -----------
> Consistency is good, ideally we wouldn't use a different format for each thing.
> Use-cases should be quite divergent before you consider adding another format
> to worry about. This is a point against JSON for artifact metadata.
Let's elaborate a bit on what "Consistency" means for us...
Artifact format (i.e. the layout and content of an artifact) is not
stable now, nor will it be for 1.0, but there have been some decent
arguments made for making the artifact format stable; this might even
happen as early as 1.2 (the next stable BuildStream).
This means that whatever decisions we make now regarding artifact
format will follow us *forever* - and this inconsistency will become
an inconvenience to users.
We already have some outputs in YAML format intended for machine
readability, e.g. `bst workspace list` outputs YAML because it was
considered that some projects may want to automate workspace creation,
introspection and deletion for CI purposes - similarly, `bst show`
outputs YAML for some things, and this is also a focal point for
scripting and automation around BuildStream.
If we introduce JSON in a part of our public API surface (which
artifact metadata will inevitably become if we make artifact format
stable and expose it to external tooling), that means that projects
which consume BuildStream will have to use multiple parsers and
understand multiple formats to interact with BuildStream and script
around it, forever.
This significantly detracts from the simplicity / ease of use of
BuildStream as a whole, and as such it is a highly undesirable outcome.
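For example, a project scripting around BuildStream today only ever
needs one parser. Here is a sketch of what CI automation over
`bst workspace list` might look like - the exact keys in the output
("workspaces", "element", "directory") are assumptions for
illustration, the point is that it is all the same format:

    # Sketch of CI automation over BuildStream's YAML outputs.
    # The output keys used below are assumptions for illustration.
    import subprocess
    from ruamel import yaml

    output = subprocess.check_output(["bst", "workspace", "list"])
    data = yaml.safe_load(output.decode("utf-8"))

    for workspace in data.get("workspaces") or []:
        print("%s -> %s" % (workspace.get("element"),
                            workspace.get("directory")))

The moment artifact metadata becomes JSON, scripts like this one need a
second parser and a mental note of which BuildStream output is in which
format.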
> Performance
> -----------
> This is the main motivation for this proposal, with the idea that we will
> need to work with platforms that have on the order of 100,000 files.
As I understand it, your primary motivation is exactly this: 100,000
file manifests.
Your case is that for a single artifact with 100,000 files, parsing
YAML is more costly than parsing JSON; this by itself is a strong
point.
Since we are talking about an API surface which will become stable, I
think we have to ask ourselves:
"Is loading 100,000 file manifests slower to parse with existing YAML
parsers than with existing JSON parsers ? Or is YAML really
inherently that much more difficult to parse over JSON ?
Is this something very silly, like for instance are we comparing
a python parser implementation with a python binding to an
underlying, optimized json-c library ?"
I believe we would be doing ourselves a great disservice by making a
decision as permanent and undesirable as this, just because it would
cost a bit more up front to get load speeds on par "right now". Doing
things "nicely" is "hard", yes - but is it so "hard" that we have to
sacrifice "nice", permanently ?
There are a lot of other things to consider here:
  o I highly doubt that the order-of-magnitude difference between YAML
    and JSON load times is a linear increase - was this taken into
    consideration, or even measured ?

    My gut tells me that the load time of one list element increases
    with the size of your data set - which might only be justified in
    cases where the "&" and "*" operators actually start appearing in
    the parsed YAML stream. This would indicate that it is possible to
    achieve load times similar to JSON, at least in the cases where the
    referencing "&" and "*" do not appear (see the sketch after this
    list).

  o How does this really impact the BuildStream session ?

    - Which kinds of BuildStream sessions necessitate loading the
      metadata, and if they need to load the metadata, do they also
      have to load the manifest ?

      It looks like we might need this mostly for `bst build` and
      `bst checkout`; most other commands don't need this operation.

    - In the case of `bst build` - my gut is telling me that we're
      talking about a manifest that we can load once, after which point
      it will already be on the heap.

      Realistically, you might have a couple of mega-huge artifacts to
      stage in the course of a build; they need to be staged hundreds
      of times but only need to be parsed once.

      So, are we optimizing a build that takes around 4 hours to
      complete by a couple of minutes ?

      Would shaving a couple of minutes off a 4 hour build really
      be worth this inconsistency ?

    Before coming to drastic conclusions here, we should be analyzing
    the data and running benchmarks on the BuildStream frontend against
    simulated data sets which resemble real world use cases.

  o You have also left out the part where, apparently, performance is
    quite decent with the newer YAML interfaces from ruamel.yaml, so
    long as we stay away from loading dictionaries:

    https://gitlab.com/BuildStream/buildstream/issues/82#note_44846694

    Which means to me that denormalizing the metadata, at least the
    manifest, is an option we can consider before resorting to
    introducing a second data format into potentially public API.

    If we had e.g. manifest.yaml separate from manifest_attrs.yaml, we
    could get away with avoiding dictionaries for one (which is probably
    still a cheap workaround to a problem that is better addressed in
    the ruamel.yaml library) - but we would also gain the possibility
    of lazy loading (see the sketch after this list).

    Maybe we only need to know the full list at times, and maybe we only
    need to know the attributes of things at other times.

  o We currently don't even have a manifest in play, which just adds
    more to how premature this proposed change is.
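Both the linearity question and the denormalization idea are cheap to
check before we commit to anything. Here is a sketch of the measurement
I mean - the flat manifest.yaml / manifest_attrs.yaml split and the
entry counts are only illustrative:

    # Sketch covering two of the questions above: does YAML load time
    # grow linearly with manifest size, and how much do we pay for a
    # dict per file versus a flat list of names ?
    # File names and layout are made up for illustration.
    import time
    from ruamel.yaml import YAML

    yaml = YAML(typ="safe")

    def load_time(document):
        with open("manifest.yaml", "w") as f:
            yaml.dump(document, f)
        start = time.perf_counter()
        with open("manifest.yaml") as f:
            yaml.load(f)
        return time.perf_counter() - start

    for count in (1000, 10000, 100000):
        names = [str(x) for x in range(count)]
        flat = load_time({"files": names})                              # manifest.yaml style
        dicts = load_time({"files": [{"filename": n} for n in names]})  # manifest_attrs.yaml style
        print("%7d entries: flat %.2fs (%.2fus/entry), dicts %.2fs (%.2fus/entry)"
              % (count, flat, 1e6 * flat / count, dicts, 1e6 * dicts / count))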
> buildstream._yaml.load() is significantly slower than json.load():
> 
> # json saving
> time python3 -c 'data = {"files": [{"filename": str(x)} for x in
> range(100000)]}; import json; f = open("data.json", "w");
> json.dump(data, f); f.close();'
> # real    0m0.463s
> # user    0m0.440s
> # sys     0m0.010s
> 
> # json loading
> time python3 -c 'import json; f = open("data.json"); loaded =
> json.load(f); f.close(); print("\n".join(x["filename"] for x in
> loaded["files"]));' > data.txt
> # real    0m0.094s
> # user    0m0.080s
> # sys     0m0.010s
> 
> # _yaml saving
> time python3 -c 'data = {"files": [{"filename": str(x)} for x in
> range(100000)]}; import buildstream; buildstream._yaml.dump(data,
> "data.yaml");'
> # real    0m15.008s
> # user    0m14.730s
> # sys     0m0.160s
> 
> # _yaml loading
> time python3 -c 'import buildstream; loaded =
> buildstream._yaml.load("data.yaml"); print("\n".join(x["filename"] for
> x in loaded["files"]));' > data.txt
> # real    2m58.688s
> # user    2m58.400s
> # sys     0m0.260s
> 
> As discussed on issue 82 [8], there are faster loading options available but we
> would have to use an interface that's not considered stable yet. Here are the
> times for that:
> 
> # YAML(typ="safe") saving
> time python3 -c 'data = {"files": [{"filename": str(x)} for x in
> range(100000)]}; from ruamel.yaml import YAML; yaml =
> YAML(typ="safe"); f = open("data.yaml", "w"); yaml.dump(data, f);
> f.close();'
> # real    0m2.553s
> # user    0m2.480s
> # sys     0m0.060s
> 
> # YAML(typ="safe") loading
> time python3 -c 'from ruamel.yaml import YAML; yaml = YAML(typ="safe");
> f = open("data.yaml"); loaded = yaml.load(f); f.close();
> print("\n".join(x["filename"] for x in loaded["files"]));' > data.txt
> # real    0m2.787s
> # user    0m2.610s
> # sys     0m0.170s
> 
> Unfortunately that's still relatively slow.
> Security
> --------
> Artifact metadata is intended to be shared between machines, using 'bst pull'
> and friends. This means users may share arbitrary data with each-other, which
> is unlikely to be scrutinized. Code will execute against the data outside the
> sandbox. This seems a possible target for exploits.
The moment you cannot trust the provenance of the artifact you have
received, you already have a problem.
Once you can trust the provenance of the artifact you are receiving,
all of your security concerns become entirely moot.
Yes, the current state of things is fragile; we rely on setups where
we download artifacts over https and only allow uploads from authorized
builders over ssh. A proper solution will be to add GPG signing as a
first class citizen of BuildStream, and to ensure trust from end to
end.
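As a rough illustration of the direction (the file names and the use of
a detached signature here are assumptions, not a design for how
BuildStream would actually do it):

    # Rough illustration only - not a proposed BuildStream design.
    # Refuse to trust pulled artifact metadata unless a detached
    # signature from an authorized builder key verifies against it.
    import subprocess

    def verify_artifact(metadata_path, signature_path, keyring_path):
        result = subprocess.run(
            ["gpg", "--no-default-keyring", "--keyring", keyring_path,
             "--verify", signature_path, metadata_path],
            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        return result.returncode == 0

    if not verify_artifact("meta/artifact.yaml", "meta/artifact.yaml.sig",
                           "authorized-builders.gpg"):
        raise SystemExit("Refusing to use artifact with unverifiable provenance")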
Cheers,
-Tristan