Re: Proposal: JSON for artifact metadata instead of YAML

From: Angelos Evripiotis <angelos evripiotis gmail com>
To: Tristan Van Berkom <tristan vanberkom codethink co uk>
Cc: buildstream-list gnome org
Subject: Re: Proposal: JSON for artifact metadata instead of YAML
Date: Wed, 1 Nov 2017 15:53:31 +0000

Hi Tristan,

Needless to say I think; I dislike this very very much.


I'm sorry for that! I did get the impression from the discussion on issue #82
that this might be the case. I had hoped to win you over with detailed points,
but I think instead I've caused you to write a lot about something you'd rather
not be discussing right now.

Backwards compatibility is not a concern, at, all. Artifact format is
not stable for 1.0, nor is there even any public entry point in 1.0 API
for extracting a full artifact and looking at the metadata.

If people go poking into the private buildstream caches and things
break, that is entirely their own problem.

For the duration of the 1.1 dev cycle at *least*, we are free to change
artifact format as frequently as we want, this is simply a matter of
bumping the core artifact version which ensures that new cache keys are
created and BuildStream won't ever load an artifact it doesn't
understand - simple as that.


Good news! Thanks for explaining.
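
To check my understanding of the version bump, here is a minimal sketch. It is
not BuildStream's actual cache key code: the names `ARTIFACT_VERSION` and
`cache_key`, and the shape of the config, are hypothetical. The only point is
that if the artifact version is part of what gets hashed, bumping it yields new
keys for otherwise identical elements, so old artifacts are never looked up.

```python
import hashlib
import json

# Hypothetical constant: bumped whenever the artifact layout changes.
ARTIFACT_VERSION = 1


def cache_key(element_config, artifact_version=ARTIFACT_VERSION):
    """Return a hex digest covering the element config and artifact version."""
    payload = {"artifact-version": artifact_version, "config": element_config}
    # sort_keys gives a canonical serialisation, so equal inputs hash equally.
    canonical = json.dumps(payload, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Made-up element config, for illustration only.
config = {"kind": "autotools", "sources": ["git:example"]}
old_key = cache_key(config, artifact_version=1)
new_key = cache_key(config, artifact_version=2)
```

Here `old_key` and `new_key` differ even though the element itself is unchanged,
which is the property described above.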

Let's elaborate a bit on what "Consistency" means for us...

Artifact format (i.e. the layout and content of an artifact) is not
stable now, nor will it be for 1.0, but there have been some decent
arguments made for considering making artifact format stable, this
might even happen as early as 1.2 (next stable BuildStream).

This means that whatever decisions we make now regarding artifact
format, will follow us *forever* - and this inconsistency will become
inconvenient to users.


Perhaps I treated consistency too lightly in my comments; I take your points
and also want to avoid permanent inconvenience.

We already have some outputs which are yaml format intended for machine
readability, e.g. `bst workspace list` outputs yaml because it was
considered that some projects may want to automate workspace creation,
introspection and deletion for CI purposes - similarly `bst show`
outputs yaml for some things, and this is also a focal point for
scripting and automation around BuildStream.


Although I didn't say it before, I'm thinking JSON would be more convenient for
general machine readability too; my explanation follows below.

If we introduce JSON in a part of our public API surface (which
artifact metadata will inevitably become if we make artifact format
stable and expose it to external tooling), that means that projects
which consume BuildStream will have to use multiple parsers and
understand multiple formats to interact with BuildStream and script
around it, forever.

This significantly detracts from the simplicity / ease of use of
BuildStream as a whole, and as such it is a highly undesirable outcome.


I would agree for many formats, but in the case of YAML and JSON I think we're
in luck. It turns out that JSON is a subset of YAML as of YAML 1.2, so any
YAML 1.2 reader can also read JSON:

    "The primary objective of this revision is to bring YAML into compliance
    with JSON as an official subset."

    http://yaml.org/spec/1.2/spec.html (from late 2009)
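
As a rough illustration of that claim, the sketch below writes some metadata
with the standard `json` module and reads the same text back with both a JSON
parser and, if one happens to be installed, a YAML parser. The metadata fields
are made up for the example and are not the real artifact format. Using PyYAML
here is also just an assumption: it targets YAML 1.1 rather than 1.2, so this
only shows the idea for simple data and is not a guarantee for every JSON
document.

```python
import json

try:
    import yaml  # PyYAML; optional, and an assumption of this sketch
except ImportError:
    yaml = None

# Hypothetical artifact metadata; the field names are made up for illustration.
metadata = {
    "keys": {"strong": "abc123", "weak": "def456"},
    "dependencies": ["base.bst", "lib.bst"],
    "workspaced": False,
    "public": None,
}

# Serialise once, as JSON.
text = json.dumps(metadata)

# Any JSON parser reads it back unchanged.
from_json = json.loads(text)

# A YAML parser should accept the very same text, if one is installed.
from_yaml = yaml.safe_load(text) if yaml is not None else None
```

If this holds up, a project that already parses our YAML output would not need
a second parser to read JSON metadata.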

Also, in support of Sam's comment, the YAML spec seems to suggest that JSON is
better for scripting, and it says YAML is better for humans:

    "Both JSON and YAML aim to be human readable data interchange formats.
    However, JSON and YAML have different priorities. JSON’s foremost design
    goal is simplicity and universality. Thus, JSON is trivial to generate and
    parse, at the cost of reduced human readability. It also uses a lowest
    common denominator information model, ensuring any JSON data can be easily
    processed by every modern programming environment.

    In contrast, YAML’s foremost design goals are human readability and support
    for serializing arbitrary native data structures. Thus, YAML allows for
    extremely readable files, but is more complex to generate and parse. In
    addition, YAML ventures beyond the lowest common denominator data types,
    requiring more complex processing when crossing between different
    programming environments."

    http://www.yaml.org/spec/1.2/spec.html#id2759572

I believe we would be doing ourselves a great disservice to be making a
decision as permanent and undesirable as this, just because it would
cost a bit more up front to get load speeds on par "right now".


I'm not really looking at JSON vs YAML as a performance hack.

I imagine it will be possible for us to contribute optimisations to ruamel,
though I haven't yet asked what they think about it. Perhaps they'd accept an
optimised C extension, and we might be able to afford to create it.

I'm thinking more that given my quotes from the YAML spec, optimising YAML
meant for machine consumption over just using JSON might be "doing it wrong",
and going against the grain. YAML is more complicated, for purposes that don't
apply to scripting around our output.

Given that JSON is a subset of YAML, I think it's fair to say JSON is the more
widely supported of the two, which is better for would-be scripters.

We might also pass this optimisation burden on to everyone who works with our
data. I would have to repeat the experiments with other common YAML parsers to
be confident in this point, maybe they're all super quick :)
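
Repeating the experiment for another parser could look something like the
sketch below. It is only a harness under assumptions: `time_loader` is a
hypothetical helper, the sample data is synthetic rather than real artifact
metadata, and only the standard `json` loader is timed here. A YAML loader
could be passed in the same way to compare.

```python
import json
import timeit


def time_loader(loader, text, repeats=5, number=100):
    """Return the best observed per-call time, in seconds, for loader(text)."""
    timings = timeit.repeat(lambda: loader(text), repeat=repeats, number=number)
    # Take the minimum of the repeats to reduce noise from other processes.
    return min(timings) / number


# Synthetic metadata, made up for illustration; not real artifact data.
sample = json.dumps({"dependencies": ["element-%d.bst" % i for i in range(1000)]})

json_seconds = time_loader(json.loads, sample)
```

The same `sample` text can be handed to any other loader callable, so each
parser is measured against identical input.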

Doing things "nicely" is "hard", yes - is it so "hard" that we have to
sacrifice "nice", permanently ?


I think we share the same intent to put in the hard work to make sure
BuildStream is done the "nice" way.

Please do call me out if it seems like I'm going for "quick fixes", that's not
my intention here or with BuildStream in general :)

There are a lot of other things to consider here:


I'll leave the points on the measurements being too narrow - the separate
discussion on benchmarking from the user's perspective is something we both
support.

In general I agree with you that the numbers I presented are narrowly focussed
on something which is not yet in BuildStream. I have no evidence to suggest
it's any concern for current usage.

Once you can trust the provenance of the artifact you are receiving,
all of your security concerns become entirely moot.


For security I think we both agree that trusted provenance will be a big help.
I disagree that it solves serialization safety concerns completely, but that is
maybe a topic for once trusted provenance is in place.

Thanks for spending time on this topic, despite your dislike for it.

Cheers!
Angelos

