[BuildStream] YAML rework - Another go



Hi all,

James, Ben, Chandan, Angelos, Gökçen, and I have been discussing ways
to improve matters in the YAML layer for some time.

At the gathering, Tristan Maat and Gökçen had a go at a preliminary approach
replacing ruamel.yaml with pyyaml and tracking pathways through the YAML so
that provenance could be reconstructed later if needed.

James continued to look at that, but the conclusion we reached was that
tracking pathways was almost as much work to maintain as the provenance data
itself had been in the first place.

Given that, we needed to find a way to represent provenance that is easier
to copy, merge, and track, that we could use universally, and that is still
faster than the current approach.

Our current approach is very object-heavy, which makes pickling/unpickling
complex and causes difficulties during node composition, among other
operations.

Having carefully written out what I believe to be a complete statement of valid
YAML as regards BuildStream's inputs, I have designed a data format which I
believe will give us all the properties we need, but without the downsides of
the current approach.

As a comparison point, simply parsing the YAML with ruamel.yaml's CBaseLoader,
into non-provenanced and non-roundtrippable data structures, takes
approximately 15s with a max-rss of 158M on my laptop, for the test set I was
using (debian-stack.bst) with a trivial loader which chases the dependencies
plainly, without using BuildStream's include logic etc.  Using the same base
loader, but with a representer of my own devising, which only handles
structures valid for BuildStream and records provenance data as it goes,
loading the same data takes 13.5s with a max-rss of 182M, provenance included.
As a final comparison, a pickle of the same data, loaded with BuildStream's
current provenance approach into round-trippable ruamel structures, is 214M at
rest on disk and takes around a minute to fully unpickle into memory.

Over the coming week, James and I intend to produce a prototype branch with the
_yaml.py functionality replaced by this new data structure, in an attempt to
assess whether this speedup will carry forward longer-term.

If anyone can spot holes in our approach, we would gratefully receive hints
and tips, and we will be pleased to discuss it on #buildstream or here on the
list.

D.

Appendix 1 - Definition of BuildStream's inputs:

* Each file **MUST** be a YAML mapping.
* Each mapping key **MUST** be a string.
* Integers, floats, etc. are deserialised as strings
* Values are one of:
    * String
    * List
    * Mapping
    * CompositionContainer
* Mappings may optionally contain key/value pairs
* Mappings may optionally contain an IncludeDirective which is
  a special key `(@)` with a value of either a String, or a List of Strings.
  This *is* a key/value pair in the mapping and is handled the same as any
  other from the point of view of composition etc.
* Mappings may optionally contain a ConditionalContainer which is
  a special key `(?)` whose value is a list of ConditionalDirectives.
  This *is* a key/value pair in the mapping and is handled the same as any
  other from the point of view of composition etc.
* A ConditionalDirective is represented as a mapping consisting of a single
  key which matches the ConditionSyntax and whose value is a Mapping
* Mappings may optionally contain an Assertion which is a Mapping consisting of
  a single key `(!)` whose value is an error message which will be reported to
  the user before further processing stops.  This occurs only during option
  processing and as such they are only valid as values of ConditionalDirectives
* A CompositionContainer is a Mapping whose entries are all
  CompositionDirectives.  A Mapping which has mixed entries, both compositional
  and non-compositional is considered to be invalid and will result in an error
  lazily during the composition process.
* A CompositionDirective is one of CompositionPrepend, CompositionAppend,
  or CompositionReplace.
* CompositionPrepend is a key `(<)` whose value is a list.
* CompositionAppend is a key `(>)` whose value is a list.
* CompositionReplace is a key `(=)` whose value is a list.

In brief, files are mappings.  Mappings always have string keys.  Values may be
strings, lists, or further mappings.  Lists may be empty, or contain a
heterogeneous combination of strings, lists, and mappings.

Certain mapping/key combinations are special (cf. composition etc.).
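The structural rules from Appendix 1 can be sketched as a small validator over
already-parsed YAML (plain dicts, lists, and strings).  This is an
illustrative sketch, not BuildStream's actual code; the name `validate_node`
and the error messages are hypothetical.

```python
# Illustrative sketch of the structural rules above, operating on
# already-parsed YAML.  Names here are hypothetical, not BuildStream's.

COMPOSITION_KEYS = {"(<)", "(>)", "(=)"}  # prepend / append / replace


def validate_node(node):
    """Check a parsed node against the BuildStream input rules."""
    if isinstance(node, str):
        return True
    if isinstance(node, list):
        # Lists may be empty, or hold a heterogeneous mix of values.
        return all(validate_node(v) for v in node)
    if isinstance(node, dict):
        if not all(isinstance(k, str) for k in node):
            raise ValueError("mapping keys must be strings")
        comp = set(node) & COMPOSITION_KEYS
        # Mixing compositional and non-compositional entries is invalid.
        if comp and set(node) - comp:
            raise ValueError("mixed compositional/non-compositional mapping")
        return all(validate_node(v) for v in node.values())
    raise ValueError(f"invalid value type: {type(node).__name__}")


# A valid fragment whose "depends" value is a CompositionContainer:
fragment = {
    "kind": "manual",
    "depends": {"(>)": ["base.bst"]},
}
assert validate_node(fragment)
```

Note that, as stated above, the mixed-entries error is only raised when the
offending mapping is actually visited, which matches the intent of reporting
it lazily during composition.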

Appendix 2 - Raw data representation in memory

* Mappings are represented as dictionaries with a special key for the provenance data.
  Using that key in a mapping in a YAML file is disallowed.
* Provenance data for a mapping is a tuple of (file number, line number, column number)
  and represents where the mapping *starts*.
* Mapping keys are plain python strings
* String values are represented as tuples of the string itself, and the three location
  elements as per the mapping provenance tuple.
* Empty lists are represented as tuples of a sentinel and the three location elements,
  where the location is the sequence start.
* Non-empty lists discard the sequence start provenance and rely instead on the provenance
  data of the values in the list.
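A minimal sketch of that representation follows.  The key name
`PROVENANCE_KEY` and the sentinel object are illustrative choices of mine,
not the names the prototype will necessarily use.

```python
# Illustrative sketch of the in-memory representation described above.
# The key name and sentinel are hypothetical, not BuildStream's actual ones.

PROVENANCE_KEY = "__bst_provenance_info"  # disallowed as a key in YAML files
EMPTY_LIST = object()                     # sentinel for empty sequences


def make_mapping(file_no, line, col):
    # The provenance tuple records where the mapping *starts*.
    return {PROVENANCE_KEY: (file_no, line, col)}


def make_string(value, file_no, line, col):
    # Strings carry their own location alongside the text.
    return (value, file_no, line, col)


def make_empty_list(file_no, line, col):
    # Empty lists keep the sequence-start location via the sentinel.
    return (EMPTY_LIST, file_no, line, col)


# Loading a fragment like "kind: manual" from file number 0 might yield:
node = make_mapping(0, 1, 0)
node["kind"] = make_string("manual", 0, 1, 6)
assert node["kind"][0] == "manual"
assert node[PROVENANCE_KEY] == (0, 1, 0)
```

Everything here is a plain dict, tuple, or string, which is what makes the
structures cheap to copy, merge, and pickle compared with the current
object-heavy nodes.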

The file number is simply a monotonic index for files loaded by the loader.  It could
be replaced by an interned string of the file name without impacting performance, though
it would increase serialisation cost if we serialise the data in a non-pickle format.

By being more strict about the input format and data representation, it is our
expectation that composition and other complex operations may become cheaper;
and it would be possible to move them entirely into compiled code in the future
if further performance gains are required.  For now we propose continuing to
keep all this in Python.
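To illustrate why the stricter format should make composition cheap, here is a
hedged sketch of composing a list under the three composition directives.
This is an illustration only, not the proposed implementation, and the order
in which a container's directives apply here ((=), then (<), then (>)) is an
assumption made for the example.

```python
# Illustrative sketch of applying a CompositionContainer's directives
# ((<) prepend, (>) append, (=) replace) to a target list.
# Not BuildStream's actual composition code; directive ordering is assumed.

def compose_list(target, directives):
    """Return `target` with the container's directives applied."""
    if "(=)" in directives:
        target = list(directives["(=)"])       # replace discards the original
    if "(<)" in directives:
        target = list(directives["(<)"]) + target
    if "(>)" in directives:
        target = target + list(directives["(>)"])
    return target


base = ["b.bst"]
overlay = {"(<)": ["a.bst"], "(>)": ["c.bst"]}
assert compose_list(base, overlay) == ["a.bst", "b.bst", "c.bst"]
```

Because the inputs are plain lists and dicts, operations like this reduce to
list concatenation, which is exactly the kind of work that could later move
into compiled code if needed.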

-- 
Daniel Silverstone                          https://www.codethink.co.uk/
Solutions Architect               GPG 4096/R Key Id: 3CCE BABE 206C 3B69
