[BuildStream] Speeding up buildstream init



Over the past week or so, I've been looking at speeding up buildstream startup times (raised in https://gitlab.com/BuildStream/buildstream/issues/466).

After spending some time with a profiler, I can divide up the points of slowdown as follows:
* Calculating the artifact cache size on startup
* Loading the pipeline
  - Reconstructing the pipeline after excluding specified elements
  - Loading the elements
    - Parsing yaml from files
    - Extracting the elements' environment
    - Extracting the elements' variables (also initializing a Variables object)
    - Extracting the elements' public data
    - Extracting the elements' sandbox config

Below is a detailed analysis of each point of slowdown, with a description of the problem and a proposed solution.

Calculating the artifact cache size on startup
==============================================

Problem
-------

Currently, buildstream/_context.py has to recursively search through the artifact cache to calculate its size. This needs to be done to generate the artifact cache quota, so that buildstream can remove unused artifacts from the cache if it comes close to running out of space.

Previously, manually setting an artifact cache quota would skip this, but that is no longer the case, as buildstream now checks that the cache quota is sensible.

This time sink is extremely apparent to a lot of users, and buildstream gives no indication of what is happening while it runs. The time it takes does not scale directly with the size of the pipeline, but in a pipeline of 10'000 simple elements which took 262 seconds in total, 60 of those seconds were spent calculating the cache size.

Solution
--------

Broadly, the solution is to write the artifact cache size to a file, and read that file back on startup. If the recorded size diverges from reality, it will be corrected when the reported cache size approaches the quota size.

Specifically, this will involve:
* Restructure Context to not calculate the artifact cache size
  - It will store the quota size defined in config
  - The actual quota size will be calculated and stored by the artifact cache
  - The scheduler will read the quota size from the artifact cache instead of the context
* The artifact cache will write its size to disk whenever its size is set internally
* If the artifact cache does not already know its size, it will read the size from a file instead of calculating it (a rough sketch follows)
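
As a rough illustration, here is a minimal sketch of the proposed read/write logic. The class shape, method names and size-file location are all illustrative, not the actual buildstream API:

    import os

    class ArtifactCache():
        def __init__(self, artifactdir):
            self.artifactdir = artifactdir
            self._size_file = os.path.join(artifactdir, "cache_size")
            self._cache_size = None

        def get_cache_size(self):
            # Prefer the in-memory value, then the size file, and only
            # fall back to the expensive recursive walk as a last resort.
            if self._cache_size is None:
                self._cache_size = self._read_cache_size()
            if self._cache_size is None:
                self.set_cache_size(self._calculate_cache_size())
            return self._cache_size

        def set_cache_size(self, size):
            # Persist the size whenever it is set internally, so the
            # next invocation can skip the walk entirely.
            self._cache_size = size
            with open(self._size_file, "w") as f:
                f.write(str(size))

        def _read_cache_size(self):
            try:
                with open(self._size_file, "r") as f:
                    return int(f.read())
            except (FileNotFoundError, ValueError):
                return None

        def _calculate_cache_size(self):
            # The slow path: recursively stat everything in the cache.
            total = 0
            for root, _, files in os.walk(self.artifactdir):
                for name in files:
                    total += os.path.getsize(os.path.join(root, name))
            return total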

Loading the pipeline - excluding elements
=========================================

Problem
-------

On longer pipelines, a comparatively large amount of time is spent on excluding
elements, even if no exclusions are listed.

This time sink is comparatively small (in a pipeline of 10'000 simple elements that took 262 seconds, 28 of those seconds were spent excluding elements), and is included because the solution is very simple.

Solution
--------

In `_pipeline.py:except_elements()`, if `except_targets` is empty, just return `elements` unchanged.
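
A minimal sketch of that early return, assuming a hypothetical signature for `except_elements()`:

    def except_elements(self, targets, elements, except_targets):
        # No exclusions requested: the traversal below would be pure
        # overhead, so hand the elements back unchanged.
        if not except_targets:
            return elements

        # ... existing exclusion logic ...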

Loading the pipeline - parsing yaml from files
==============================================

Problem
-------

The yaml parser we use (ruamel) is slow, but we have been unable to find a faster one that is capable of doing round-trips (i.e. read yaml from file, make changes to it, write out yaml with mostly the same structure).
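
For context, the round-trip behaviour we depend on looks like this (a small self-contained example of ruamel's round-trip mode):

    import io
    from ruamel.yaml import YAML

    yaml = YAML()  # round-trip mode is the default
    data = yaml.load("kind: manual  # a comment\n"
                     "variables:\n"
                     "  foo: bar\n")
    data["variables"]["foo"] = "baz"

    out = io.StringIO()
    yaml.dump(data, out)
    # The comment, key ordering and layout survive the edit:
    print(out.getvalue())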

We spend a lot of time loading yaml. In the simple pipeline of 10'000 elements that took 262 seconds, 51 of those were spent loading yaml.

Solution
--------

My proposed solution here is to cache the loaded yaml in a format that's faster to read back into memory.

Python's built-in object serialisation library, Pickle (https://docs.python.org/3.5/library/pickle.html), proved to be capable of serialising the loaded yaml and its provenance data, with one caveat: the provenance data contains ProvenanceFile objects, which hold a reference to the Project that the file belongs to.

Given that ProvenanceFile.project is currently only used for comparison, not to access any of its members, the simplest solution is to change that field to store the name of the project instead.
If that is not acceptable, I will have to:
* Write a custom pickler/unpickler that has access to a list of all the projects (`Context` stores the projects, so having a reference to the context in the pickler/unpickler will be sufficient).
* When the custom pickler encounters a ProvenanceFile, it will store the name of the project instead of the Project object.
* When the custom unpickler reconstructs a ProvenanceFile, it will look up the project that matches the name and reference that instead (see the sketch below).
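
If the custom route is needed, pickle's "persistent ID" mechanism is a natural fit. A minimal sketch, where the `get_project()` lookup on the context is an assumption about how projects could be resolved by name:

    import pickle

    from buildstream._project import Project  # internal class, shown for illustration

    class ProjectPickler(pickle.Pickler):
        def persistent_id(self, obj):
            # Swap Project references for their name; returning None
            # tells pickle to serialise everything else as usual.
            if isinstance(obj, Project):
                return ("project", obj.name)
            return None

    class ProjectUnpickler(pickle.Unpickler):
        def __init__(self, file, context):
            super().__init__(file)
            self._context = context

        def persistent_load(self, pid):
            kind, name = pid
            if kind != "project":
                raise pickle.UnpicklingError("unsupported persistent id")
            # Hypothetical lookup: resolve the Project back from the
            # context by its name.
            return self._context.get_project(name)

Dumping would then go through `ProjectPickler(f).dump(data)` and loading through `ProjectUnpickler(f, context).load()`.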

With pickling resolved one way or the other, I would then have to:
* In `_yaml.load`, consult whether the cache already has an entry for that filename (plus shortname and project), and check that the inode metadata for that path is the same as when it was cached (i.e. that the file hasn't been modified).
  - Files across junctions will need special handling, as those files don't have a persistent place on disk. Instead, they are valid as long as the junction hasn't changed.
* In `_yaml.load`, if the data was loaded from a file, write it to the cache before returning it (a sketch follows).
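
A sketch of how the cache consultation could look. The helper names and on-disk cache layout here are illustrative, not a settled design:

    import os
    import pickle

    def _cache_key(filename):
        # Identify a file by its inode metadata; if the file changes,
        # its mtime (and usually its size) changes and the cached
        # entry is treated as stale.
        st = os.stat(filename)
        return (st.st_ino, st.st_mtime_ns, st.st_size)

    def load(filename, cache_dir):
        entry = os.path.join(cache_dir, filename.replace(os.sep, "_"))
        key = _cache_key(filename)

        # Fast path: return the cached parse if it is still valid.
        try:
            with open(entry, "rb") as f:
                cached_key, data = pickle.load(f)
            if cached_key == key:
                return data
        except (OSError, pickle.UnpicklingError):
            pass

        # Slow path: parse with ruamel, then populate the cache.
        data = _load_yaml_from_file(filename)  # hypothetical helper
        with open(entry, "wb") as f:
            pickle.dump((key, data), f)
        return data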

This solution comes with some issues:
* What form should this cache take?
* Should the cache be shareable? It might be important for remote execution, but unpickling is not *safe*. A malicious file could be used to execute arbitrary code inside buildstream.

Extracting the elements' environment, variables, public data and sandbox config
===============================================================================

Problem
-------

This is the next-largest time sink: in a simple pipeline of 10'000 elements that took 262 seconds, 44 of those were spent extracting these fields for the elements.

Unfortunately, there doesn't seem to be a simple solution - the majority of the time appears to be spent on string and dict manipulation, and caching the result would be complicated by the sheer number of ways it can be affected.

For example, variables can be affected by:
* The defaults in the element's .yaml file in the source code.
* Overrides to the defaults defined in the project.conf.
* The default project.conf defined in buildstream source code.
* Overrides to the project.conf from user config.
* The default user config defined in buildstream source code.
* The element's bst file.
* Any files included by the bst file.
* Any command-line options that are specified and that the bst file uses.

I will leave this problem alone for now, and come back to it in a later iteration.

===

Thanks for reading. If you have any particular insights/opinions on what caching solution I should use, and how to deal with the potential unsafeness of pickled data, I'd be happy to hear them.

Best regards,

Jonathan.


--
Jonathan Maw, Software Engineer, Codethink Ltd.
Codethink privacy policy: https://www.codethink.co.uk/privacy.html

