[BuildStream] Speeding up buildstream init



Over the past week or so, I've been looking at speeding up buildstream startup times (raised in https://gitlab.com/BuildStream/buildstream/issues/466).

After spending some time with a profiler, I can divide up the points of slowdown as follows:
* Calculating the artifact cache size on startup
* Loading the pipeline
  - Reconstructing the pipeline after excluding specified elements
  - Loading the elements
    - Parsing yaml from files
    - Extracting the elements' environment
    - Extracting the elements' variables (also initializing a Variables object)
    - Extracting the elements' public data
    - Extracting the elements' sandbox config

Below is a detailed analysis of each point of slowdown, with a description of the problem and a proposed solution.

Calculating the artifact cache size on startup
==============================================

Problem
-------

Currently, buildstream/_context.py has to recursively search through the artifact cache to calculate its size. This needs to be done to generate the artifact cache quota, so that buildstream can remove unused artifacts from the cache if it comes close to running out of space.

Previously, manually setting an artifact cache quota would skip this, but that is no longer the case, as buildstream now checks that the cache quota is sensible.

This time sink is extremely apparent to a lot of users, and buildstream gives no indication of what is happening while it runs. The time it takes does not scale directly with the size of the pipeline, but in a pipeline of 10'000 simple elements which took 262 seconds in total, 60 of those seconds were spent calculating the cache size.

Solution
--------

Broadly, the solution is to write the artifact cache size to a file, and read that file back on startup. If the recorded size diverges from reality, it will be corrected when the reported cache size approaches the quota size.

Specifically, this will involve:
* Restructure Context to not calculate the artifact cache size
  - It will store the quota size defined in config
  - The actual quota size will be calculated and stored by the artifact cache
  - The scheduler will read the quota size from the artifact cache instead of the context
* The artifact cache will write its size to disk whenever its size is set internally
* If the artifact cache does not already know its size, it will read the size from a file instead of calculating it (a rough sketch follows)
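
As a rough illustration, here is a minimal sketch of the proposed read/write logic. The class shape, method names and size-file location are all illustrative, not the actual buildstream API:

    import os

    class ArtifactCache():
        def __init__(self, artifactdir):
            self.artifactdir = artifactdir
            self._size_file = os.path.join(artifactdir, "cache_size")
            self._cache_size = None

        def get_cache_size(self):
            # Prefer the in-memory value, then the size file, and only
            # fall back to the expensive recursive walk as a last resort.
            if self._cache_size is None:
                self._cache_size = self._read_cache_size()
            if self._cache_size is None:
                self.set_cache_size(self._calculate_cache_size())
            return self._cache_size

        def set_cache_size(self, size):
            # Persist the size whenever it is set internally, so the
            # next invocation can skip the walk entirely.
            self._cache_size = size
            with open(self._size_file, "w") as f:
                f.write(str(size))

        def _read_cache_size(self):
            try:
                with open(self._size_file, "r") as f:
                    return int(f.read())
            except (FileNotFoundError, ValueError):
                return None

        def _calculate_cache_size(self):
            # The slow path: recursively stat everything in the cache.
            total = 0
            for root, _, files in os.walk(self.artifactdir):
                for name in files:
                    total += os.path.getsize(os.path.join(root, name))
            return total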

Loading the pipeline - excluding elements
=========================================

Problem
-------

On longer pipelines, a comparatively large amount of time is spent on excluding
elements, even if no exclusions are listed.

This time sink is comparatively small (in a pipeline of 10'000 simple elements that took 262 seconds, 28 of those seconds were spent excluding elements), and is included because the solution is very simple.

Solution
--------

In `_pipeline.py:except_elements()`, if `except_targets` is empty, just return `elements` unchanged.
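
A minimal sketch of that early return, assuming a hypothetical signature for `except_elements()`:

    def except_elements(self, targets, elements, except_targets):
        # No exclusions requested: the traversal below would be pure
        # overhead, so hand the elements back unchanged.
        if not except_targets:
            return elements

        # ... existing exclusion logic ...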

Loading the pipeline - parsing yaml from files
==============================================

Problem
-------

The yaml parser we use (ruamel) is slow, but we have been unable to find a faster one that is capable of doing round-trips (i.e. read yaml from file, make changes to it, write out yaml with mostly the same structure).
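
For context, the round-trip behaviour we depend on looks like this (a small self-contained example of ruamel's round-trip mode):

    import io
    from ruamel.yaml import YAML

    yaml = YAML()  # round-trip mode is the default
    data = yaml.load("kind: manual  # a comment\n"
                     "variables:\n"
                     "  foo: bar\n")
    data["variables"]["foo"] = "baz"

    out = io.StringIO()
    yaml.dump(data, out)
    # The comment, key ordering and layout survive the edit:
    print(out.getvalue())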

We spend a lot of time loading yaml. In the simple pipeline of 10'000 elements that took 262 seconds, 51 of those were spent loading yaml.

Solution
--------

My proposed solution here is to cache the loaded yaml in a format that's faster to read back into memory.

Python's built-in object serialisation library, Pickle (https://docs.python.org/3.5/library/pickle.html), proved to be capable of serialising the loaded yaml and its provenance data, with one caveat: the provenance data contains ProvenanceFile objects, which hold a reference to the Project that the file belongs to.

Given that ProvenanceFile.project is currently only used for comparison, not to access any of its members, the simplest solution is to change that field to store the name of the project instead.
If that is not acceptable, I will have to:
* Write a custom pickler/unpickler that has access to a list of all the projects (`Context` stores the projects, so having a reference to the context in the pickler/unpickler will be sufficient).
* When the custom pickler encounters a ProvenanceFile, it will store the name of the project instead of the Project object.
* When the custom unpickler reconstructs a ProvenanceFile, it will look up the project that matches the name and reference that instead (see the sketch below).
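
If the custom route is needed, pickle's "persistent ID" mechanism is a natural fit. A minimal sketch, where the `get_project()` lookup on the context is an assumption about how projects could be resolved by name:

    import pickle

    from buildstream._project import Project  # internal class, shown for illustration

    class ProjectPickler(pickle.Pickler):
        def persistent_id(self, obj):
            # Swap Project references for their name; returning None
            # tells pickle to serialise everything else as usual.
            if isinstance(obj, Project):
                return ("project", obj.name)
            return None

    class ProjectUnpickler(pickle.Unpickler):
        def __init__(self, file, context):
            super().__init__(file)
            self._context = context

        def persistent_load(self, pid):
            kind, name = pid
            if kind != "project":
                raise pickle.UnpicklingError("unsupported persistent id")
            # Hypothetical lookup: resolve the Project back from the
            # context by its name.
            return self._context.get_project(name)

Dumping would then go through `ProjectPickler(f).dump(data)` and loading through `ProjectUnpickler(f, context).load()`.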

With pickling resolved one way or the other, I would then have to:
* In `_yaml.load`, consult whether the cache already has an entry for that filename (plus shortname and project), and check that the inode metadata for that path is the same as when it was cached (i.e. that the file hasn't been modified).
  - Files across junctions will need special handling, as those files don't have a persistent place on disk. Instead, they are valid as long as the junction hasn't changed.
* In `_yaml.load`, if the data was loaded from a file, write it to the cache before returning it (a sketch follows).
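
A sketch of how the cache consultation could look. The helper names and on-disk cache layout here are illustrative, not a settled design:

    import os
    import pickle

    def _cache_key(filename):
        # Identify a file by its inode metadata; if the file changes,
        # its mtime (and usually its size) changes and the cached
        # entry is treated as stale.
        st = os.stat(filename)
        return (st.st_ino, st.st_mtime_ns, st.st_size)

    def load(filename, cache_dir):
        entry = os.path.join(cache_dir, filename.replace(os.sep, "_"))
        key = _cache_key(filename)

        # Fast path: return the cached parse if it is still valid.
        try:
            with open(entry, "rb") as f:
                cached_key, data = pickle.load(f)
            if cached_key == key:
                return data
        except (OSError, pickle.UnpicklingError):
            pass

        # Slow path: parse with ruamel, then populate the cache.
        data = _load_yaml_from_file(filename)  # hypothetical helper
        with open(entry, "wb") as f:
            pickle.dump((key, data), f)
        return data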

This solution comes with some issues:
* What form should this cache take?
* Should the cache be shareable? It might be important for remote execution, but unpickling is not *safe*. A malicious file could be used to execute arbitrary code inside buildstream.

Extracting the elements' environment, variables, public data and sandbox config
===============================================================================

Problem
-------

This is the next-largest time sink: in a simple pipeline of 10'000 elements that took 262 seconds, 44 of those were spent extracting these fields for the elements.

Unfortunately, there doesn't seem to be a simple solution - the majority of the time appears to be spent on string and dict manipulation, and caching the result would be complicated by the sheer number of ways it can be affected.

For example, variables can be affected by:
* The defaults in the element's .yaml file in the source code.
* Overrides to the defaults defined in the project.conf.
* The default project.conf defined in buildstream source code.
* Overrides to the project.conf from user config.
* The default user config defined in buildstream source code.
* The element's bst file.
* Any files included by the bst file.
* Any command-line options that are specified and that the bst file uses.

I will leave this problem alone for now, and come back to it in a later iteration.

===

Thanks for reading. If you have any particular insights/opinions on what caching solution I should use, and how to deal with the potential unsafeness of pickled data, I'd be happy to hear them.

Best regards,

Jonathan.


--
Jonathan Maw, Software Engineer, Codethink Ltd.
Codethink privacy policy: https://www.codethink.co.uk/privacy.html

