[BuildStream] Speeding up buildstream init
- From: Jonathan Maw <jonathan maw codethink co uk>
- To: buildstream-list gnome org
- Subject: [BuildStream] Speeding up buildstream init
- Date: Thu, 16 Aug 2018 12:21:22 +0100
Over the past week or so, I've been looking at speeding up buildstream
startup times (raised in
https://gitlab.com/BuildStream/buildstream/issues/466).
After spending some time with a profiler, I can divide up the points of
slowdown as follows:
* Calculating the artifact cache size on startup
* Loading the pipeline
  - Reconstructing the pipeline after excluding specified elements
  - Loading the elements
    - Parsing yaml from files
    - Extracting the elements' environment
    - Extracting the elements' variables (also initializing a
      Variables object)
    - Extracting the elements' public data
    - Extracting the elements' sandbox config
Below is a detailed analysis of each point of slowdown, with a
description of the problem and a proposed solution.
Calculating the artifact cache size on startup
==============================================
Problem
-------
Currently, buildstream/_context.py has to recursively search through the
artifact cache to calculate its size. This needs to be done to generate
the artifact cache quota, so that buildstream can remove unused
artifacts from the cache if it comes close to running out of space.
Previously, manually setting an artifact cache quota would skip this,
but that is no longer the case, as buildstream now checks that the cache
quota is sensible.
This timesink is extremely apparent to a lot of users, and buildstream
gives no indication of what is happening while it is going on.
The amount of time it takes does not scale directly with the size of the
pipeline, but in a pipeline of 10'000 simple elements which took 262
seconds to load, 60 of those seconds were spent here.
Solution
--------
Broadly, the solution is to write the artifact cache size to a file and
read it back on startup; a rough sketch follows the list below. If the
recorded size diverges from reality, it will be corrected when the
reported cache size approaches the quota size.
Specifically, this will involve:
* Restructure Context to not calculate the artifact cache size
- It will store the quota size defined in config
- The actual quota size will be calculated and stored by the artifact
cache
- The scheduler will read quota size from the artifact cache instead
of context.
* The artifact cache will write its size to disk whenever the size is
  set internally
* If the artifact cache does not already know its size, it will read the
size from a file instead of calculating it.
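To make that concrete, here is a rough sketch of what I have in mind
for the artifact cache side. The file name, the `artifactdir` attribute
and the method names are illustrative only, not the final API:

    import os

    class ArtifactCache():
        SIZE_FILE = "cache_size"     # illustrative file name

        def __init__(self, context):
            self.context = context
            self._cache_size = None  # unknown until read or calculated

        def get_cache_size(self):
            # Prefer the in-memory value, then the persisted file, and
            # only fall back to the expensive recursive walk.
            if self._cache_size is None:
                self._cache_size = self._read_cache_size()
            if self._cache_size is None:
                self.set_cache_size(self._calculate_cache_size())
            return self._cache_size

        def set_cache_size(self, size):
            # Called whenever the size changes internally (e.g. after
            # committing or pruning an artifact); persist it immediately.
            self._cache_size = size
            with open(self._size_file_path(), "w") as f:
                f.write(str(size))

        def _read_cache_size(self):
            # Returns None if no size has been recorded yet.
            if not os.path.isfile(self._size_file_path()):
                return None
            with open(self._size_file_path(), "r") as f:
                return int(f.read())

        def _calculate_cache_size(self):
            # The existing recursive walk over the artifact directory.
            total = 0
            for root, _, files in os.walk(self.context.artifactdir):
                for name in files:
                    total += os.path.getsize(os.path.join(root, name))
            return total

        def _size_file_path(self):
            return os.path.join(self.context.artifactdir, self.SIZE_FILE)

With this, the recursive walk only happens the first time a given
artifact cache is used (or if the size file is ever deleted by hand).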
Loading the pipeline - excluding elements
=========================================
Problem
-------
On longer pipelines, a comparatively large amount of time is spent on
excluding elements, even if no exclusions are listed.
This timesink is comparatively small (in the pipeline of 10'000 simple
elements that took 262 seconds, it took 28 seconds), but is included
because the solution is very simple.
Solution
--------
In `_pipeline.py:except_elements()`, if `except_targets` is empty, just
return `elements` immediately.
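Assuming a signature roughly like the current one (the exact arguments
may differ), the change is just an early-return guard at the top of the
method:

    def except_elements(self, targets, elements, except_targets):
        # If the user listed nothing to exclude, there is nothing to
        # compute: hand the element list back untouched.
        if not except_targets:
            return elements

        # ... existing exclusion logic continues unchanged ...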
Loading the pipeline - parsing yaml from files
==============================================
Problem
-------
The yaml parser we use (ruamel) is slow, but we have been unable to find
a faster one that is capable of doing round-trips (i.e. reading yaml
from a file, making changes to it, and writing the yaml back out with
mostly the same structure).
We spend a lot of time loading yaml. In the simple pipeline of 10'000
elements that took 262 seconds, 51 of those were spent loading yaml.
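For reference, the round-trip requirement is what ties us to ruamel: its
round-trip mode preserves comments and ordering when yaml is loaded,
modified and dumped again, which the faster parsers we looked at cannot
do. (The file name and keys below are made up for illustration.)

    import sys
    from ruamel.yaml import YAML

    yaml = YAML()                 # defaults to round-trip ('rt') mode
    with open("element.bst") as f:
        data = yaml.load(f)

    data["variables"]["prefix"] = "/opt"   # make a change
    yaml.dump(data, sys.stdout)   # comments and ordering are preserved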
Solution
--------
My proposed solution here is to cache the loaded yaml in a format that's
faster to read back into memory.
Python's built-in object serialisation library, Pickle
(https://docs.python.org/3.5/library/pickle.html) proved to be capable
of serialising the loaded yaml and its provenance data, with one caveat
- inside the provenance data are ProvenanceFile objects, which contain a
reference to the project this file is inside.
Given that ProvenanceFile.project is currently only used for comparison,
and not to access any of its members, the simplest solution is to change
that field to hold the name of the project instead.
If that is not acceptable, I will have to do the following (a sketch
follows this list):
* Write a custom pickler/unpickler that has access to a list of all the
projects (`Context` stores the projects, so having a reference to the
context in the pickler/unpickler will be sufficient)
* When the custom pickler tries to work on a ProvenanceFile, it will
store the name of the project instead of the Project object.
* When the custom unpickler tries to reconstruct a ProvenanceFile, it
will look up the project that matches the name and reference that
instead.
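Pickle's persistent_id / persistent_load hooks are designed for exactly
this kind of external reference, so the custom pickler/unpickler would
look roughly like this (`Project` stands for buildstream's Project
class, and `context.get_projects()` is an assumed accessor for the
loaded projects):

    import pickle

    class BstPickler(pickle.Pickler):
        def __init__(self, file, context):
            super().__init__(file)
            self._context = context

        def persistent_id(self, obj):
            # Swap any Project reference for its name; returning None
            # tells pickle to serialise the object as normal.
            if isinstance(obj, Project):
                return ("project", obj.name)
            return None

    class BstUnpickler(pickle.Unpickler):
        def __init__(self, file, context):
            super().__init__(file)
            self._context = context

        def persistent_load(self, pid):
            kind, name = pid
            if kind == "project":
                # Resolve the name back to the live Project object
                for project in self._context.get_projects():
                    if project.name == name:
                        return project
            raise pickle.UnpicklingError(
                "Could not resolve persistent id: {}".format(pid))

    # Usage sketch (buf could be an open cache file or io.BytesIO()):
    #   BstPickler(buf, context).dump(loaded_yaml)
    #   buf.seek(0)
    #   restored = BstUnpickler(buf, context).load()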
With pickling resolved one way or the other, I would then have to:
* In `_yaml.load`, I will check whether the cache already has an entry
for that filename (plus shortname and project), and whether the inode
metadata for that path is the same as when it was cached (i.e. that the
file hasn't been modified)
- Files across junctions will need special handling, as those files
don't have a persistent place on disk. Instead, they are valid as long
as the junction hasn't changed.
* In `_yaml.load`, if the data was loaded from a file, write it to the
cache before returning it (a rough sketch of this lookup and store
follows the list).
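Something along these lines is what I have in mind for the cache lookup
and store; the cache directory layout, key scheme and metadata fields
are placeholders, and the real version would also key on shortname and
project, use the custom pickler above, and special-case files behind
junctions:

    import os
    import pickle

    def load_cached(filename, cachedir):
        # A cache entry records the mtime and size we saw when the file
        # was parsed, plus the pickled parse result.
        entry = _entry_path(filename, cachedir)
        if not os.path.isfile(entry):
            return None

        st = os.stat(filename)
        with open(entry, "rb") as f:
            cached = pickle.load(f)

        # Only trust the entry if the file looks unmodified.
        if (cached["mtime"] == st.st_mtime_ns and
                cached["size"] == st.st_size):
            return cached["contents"]
        return None

    def store_cached(filename, cachedir, contents):
        st = os.stat(filename)
        with open(_entry_path(filename, cachedir), "wb") as f:
            pickle.dump({"mtime": st.st_mtime_ns,
                         "size": st.st_size,
                         "contents": contents}, f)

    def _entry_path(filename, cachedir):
        # Illustrative key scheme only: flatten the path into a name.
        return os.path.join(cachedir,
                            filename.replace(os.sep, "_") + ".pickle")

`_yaml.load` would call load_cached() first, and fall through to the
real parse (followed by store_cached()) on a miss.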
This solution comes with some issues:
* What form should this cache take?
* Should the cache be shareable? Sharing might be important for remote
execution, but unpickling is not *safe* - a malicious file could be used
to insert arbitrary code into buildstream.
Extracting the elements' environment, variables, public data and sandbox
config
===============================================================================
Problem
-------
This is the next-largest timesink: in the simple pipeline of 10'000
elements that took 262 seconds, 44 of those were spent on extracting
these fields for the elements.
Unfortunately, there doesn't seem to be a simple solution - the majority
of it appears to be string and dict manipulation, and caching the result
would be complicated by the sheer number of ways that it can be
affected.
For example, variables can be affected by:
* The defaults in the element's .yaml file in the source code.
* Overrides to the defaults defined in the project.conf.
* The default project.conf defined in buildstream source code.
* Overrides to the project.conf from user config.
* The default user config defined in buildstream source code.
* The element's bst file.
* Any files included by the bst file.
* Any command-line options specified that the bst file uses.
I will leave this problem alone for now, and come back to it in a later
iteration.
===
Thanks for reading. If you have any particular insights/opinions on what
caching solution I should use, and how to deal with the potential
unsafeness of pickled data, I'd be happy to hear them.
Best regards,
Jonathan.
--
Jonathan Maw, Software Engineer, Codethink Ltd.
Codethink privacy policy: https://www.codethink.co.uk/privacy.html