Re: [BuildStream] Speeding up buildstream init



Hi Jonathan,

First of all, thanks for the detailed write up.

On Thu, Aug 16, 2018 at 1:21 PM Jonathan Maw via BuildStream-list <buildstream-list gnome org> wrote:
Over the past week or so, I've been looking at speeding up buildstream
startup times (raised in
https://gitlab.com/BuildStream/buildstream/issues/466).

After spending some time with a profiler, I can divide up the points of
slowdown as follows:
* Calculating the artifact cache size on startup
* Loading the pipeline
   - Reconstructing the pipeline after excluding specified elements
   - Loading the elements
     - Parsing yaml from files
     - Extracting the elements' environment
     - Extracting the elements' variables (also initializing a Variables
object)
     - Extracting the elements' public data
     - Extracting the elements' sandbox config

Below is a detailed analysis of each point of slowdown, with a
description of the problem and a proposed solution.

Calculating the artifact cache size on startup
==============================================

Problem
-------

Currently, buildstream/_context.py has to recursively search through the
artifact cache to calculate its size. This needs to be done to generate
the artifact cache quota, so that buildstream can remove unused
artifacts from the cache if it comes close to running out of space.

Previously, manually setting an artifact cache quota would skip this,
but that is no longer the case, as buildstream now checks that the cache
quota is sensible.

This timesink is extremely apparent to a lot of users, and buildstream
gives no indication of what is happening while it is going on.
The amount of time it takes does not scale directly with the size of the
pipeline, but in a pipeline of 10'000 simple elements which took 262
seconds, 60 of those seconds were spent here.

That is extremely significant (~23%) for something that gives the user no value until they are confronted with a full disk.
 
Solution
--------

Broadly, the solution is to write to a file the artifact cache size, and
read that on boot. If this artifact cache size diverges from reality
then it will be corrected when the reported cache size approaches the
quota size.

Specifically, this will involve:
* Restructure Context to not calculate the artifact cache size
   - It will store the quota size defined in config
   - The actual quota size will be calculated and stored by the artifact
cache
   - The scheduler will read quota size from the artifact cache instead
of context.
* The artifact cache will write to disk its size whenever its size is
set internally
* If the artifact cache does not already know its size, it will read the
size from a file instead of calculating it.
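Roughly, the read/write side of that size file could look like the following. This is only a sketch; the class and method names (`ArtifactCacheSizeFile`, `_calculate_cache_size`, and so on) are made up for illustration and are not BuildStream's actual API:

```python
import os

class ArtifactCacheSizeFile:
    """Persist the computed artifact cache size so startup can skip
    the recursive directory walk. All names here are illustrative."""

    def __init__(self, cache_dir):
        self.cache_dir = cache_dir
        self.size_file = os.path.join(cache_dir, "cache_size")
        self._size = None

    def get_size(self):
        # Prefer the in-memory value, then the stored file; only fall
        # back to the expensive recursive calculation if neither exists.
        if self._size is not None:
            return self._size
        try:
            with open(self.size_file, "r") as f:
                self._size = int(f.read().strip())
        except (FileNotFoundError, ValueError):
            self._size = self._calculate_cache_size()
            self.set_size(self._size)
        return self._size

    def set_size(self, size):
        # Write the size to disk whenever it is set internally, so the
        # next invocation can read it back cheaply.
        self._size = size
        with open(self.size_file, "w") as f:
            f.write(str(size))

    def _calculate_cache_size(self):
        # Placeholder for the existing recursive walk over the cache.
        total = 0
        for root, _, files in os.walk(self.cache_dir):
            for name in files:
                total += os.path.getsize(os.path.join(root, name))
        return total
```

If the stored size then diverges from reality, it gets corrected as described above, once the reported size approaches the quota.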

Sounds reasonable.
 
Loading the pipeline - excluding elements
=========================================

Problem
-------

On longer pipelines, a comparatively large amount of time is spent on
excluding
elements, even if no exclusions are listed.

This timesink is comparatively small (in a pipeline of 10'000 simple
elements that took 262 seconds, it took 28 seconds), and is included
because the solution is very simple.

28 seconds is still ~11% of the time, which is a good chunk.
 
Solution
--------

In `_pipeline.py:except_elements()`, if `except_targets` is empty, just
return `elements`
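That is, just a guard clause at the top of the function. A sketch (the real `_pipeline.py:except_elements()` also prunes the dependency graph; the body below is only a stand-in):

```python
def except_elements(elements, except_targets):
    """Sketch of the proposed fast path. The real method in
    _pipeline.py also reconstructs the pipeline graph; this stand-in
    only illustrates the early return."""
    if not except_targets:
        # Fast path: no exclusions requested, so return the input
        # untouched instead of rebuilding the pipeline.
        return elements
    # ... the existing (expensive) exclusion logic would run here ...
    excluded = set(except_targets)
    return [e for e in elements if e not in excluded]
```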

That's... shocking.  But let's take it :).
 
Loading the pipeline - parsing yaml from files
==============================================

Problem
-------

The yaml parser we use (ruamel) is slow, but we have been unable to find
a faster one that is capable of doing round-trips (i.e. read yaml from
file, make changes to it, write out yaml with mostly the same
structure).

Do we need the round tripping in the common case (bst build)?  The reason I ask is that I assume the number of reads will well outweigh the number of writes, and only in the case of writes do we want to do the actual round tripping.  What are the cases other than track where we are doing round tripping?

Using two ways to deal with the yaml files may introduce unwanted complexity; I'm trying to understand how much of that we would actually face.  Or whether it would be relatively contained.
However, going into detail on that will only make sense if we have a yaml parser that is significantly faster than ruamel.
 
We spend a lot of time loading yaml. In the simple pipeline of 10'000
elements that took 262 seconds, 51 of those were spent loading yaml.

That's a lot (~19%).  What happens to that time with a pipeline that is an order of magnitude bigger?
 
Solution
--------

My proposed solution here is to cache the loaded yaml in a format that's
faster to read back into memory.

Python's built-in object serialisation library, Pickle
(https://docs.python.org/3.5/library/pickle.html) proved to be capable
of serialising the loaded yaml and its provenance data, with one caveat
- inside the provenance data are ProvenanceFile objects, which contain a
reference to the project this file is inside.

Given that ProvenanceFile.project is currently only used for comparison,
not to access any of its members, the simplest solution is to change
that to the name of the project.
If that is not acceptable, I will have to:
* Write a custom pickler/unpickler that has access to a list of all the
projects (`Context` stores the projects, so having a reference to the
context in the pickler/unpickler will be sufficient)
* When the custom pickler tries to work on a ProvenanceFile, it will
store the name of the project instead of the Project object.
* When the custom unpickler tries to reconstruct a ProvenanceFile, it
will look up the project that matches the name and reference that
instead.
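For what it's worth, the standard pickle module already has a hook for exactly this pattern: `persistent_id` on the pickler and `persistent_load` on the unpickler. A sketch of how the custom pickler/unpickler could be wired up (the `Project` and `ProvenanceFile` classes below are stand-ins for the real ones, and the name-based lookup assumes project names are unique):

```python
import io
import pickle

class Project:
    """Stand-in for BuildStream's Project (assumed to have a unique name)."""
    def __init__(self, name):
        self.name = name

class ProvenanceFile:
    """Stand-in for the real ProvenanceFile, which references a Project."""
    def __init__(self, project):
        self.project = project

class ProjectPickler(pickle.Pickler):
    # Replace Project references with a persistent id (the project
    # name), so the pickle stream never embeds the Project object.
    def persistent_id(self, obj):
        if isinstance(obj, Project):
            return ("project", obj.name)
        return None  # everything else is pickled normally

class ProjectUnpickler(pickle.Unpickler):
    # Look the name back up in a mapping supplied by the caller,
    # mirroring "a reference to the context in the unpickler".
    def __init__(self, stream, projects_by_name):
        super().__init__(stream)
        self.projects_by_name = projects_by_name

    def persistent_load(self, pid):
        tag, name = pid
        if tag == "project":
            return self.projects_by_name[name]
        raise pickle.UnpicklingError("unknown persistent id: %r" % (pid,))

def roundtrip(provenance, projects_by_name):
    buf = io.BytesIO()
    ProjectPickler(buf).dump(provenance)
    buf.seek(0)
    return ProjectUnpickler(buf, projects_by_name).load()
```

After unpickling, `ProvenanceFile.project` points at the live Project object again rather than a copy, so identity comparisons still work.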

With pickling resolved one way or the other, I would then have to:
* In `_yaml.load`, check whether the cache already has an entry for
that filename (plus shortname and project), and whether the inode
metadata for that path is the same as when it was cached (i.e. that the
file hasn't been modified)
   - Files across junctions will need special handling, as those files
don't have a persistent place on disk. Instead, they are valid as long
as the junction hasn't changed.
* In `_yaml.load`, if the data was loaded from a file, write it to the
cache before returning it.
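The two bullets above could be sketched roughly like this (illustrative only: `load_yaml_cached` and the mtime-plus-size cache key are assumptions, not the actual `_yaml.load` design, and junction handling is left out):

```python
import os
import pickle

def load_yaml_cached(filename, cache_dir, parse_fn):
    """Sketch of the proposed caching around _yaml.load.
    parse_fn stands in for the existing (slow) ruamel parse."""
    st = os.stat(filename)
    # Key on the path plus inode metadata, so a modified file
    # automatically misses the cache and gets re-parsed.
    cache_key = "%s-%d-%d" % (filename.replace(os.sep, "_"),
                              st.st_mtime_ns, st.st_size)
    cache_path = os.path.join(cache_dir, cache_key + ".pickle")

    if os.path.exists(cache_path):
        # Cache hit: deserialising pickle is much cheaper than
        # re-parsing the yaml with ruamel.
        with open(cache_path, "rb") as f:
            return pickle.load(f)

    # Cache miss: parse the file, then write the result to the cache
    # before returning it.
    data = parse_fn(filename)
    with open(cache_path, "wb") as f:
        pickle.dump(data, f)
    return data
```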

This solution comes with some issues:
* What form should this cache take?
* Should the cache be shareable? It might be important for remote
execution, but unpickling is not *safe*. A malicious file could be used
to insert arbitrary code into buildstream.

I don't think it is a requirement to be shareable.  I don't think remote execution is more significantly impacted than anything else.
 
Now, that said, it might be interesting to investigate how Bazel is solving this problem.  They have a similar graph problem with a larger number of vertices (as they are looking at finer grained translation units).  Now I haven't looked into it, my impression from conversations is that they use a combination of a persistent process after first launch, and serializing the graph to disk when that process is terminated.

Extracting the elements' environment, variables, public data and sandbox
config
===============================================================================

Problem
-------

This is the next-largest time-sink, and in a simple pipeline of 10'000
elements that took 262 seconds, 44 of those were spent on extracting
those fields for the element.

Unfortunately, there doesn't seem to be a simple solution - the majority
of it appears to be string and dict manipulation, and caching the result
would be complicated by the sheer number of ways that it can be
affected.

I think that in the case of an edit-compile cycle, neither the structure of the .bst files nor the configuration in project.conf will vary much.  The churn will be in the code (a workspaced element).  I think that even caching for the case where nothing has changed would result in a benefit, as consecutive runs of bst build would start up quickly.  Am I missing something there?
 
For example, variables can be affected by:
* The defaults in the element's .yaml file in the source code.
* Overrides to the defaults defined in the project.conf.
* The default project.conf defined in buildstream source code.
* Overrides to the project.conf from user config.
* The default user config defined in buildstream source code.
* The element's bst file.
* Any files included by the bst file.
* Any command-line options specified that the bst file uses.

I will leave this problem alone for now, and come back to it in a later
iteration.

===

Thanks for reading. If you have any particular insights/opinions on what
caching solution I should use, and how to deal with the potential
unsafeness of pickled data, I'd be happy to hear them.

Thanks again for putting this together.
 
Best regards,

Jonathan.

Cheers,

Sander
 
--
Jonathan Maw, Software Engineer, Codethink Ltd.
Codethink privacy policy: https://www.codethink.co.uk/privacy.html
_______________________________________________
BuildStream-list mailing list
BuildStream-list gnome org
https://mail.gnome.org/mailman/listinfo/buildstream-list

