[BuildStream] Issues tagged `optimization`

Hi all,

I have been going over all the [issues tagged 'optimization'][1] in the
BuildStream GitLab project with a view to determining what each of them
entails and what might be needed to make progress on them.  Sadly I'm not
experienced enough in all the relevant aspects of the BuildStream design and
source to do this fully.

This mail is, therefore, to disseminate my thoughts thus far and to encourage
others in the community to discuss and make progress on these topics.  Some
of the topics came up at the Gathering recently, and for those I hope that
this mail acts as a catalyst for writeups from those involved in the
discussions.  Writeups should occur mostly on the issue, though if longer
discussions are indicated, then please start a thread here on the ML.

If you are aware of optimisation-related issues which haven't been mentioned
below, then please update the tags on the issue and follow up to this thread
with a similar summary.

While I cannot set policy for 1.4, where an optimisation strikes me as
achievable by 1.4 I have noted as much.  Obviously I cannot say who will have
time, who will
have interest, and whether the two shall meet :-)

Going in chronological order of the issue being opened, the general
optimisation issues are:

[Track file dependencies][56] filed by Sander Striker on July 31st 2017 talks
about attempting to optimise away execution of build steps if we can determine
that all the inputs which *matter* to the step can be shown to be identical.

A lot of discussion has occurred on this issue since it was filed, including
some treatment of the approach and of how dependencies should be handled.
Tristan highlighted that we'd need internal Merkle trees in order to do this
efficiently.  I'd note that the pervasive use of CasCache has now provided
that for us.  I think there's still quite a bit of useful discussion to be
had here, particularly around ensuring correctness of caching before
implementation could or should begin.  I think we need to continue the
discussion directly, though I doubt this could be implemented by 1.4 unless
someone has a good chunk of time to dedicate.
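
To make the idea concrete, here is a minimal sketch of input-sensitive
caching in plain Python.  None of this is BuildStream API; `step.name` and
`step.run()` are invented for illustration, and a real implementation would
compare Merkle trees in the CAS rather than re-reading files from disk.

```python
# Sketch: skip a build step when a digest of the inputs it actually
# consumes is unchanged.  Helper names are illustrative only.
import hashlib


def file_digest(path):
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(65536), b''):
            h.update(chunk)
    return h.hexdigest()


def inputs_digest(paths):
    # Sort so the digest is stable regardless of traversal order
    h = hashlib.sha256()
    for path in sorted(paths):
        h.update(path.encode('utf-8'))
        h.update(file_digest(path).encode('ascii'))
    return h.hexdigest()


def maybe_run_step(step, consumed_paths, cache):
    key = (step.name, inputs_digest(consumed_paths))
    if key in cache:
        return cache[key]   # inputs which *matter* are unchanged: reuse
    result = step.run()
    cache[key] = result
    return result
```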

[Store artifact manifests in metadata][82] filed by Tristan on August 31st 2017
covers a need to not constantly walk the extracted artifact in the filesystem.
Since we've moved to CasCache and virtual directories, I think this issue
bears revisiting, with consideration given to whether that move affects the
issue described in any useful (or detrimental) way.  Angelos did some work on
the issue back when it was filed; perhaps he has some idea of whether to take
it forward now?  I imagine this can be considered before 1.4, but if further
work is needed it may not be doable in time.
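
As a rough illustration of the manifest idea (the format below is invented,
not something the issue or BuildStream specifies): build the file list once,
at artifact commit time, so later queries read a single metadata file rather
than walking the extracted tree.

```python
# Sketch: record file paths and attributes once, at commit time.
import json
import os


def build_manifest(artifact_root):
    manifest = []
    for dirpath, _dirs, files in os.walk(artifact_root):
        for name in files:
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            manifest.append({'path': os.path.relpath(path, artifact_root),
                             'size': st.st_size,
                             'mode': st.st_mode})
    return manifest


def write_manifest(artifact_root, meta_dir):
    # Later consumers load this file instead of walking artifact_root
    with open(os.path.join(meta_dir, 'manifest.json'), 'w') as f:
        json.dump(build_manifest(artifact_root), f)
```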

[Be smarter about querying summaries from remote caches][179] filed by Sam
Thursfield on Jan 4th 2018 covers issues with the old OSTree cache.  A query
was raised at the start of June by Laurence over whether this was still an
issue with the CAS-based cache.  This issue would benefit from someone
experienced in either the CAS cache or the old OSTree cache determining if it
can be closed or if it is still relevant in some sense.  I think that check
could easily be done by 1.4, though whether or not we can resolve any lingering
related issue is up in the air.

[Disable fsync and fdatasync in SafeHardlinkOps FUSE filesystem][208] was filed
by Sam Thursfield on Jan 25th 2018 and covers a _potential_ optimisation to be
had in the FUSE filesystem we use to protect the artifact cache during local
builds.  Nothing has been said on the issue since it was filed and so I cannot
judge if anyone has given it any thought.  Sam's point seems sane to me, though
I cannot say how much time would be saved.  Disabling `fsync` and `fdatasync`
in the FUSE filesystem should not harm correctness, since importing the
outputs into the CasCache is where we really need that kind of durability.
This
strikes me as something which could easily be done (or proven not necessary)
before 1.4.
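
For illustration, a no-op override might look something like the following.
This sketch uses the third-party fusepy `Operations` interface rather than
BuildStream's own FUSE layer; note that FUSE routes `fdatasync` through the
`fsync` callback with the `datasync` flag set, so one override covers both.

```python
# Sketch only: not BuildStream's actual _fuse code.
from fuse import Operations  # third-party fusepy


class SafeHardlinksSketch(Operations):
    def fsync(self, path, datasync, fh):
        # Pretend the sync succeeded.  Durability is enforced later,
        # when build outputs are imported into the artifact cache.
        return 0
```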

[Be smarter about running integration commands][464] filed by Michael Catanzaro
on July 9th 2018 considers that integration commands are often re-run even
though their inputs have not changed and thus their outputs could be cached.
In my view, this could be related to, or subsumed into, issue 56 above.  We've
spoken in the past about possibly having integration commands state what they
are sensitive to, but that requires humans to describe things which in theory
automation could discover instead of being told.  Tristan did start to look at
this in July though didn't progress far enough to warrant reporting back.  I
think we can certainly consider this when discussing 56, though I'm similarly
not optimistic that we can get to the bottom of it before 1.4.
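
If we did go the declared-sensitivity route, the cache key might be built
along these lines.  This is a toy sketch: nothing here is real BuildStream
API, and the automated discovery discussed in 56 would replace the declared
list of digests.

```python
# Sketch: key cached integration results on the command plus digests of
# whatever the command declares itself sensitive to.
import hashlib


def integration_key(command, sensitive_digests):
    h = hashlib.sha256()
    h.update(command.encode('utf-8'))
    for digest in sorted(sensitive_digests):
        h.update(digest.encode('ascii'))
    return h.hexdigest()
```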

[Optimize bst build initialization time][466] filed by Tristan Maat on July
9th 2018 discusses optimising the *pre scheduler* time of `bst build`.  The
issue has been linked to issues below about *mid scheduler* time, but the focus
here is on what happens before we launch the scheduler.  There is a tiny amount
of overlap in the construction of the element jobs but otherwise they are
fairly independent optimisation opportunities.

Work was undertaken to improve project load time by implementing a cache of
parsed YAML, since YAML loading and parsing is quite IO *and* CPU intensive.
This has improved matters somewhat, but has not resolved the issue as a whole.
Along with the recent efforts to bring the benchmarking online, we have
been looking at optimisation options, but we've not reached any conclusions as
to the next optimisation pathway yet.  If anyone has ideas, then they'd be
gratefully received I'm sure.  I hope that once some new profiles have been
generated for the initialisation of `bst build` they'll be shared and we can
discuss.  It is quite plausible that we'll continue to make improvements in
this area before 1.4 though I'm not sure we'll reach a full "resolved" state
in the remaining few months.
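
The cache works along roughly these lines (a simplified sketch: the real
implementation uses BuildStream's own loader and provenance data rather than
plain PyYAML, and the cache path here is invented).  A hit costs one hash
plus one unpickle instead of a full parse.

```python
# Sketch: cache parsed YAML keyed on a digest of the file contents.
import hashlib
import os
import pickle

import yaml  # illustrative stand-in for BuildStream's loader

CACHE_DIR = os.path.expanduser('~/.cache/bst-yaml-sketch')


def load_yaml_cached(path):
    with open(path, 'rb') as f:
        data = f.read()
    cache_file = os.path.join(CACHE_DIR, hashlib.sha256(data).hexdigest())
    try:
        with open(cache_file, 'rb') as f:
            return pickle.load(f)      # hit: no YAML parse needed
    except (OSError, pickle.PickleError):
        pass
    parsed = yaml.safe_load(data)      # miss: parse and populate
    os.makedirs(CACHE_DIR, exist_ok=True)
    with open(cache_file, 'wb') as f:
        pickle.dump(parsed, f)
    return parsed
```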

[Cache calculation trashes too much the IO][573] filed by Tiago Gomes on Aug
13th 2018 raises the point that whenever we re-scan the cache to calculate
its size we do a lot of IO and can end up evicting useful data from the page
cache.  Tiago proposes a fairly "simple" approach to resolving this, though
it will take someone experienced with the CasCache code to decide how complex
it'll turn out to be to implement.  Tiago had a go at implementing a fix in
MR 671, but that MR has been paused for quite some time, which points at the
possibility that it needs extra thought.  It'd be helpful if Tiago could
weigh in and explain what the situation is here.
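
My reading of the proposal, sketched below with invented names: keep a
running tally which is adjusted as objects are added and removed, and only
fall back to a full walk of the cache when no tally exists yet.

```python
# Sketch: incremental size accounting instead of repeated full re-scans.
import os


class CacheSizeTally:
    def __init__(self, root):
        self.root = root
        self.size = None  # unknown until the first full scan

    def full_scan(self):
        total = 0
        for dirpath, _dirs, files in os.walk(self.root):
            for name in files:
                total += os.path.getsize(os.path.join(dirpath, name))
        self.size = total
        return total

    def object_added(self, nbytes):
        if self.size is not None:
            self.size += nbytes

    def object_removed(self, nbytes):
        if self.size is not None:
            self.size -= nbytes

    def current_size(self):
        # Only this cold path does the IO-heavy walk
        return self.size if self.size is not None else self.full_scan()
```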

[CAS: Avoid double write for received blobs][678] filed by Jürg Billeter on
September 25th 2018 describes how the CasCache design ends up writing and
then re-writing the same data when fetching and importing content.  There
hasn't been much discussion, though Valentin has an MR (830) in progress
which might resolve
this issue.  With a bit of luck this will be done by 1.4.
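
The usual shape of a fix, sketched here with invented names rather than the
actual CasCache code, is to stream the incoming blob straight into a
temporary file, verify its digest, and rename it into its final
content-addressed location, so the data hits the disk exactly once.

```python
# Sketch: single-write import of a received blob into an object store.
import hashlib
import os
import tempfile


def store_blob(objects_dir, chunks, expected_digest):
    h = hashlib.sha256()
    fd, tmp_path = tempfile.mkstemp(dir=objects_dir)
    try:
        with os.fdopen(fd, 'wb') as tmp:
            for chunk in chunks:       # chunks as they arrive off the wire
                h.update(chunk)
                tmp.write(chunk)
        if h.hexdigest() != expected_digest:
            raise ValueError('digest mismatch')
        final = os.path.join(objects_dir, expected_digest)
        os.rename(tmp_path, final)     # atomic on the same filesystem
        return final
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```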

[Queues do too much work during scheduling][703] filed by myself on October
10th 2018 talks about how we do a *lot* of repeated work when processing queues
of jobs.  When a particular queue has a lot of elements which are not yet ready
we do a lot of checking whenever *any* job completes.  A brief discussion with
Tristan revealed that the internal structure of how elements track their state
was written to be clear and simple at the expense of having to recheck things a
lot.  A more intricate but more efficient approach based on events was
considered, but at the time it was not thought to be worthwhile.  Sadly, the
fundamental state of the elements themselves needs to be exactly correct, and
going over the codebase to ensure that every place where state should be
updated does so correctly will be a lot of work.  I think we could discuss
and come up with a plan this year, though I am doubtful there'll be enough
time to implement it for 1.4 unless Tristan or Jürg takes this on, simply
because of the intricacies.
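
To illustrate the difference in shape (both snippets are schematic, not
BuildStream code, and all the names are invented): today we effectively poll
every waiting element whenever any job finishes, whereas the event-based
alternative would only touch the reverse dependencies of the element whose
state just changed.

```python
# Polling: O(waiting elements) of work per completed job
def on_job_complete_polling(queue):
    for element in queue.waiting:
        if element.is_ready():          # recomputes state every time
            queue.enqueue(element)


# Event-driven: only elements whose readiness might have changed
def on_element_state_changed(element, queue):
    for rdep in element.reverse_dependencies:
        if rdep.is_ready():
            queue.enqueue(rdep)
```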

[BuildStream spends a long time pulling/looking to pull before doing anything useful][712]
filed by James Ennis on October 15th 2018 talks about how attempted cache
fetches can delay source fetching for an unreasonably long time on larger
projects with empty or unhelpful remote artifact caches.  Jürg apparently
expressed a thought that
this ought to be simple to fix, and I encourage him to comment on the issue (or
here on the ML) with details so that we can proceed.  I think that, providing
Jürg's idea bears fruit, this is resolvable by 1.4.

[1] https://gitlab.com/BuildStream/buildstream/issues?label_name%5B%5D=Optimization
[56] https://gitlab.com/BuildStream/buildstream/issues/56
[82] https://gitlab.com/BuildStream/buildstream/issues/82
[179] https://gitlab.com/BuildStream/buildstream/issues/179
[208] https://gitlab.com/BuildStream/buildstream/issues/208
[464] https://gitlab.com/BuildStream/buildstream/issues/464
[466] https://gitlab.com/BuildStream/buildstream/issues/466
[573] https://gitlab.com/BuildStream/buildstream/issues/573
[678] https://gitlab.com/BuildStream/buildstream/issues/678
[703] https://gitlab.com/BuildStream/buildstream/issues/703
[712] https://gitlab.com/BuildStream/buildstream/issues/712

In addition, there is one optimisation tagged issue which is really about
benchmarking in a more general sense:

[Performance monitoring][205] raised by Sam Thursfield on January 25th 2018
talks about benchmarking and profiling.  At least some of this work has been
done, though I don't think we meet the full intent of this issue even if we
meet the letter of the goals.  Jim also raised this point on the issue,
though it was never responded to and the issue was never closed.  We've
recently revived the benchmark reporting
effort, and Lachlan is working on a gitlab.io instance which we will be able
to look at to get an idea of benchmark performance over time.  We can certainly
look again at the issue and decide what to do before 1.4.

[205] https://gitlab.com/BuildStream/buildstream/issues/205

Thanks, and sorry it's so long a mail,

D.

-- 
Daniel Silverstone                          https://www.codethink.co.uk/
Solutions Architect               GPG 4096/R Key Id: 3CCE BABE 206C 3B69

