[BuildStream] Some bad mistakes, lets understand them, and make this work better
- From: Tristan Van Berkom <tristan vanberkom codethink co uk>
- To: BuildStream <buildstream-list gnome org>
- Subject: [BuildStream] Some bad mistakes, lets understand them, and make this work better
- Date: Tue, 11 Sep 2018 17:03:54 +0900
TL;DR: We had a regression which causes the artifact cache to get
corrupted[1], and fixing it was extremely difficult because
developers have been making some pretty bad mistakes which are
landing on master, as outlined in this email. Inviting discussion
on how we can improve things.
Hi all,
I'm writing to you today because I see evidence that we are not
doing well enough in review: we are rushing things in too quickly,
whether it is a new feature, a frantic bug fix, or a performance
improvement for our stable release. We have a lot of developers on
hand now, so I don't think we have an excuse.
In the interest of improving things, I am going to call out a bunch of
examples of things which have landed on master in recent months, which
really should never have gone as far as landing on master.
Let me be clear: this is not intended as a rant, and I don't intend to
place blame on specific people; I am as much to blame as the rest of
us, as some of my examples involve code which I did not take enough
time to review. I think we all have to slow our roll here and be more
careful.
Also, especially to those of us who are more experienced: we have to
spend more time reviewing code than writing code; we need to set a
good example, otherwise peer review really doesn't work and ends up
being just a checkmark that somebody else has looked at the code.
Since we switched to aggressively granting commit access almost 2
months ago[0], we have been allowing some unacceptable things to land,
along with some things which could be better. On the surface, we have
improved velocity, and while this checks a lot of boxes, it is not
worthwhile if our standards of quality go completely off the rails,
leaving us with a dead-ended codebase.
One thing I have to stress: when you perform a review, first ask
yourself these questions:
* Does this patch make the codebase simpler than it was before ?
* Is the newly added code going to be easy for the next developer to
read ?
* Does this patch circumvent encapsulation of logic in modules ?
* Does this patch add transient state to object instances instead of
properly propagating transient state through function calls ?
Before even considering whether the code works, I want to know that
it *makes sense*. Bugs are easy to fix, and regressions easy to test
for, but bandaid patches are not acceptable. If your two line patch
breaks encapsulation and inserts logic at an inappropriate place, it
is probably because the surrounding architecture did not account for
what you intend - this means your patch must *modify* the surrounding
code structure in the same patch. Even if this turns a two line patch
into a 100 line patch, it is the appropriate course of action: the
code base must always move forward as a whole, and never bit rot due
to additions of bandaids over bandaids.
So, I'd like for us to discuss how we can go about doing better, and to
that end I will illustrate some of the problems I encountered this
weekend, while trying to solve the crashes and corrupted artifact
caches which are essentially a result of concurrent commit/prune[1].
A series of ugly stories
========================
Introduction of local private methods, called from outside
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The _add_artifact_size() and _set_cache_size() functions were added
as local private methods to ArtifactCache():
https://gitlab.com/BuildStream/buildstream/blob/886a836effa2ddaf740a1281487356e775d0c4f0/buildstream/_artifactcache/artifactcache.py#L538
But these private methods are called *outside* of ArtifactCache !
https://gitlab.com/BuildStream/buildstream/blob/886a836effa2ddaf740a1281487356e775d0c4f0/buildstream/_scheduler/jobs/cachesizejob.py#L33
https://gitlab.com/BuildStream/buildstream/blob/886a836effa2ddaf740a1281487356e775d0c4f0/buildstream/_scheduler/jobs/cleanupjob.py#L33
https://gitlab.com/BuildStream/buildstream/blob/886a836effa2ddaf740a1281487356e775d0c4f0/buildstream/_scheduler/queues/buildqueue.py#L98
The _check_cache_size_real() private function was added to the
Scheduler(), without even any API documenting comment:
https://gitlab.com/BuildStream/buildstream/blob/886a836effa2ddaf740a1281487356e775d0c4f0/buildstream/_scheduler/scheduler.py#L331
And called from the PullQueue:
https://gitlab.com/BuildStream/buildstream/blob/886a836effa2ddaf740a1281487356e775d0c4f0/buildstream/_scheduler/queues/pullqueue.py#L65
Either this was originally added this way, or someone came around and
found that they needed to call these methods later and recklessly
called them without making them public first.
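To illustrate the shape of the problem (a rough sketch only; the
names, signatures and bodies here are simplified stand-ins, not the
actual BuildStream code):

    class ArtifactCache():

        # _set_cache_size()
        #
        # The single leading underscore marks this as private to
        # ArtifactCache; nothing outside this class should call it.
        #
        def _set_cache_size(self, cache_size):
            self._cache_size = cache_size


    class CleanupJob():

        def child_process(self):
            # WRONG: this module reaches across the API boundary
            # and calls another object's private method.
            self._artifacts._set_cache_size(0)

If a caller outside the class genuinely needs such a method, the
method has to be promoted to (internal) public API first, with a
documenting comment, so that the widened contract is reviewed
consciously rather than by accident.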
Completely broken encapsulation of ArtifactCache
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Right at the top of ArtifactCache, notice that some things which
pretty obviously should be local state are declared as public
variables:
https://gitlab.com/BuildStream/buildstream/blob/886a836effa2ddaf740a1281487356e775d0c4f0/buildstream/_artifactcache/artifactcache.py#L81
Especially: estimated_size, cache_size, cache_quota,
cache_lower_threshold.
Let's set aside for a moment the fact that, when you take a step
back and consider the approach of estimating the size and
recalculating it when you reach the threshold, you never needed to
have estimated_size and cache_size as separate variables (which,
believe me, confuses things a lot)... let's concentrate only on the
fact that there is no encapsulation *at all* here.
Why is encapsulation important ?
Simply put: we want to localize the business logic of given
"components" in a complex system, so that we can correct defects and
make enhancements without having to digest the entire complex system
in one sitting. That is why we have public/private internal API
contracts, and why we create modular components and carefully
consider "Who is responsible for doing what ?".
As a developer, you should have the right to fix a defect in the
ArtifactCache business logic layer while only having to look at the
ArtifactCache itself - and you would, if its API contract were
properly defined and its internals properly sealed up.
Let's take a closer look at the side effects this had:
CasCache offenses:
Directly modifies state in its calculate_cache_size()
implementation:
https://gitlab.com/BuildStream/buildstream/blob/886a836effa2ddaf740a1281487356e775d0c4f0/buildstream/_artifactcache/cascache.py#L532
Sets cache_size to None in commit():
https://gitlab.com/BuildStream/buildstream/blob/886a836effa2ddaf740a1281487356e775d0c4f0/buildstream/_artifactcache/cascache.py#L125
To be clear: No. Subclasses do not have more rights to intrude on
their parent class than anything else; everything must go through a
well defined mutator at least.
Accessors are usually preferred, but if an instance variable is
declared publicly, feel free to *READ* it, but *NEVER* write it.
Note that in the above, the second offense of setting it to None is
just crazy: it actually has a meaning and causes an intentional
side effect, which is to force a recalculation in the next round, but
there is not even any documentation or heavy duty comment anywhere
about what this does.
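As a rough sketch of what a well defined mutator could look like
here (illustrative only, not a prescription of the final API): the
state stays private, and the "None means force a recalculation"
intent is documented in exactly one place:

    class ArtifactCache():

        def __init__(self, context):
            # Private state: only ArtifactCache business logic
            # is allowed to touch this.
            self._cache_size = None

        # set_cache_size()
        #
        # The single sanctioned way to record a new cache size.
        #
        # Args:
        #    cache_size (int|None): The new cache size, or None to
        #                           force a recalculation on the
        #                           next query.
        #
        def set_cache_size(self, cache_size):
            self._cache_size = cache_size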
Queue base class goes ahead and sets the ArtifactCache.cache_size:
https://gitlab.com/BuildStream/buildstream/blob/886a836effa2ddaf740a1281487356e775d0c4f0/buildstream/_scheduler/queues/queue.py#L305
Then, there are other instances where ArtifactCache.cache_quota is
checked directly in the scheduler which are less offensive but still
cause brain damage.
The moral of this story of encapsulation specifically, is that by
breaking it you are just making everybody's life difficult. Now you
have:
o Two separate variables to store the cache size
o Various files which read and even *write* the cache size
Good luck fixing a bug with this situation, or understanding the
author's intent as to how this was supposed to work in the first
place. I was unable to fix issue 623 at all without first reducing
this to something comprehensible, and it took me the whole weekend.
I should add: Tiago Gomes had done some of the same refactorings
I did in his branch, and I would like to take this opportunity to
apologize again in public for having overlooked his refactors; I was
in a frenzy to fix 623 and didn't take the time.
Broken / unclear API contract in ArtifactCache
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Previously we had multiple artifact cache implementations; now we
have only one implementation, the CasCache.
Currently the abstract class is still in place and continues to
manage higher level business logic unrelated to the actual cache
routines, which is quite alright. But don't forget that CasCache
remains an implementation of the ArtifactCache.
The ArtifactCache offers a series of virtual methods to be
implemented by its subclass, the delegated routines; this includes
ArtifactCache.remove():
https://gitlab.com/BuildStream/buildstream/blob/886a836effa2ddaf740a1281487356e775d0c4f0/buildstream/_artifactcache/artifactcache.py#L349
Notice how it takes one argument and is not expected to return
anything ?
Well, CasCache goes ahead and violates this:
https://gitlab.com/BuildStream/buildstream/blob/886a836effa2ddaf740a1281487356e775d0c4f0/buildstream/_artifactcache/cascache.py#L564
First of all, we do not document abstract methods in more than one
place; we only document the method where it is initially declared.
The subclasses which implement these methods group them together in
a section of their class and document the group as something like
"ArtifactCache methods", so that any reader of the subclass knows
where to look for the API documenting comments.
In this case, CasCache has *redocumented* its version of the
remove() method which it implements on behalf of ArtifactCache, and
uses a *different signature*.
To make things more confusing, the ArtifactCache class which declares
the delegate remove() function *actually observes the return value*,
even though the ArtifactCache declaration of remove() clearly states
that it does not know anything about any return value:
https://gitlab.com/BuildStream/buildstream/blob/886a836effa2ddaf740a1281487356e775d0c4f0/buildstream/_artifactcache/artifactcache.py#L257
Obviously, the documentation comment needs to be removed from
CasCache and the actual abstract method needs to have its
documentation updated to reflect the real nature of the delegate
routine.
Further, seeing as the CasCache itself adds a "defer_prune" argument
to the signature, that means that "pruning" becomes a concept known
to the ArtifactCache. That means that ArtifactCache itself must now
offer an ArtifactCache.prune() virtual method for its subclasses to
potentially implement.
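To sketch what the corrected contract might look like (illustrative
only, including the assumed return value semantics; the ImplError
raising mirrors how the other abstract methods in ArtifactCache are
declared):

    class ArtifactCache():

        # remove()
        #
        # Removes the given artifact from the local cache.
        #
        # Args:
        #    ref (str): The ref of the artifact to remove
        #
        # Returns:
        #    (int): The amount of disk space recovered, in bytes
        #
        def remove(self, ref):
            raise ImplError("Cache '{kind}' does not implement remove()"
                            .format(kind=type(self).__name__))

        # prune()
        #
        # Prunes unreachable objects from the local cache.
        #
        def prune(self):
            raise ImplError("Cache '{kind}' does not implement prune()"
                            .format(kind=type(self).__name__))


    class CasCache(ArtifactCache):

        ################################################
        #             ArtifactCache methods            #
        ################################################

        # No second copy of the documenting comments here; the
        # reader knows to look at ArtifactCache for the contract.
        def remove(self, ref):
            ...

        def prune(self):
            ...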
This remains unfixed in master...
Transient state added as instance state, "artifact size"
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Some of the more junior folks on our team might not grasp what I'm
talking about here yet, but should understand when I illustrate the
problem.
Here is the addition of the "artifact size":
https://gitlab.com/BuildStream/buildstream/blob/886a836effa2ddaf740a1281487356e775d0c4f0/buildstream/element.py#L216
And its accessor:
https://gitlab.com/BuildStream/buildstream/blob/886a836effa2ddaf740a1281487356e775d0c4f0/buildstream/element.py#L1911
So now an element has an "artifact size", but what is it for ?
Well, it's so that we can observe the size of the artifact before
committing it to the artifact cache, and it's how we obtain the value
we use for estimating how big the artifact cache might be without
constantly scratching the disks, of course !
But then, *why* is it added to Element ? Doesn't Element have enough
baggage without carrying this variable ? It is simply *set* in
`Element._assemble()`:
https://gitlab.com/BuildStream/buildstream/blob/886a836effa2ddaf740a1281487356e775d0c4f0/buildstream/element.py#L1658
Just so that it can be later observed after calling
`Element._assemble()`.
In this case, we're fetching the value and propagating it through
every `ElementJob`, back to the main process:
https://gitlab.com/BuildStream/buildstream/blob/886a836effa2ddaf740a1281487356e775d0c4f0/buildstream/_scheduler/jobs/elementjob.py#L112
Just so that the BuildQueue can access it and add it to the estimated
size:
https://gitlab.com/BuildStream/buildstream/blob/886a836effa2ddaf740a1281487356e775d0c4f0/buildstream/_scheduler/queues/buildqueue.py#L94
All of this however is quite redundant, as we can see by observing
the tree after having fixed this.
All we needed to do was make `Element._assemble()` *return* the size
of the artifact it commits, and then we can use the Queue API as it
was intended to be used. Just return the value from the
Queue.process() implementation:
https://gitlab.com/BuildStream/buildstream/blob/958ef8ef61eee2221af68f5e1b77c6976dde749f/buildstream/_scheduler/queues/buildqueue.py#L71
And then the return value of `Element._assemble()` is available as
the 'result' parameter in the `Queue.done()` method:
https://gitlab.com/BuildStream/buildstream/blob/958ef8ef61eee2221af68f5e1b77c6976dde749f/buildstream/_scheduler/queues/buildqueue.py#L107
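Put together, the shape of the fix is roughly this (a simplified
sketch; the exact signatures and helper names are abbreviations and
assumptions, not the literal code):

    # element.py (sketch)
    class Element():

        def _assemble(self):
            ...
            # Commit the result and hand the size back to the
            # caller, rather than stashing it on the instance.
            artifact_size = self.__artifacts.commit(self, assembledir, keys)
            return artifact_size


    # _scheduler/queues/buildqueue.py (sketch)
    class BuildQueue(Queue):

        def process(self, element):
            # Whatever process() returns travels back from the
            # child process as the job result.
            return element._assemble()

        def done(self, job, element, result, success):
            if success:
                # 'result' is the artifact size returned by
                # Element._assemble() above.
                self._check_cache_size(job, element, result)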
In this story, we've seen that instead of first understanding how the
code is intended to work, we've worked completely around it,
introducing some contorted code paths for a purpose that the Queue
class was already designed to handle. But that is not the ultimate
point; the point here is that the size of the committed artifact is
short lived *transient* state, and as such it has no place polluting
the Element instance.
Another example of this, which I have not fixed yet, is the
Element._build_log_path:
https://gitlab.com/BuildStream/buildstream/blob/886a836effa2ddaf740a1281487356e775d0c4f0/buildstream/element.py#L218
Whose sole purpose is to return the log file path as deduced in
Element._assemble(), again so that it can be retrieved by the
BuildQueue:
https://gitlab.com/BuildStream/buildstream/blob/886a836effa2ddaf740a1281487356e775d0c4f0/buildstream/_scheduler/queues/buildqueue.py#L52
To be clear: the Element instance is not a dumping ground for any
transient state which you couldn't figure out how to propagate to
its endpoint.
Adding transient state to the instances is an unacceptable bandaid;
it just shows that people have been too lazy to either actually
understand how the surrounding code is already intended to work, or
too lazy to refactor the surrounding code to allow for such
propagation in a case where it was previously unaccounted for.
I.e. The code should *always* be made to make sense, and it should
always move forward as a whole.
Crazy complete bypass of scheduler logic for recording failed builds
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Since I touched upon the same code fragment in discussion of the
build log above, I'll point out the obvious:
https://gitlab.com/BuildStream/buildstream/blob/886a836effa2ddaf740a1281487356e775d0c4f0/buildstream/_scheduler/queues/buildqueue.py#L49
This should never happen.
We go to great lengths to ensure there is only one code path for a
given activity, and bypassing the scheduler logic entirely for the
purpose of launching your job is just not acceptable.
If there was anything special about this job that is any different
from other jobs, then you need to add a feature to the Queue, Job, or
Scheduler in order to provide what is needed, not circumvent the
entire mechanism.
Exclusivity of the CACHE resource
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
I hold myself primarily responsible for having let this issue creep
in in the first place, as it should have been obvious and I was not
attentive enough in my review of Tristan Maat's code, even if time
constraints was an issue it is no excuse; it only demonstrates that
we are rushing things into master way too quickly.
In order to implement automatic purging of LRU (least recently used)
artifacts in the local artifact cache, we needed a bit of churn in
the scheduler.
I guided Tristan Maat a lot in the features which would be needed,
and had to clean up a bit of a mess he left behind in the logging
which needlessly duplicated code and caused things to be hard to
follow.
In a nutshell: tests were added, there were many iterations of
review, and by the end I did not have the attention span left to
properly verify that everything was working.
Essentially, the underlying implementation details were done
properly: we added a "read write lock" mechanism in
_scheduler/resources.py, to ensure that we could handle the
exclusivity of jobs which require the artifact cache resource, and
to ensure we don't have starvation, by prioritizing jobs which
require exclusive access to a resource.
The result in the end however was that:
* Cache size calculation jobs became mutually exclusive to cache
cleanup jobs
* The pull jobs and build jobs, which write to the artifact cache,
did not require the artifact cache resource *at all*
* Only one job was allowed to access the artifact cache at a time,
regardless of exclusivity.
The result is that the whole mechanism we designed to ensure that
pull/build could not run at the same time as cleanup was not
actually being used for this purpose.
Very stupid that this landed after so much attention was spent; it is
mostly fixed in the following three simple commits:
https://gitlab.com/BuildStream/buildstream/commit/e5847077e6493e86ec5de09b2efa1f74a6f2d099
https://gitlab.com/BuildStream/buildstream/commit/98ab2b8916716d8d6dde6eb778c50addff90400b
https://gitlab.com/BuildStream/buildstream/commit/deaecdc1ad1c51096cc00d364d17cb8b0183a010
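Conceptually, the intent is roughly the following (an illustrative
sketch, not the literal contents of those commits): every job type
declares which resources it uses, and which of those it needs
exclusively, so that build and pull jobs hold the CACHE resource in
shared mode while the cleanup job takes it exclusively:

    # _scheduler/resources.py (conceptual sketch)
    class ResourceType():
        CACHE = 0      # The local artifact cache
        DOWNLOAD = 1
        PROCESS = 2


    # Jobs which commit to the artifact cache hold CACHE in
    # shared (non exclusive) mode...
    class BuildQueue(Queue):
        resources = [ResourceType.PROCESS, ResourceType.CACHE]

    class PullQueue(Queue):
        resources = [ResourceType.DOWNLOAD, ResourceType.CACHE]


    # ...while the cleanup job takes CACHE exclusively, so it can
    # never run concurrently with a build or pull that is writing
    # artifacts into the cache.
    class CleanupJob(Job):
        resources = [ResourceType.CACHE]
        exclusive_resources = [ResourceType.CACHE]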
In summary, how can we do better ?
==================================
In closing, I think it's appropriate to review the "Project is under
heavy load" discussion we had back in April[2].
I have to say that I agree with Sander for the most part that the
"land first and fix it later" approach works well in the settings in
which I have experienced it. But I should admit that in those
circumstances we did not allow relative novices to commit large
features without heavy review either, and with a handful of 5 to 7
full time developers with a good 10 years of experience under their
belts, you only have to worry about the occasional regression, which
you can indeed fix quickly after the fact.
I am not suggesting that we abandon the approach entirely either;
rather, we have to work on making it work better.
I should also stress that I am not worried about our team members'
ability to add regression tests with their patches and make the
software mostly "work", I am worried about dead-ending the overall
codebase by growing warts on it such that there is no longer any
comprehensible design or control flow, at which point no feature can be
added or bug can be fixed without mass rewrites.
What can we do to make things better ?
* I submit that peer review on its own is not enough; we need to
have the more experienced developers perform review for the
entry level developers.
For more trivial patches, developers should also seek out the
review of those peers who are more familiar with a given area of
the code base than the patch submitter.
* We should accept that we are going to get poked constantly for
reviews, this means our more experienced developers need to slow
down their velocity for the sake of the safety of the project as
a whole.
If you didn't have time to review a patch, and nobody appropriate
had the time to review a patch, it is not acceptable that patches
keep landing without the review of someone who knows what they
are doing - it is MUCH better that the feature is postponed if
it saves us from the kinds of mistakes I've outlined in the horror
stories above.
* All developers (or at least "committers") have to take these basic
concepts seriously; I cannot be the only one who cares about the
future of the codebase.
We are a team, we will succeed or fail as a team. In order for us
to succeed, I need everyone on the team to care.
* People need to feel empowered by their commit rights, surely.
But with that, these people should feel the gravity and
responsibility which comes with this; commit privileges should be
understood as a great honor, and no line of code should land
lightly, without the corresponding gravity of its side effects.
* Managers need to lower their expectations of velocity.
We have many people on the project now, but as someone
famous already said:
"9 women cannot collaborate to create a baby in one month"
This has been tried before, and it always fails.
We all need to be a team and share the burden of responsibility. If
our developers are inexperienced, under pressure, and given the
right to commit their work without thorough review from a senior
developer, then this is a clear recipe for dead-ending our precious
code base.
I submit that senior developers are better equipped to handle
pressure and should know where to draw the line responsibly and
say "No" at the right times, while juniors need to be given more
freedom to complete tasks safely.
* As also discussed in the "Project is under heavy load" thread[3],
it may go a long way to maintain some architectural diagrams to
better guide developers through our codebase.
I personally don't have experience working on any project which did
this so I'm not sure how effective it can be, but we certainly need
to communicate more clearly how components in our code base are
intended to interact, and what their respective responsibilities
are, since we are clearly seeing that developers are circumventing
the architecture to complete their tasks.
I'm not sure what else we can do here. I think that we need to make
public vs private members, and single vs double underscores, more clear
in the HACKING.rst somehow. The concept is not very difficult, and we
only really deviate from the Python standards because we need to hide
some methods on public classes which our core calls into internally.
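For example, the convention boils down to roughly this (a sketch for
illustration, not the finished HACKING.rst wording):

    class Element():
        """A public, plugin facing class."""

        # Public API: documented and callable by plugins.
        def stage_artifact(self, sandbox):
            ...

        # Single leading underscore on a *public* class: hidden
        # from plugins, but callable from other BuildStream core
        # modules.
        def _assemble(self):
            ...

        # Double leading underscore: local private, only Element
        # itself may call this (Python name mangles it).
        def __update_state(self):
            ...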
We cannot however solve everything with policy in the HACKING.rst;
people have to care about concepts like encapsulation, and the
HACKING.rst cannot become a course on how to write maintainable code.
Any other ideas on how we can implement better pre-merge disaster
prevention ?
[0]: https://mail.gnome.org/archives/buildstream-list/2018-July/msg00017.html
[1]: https://gitlab.com/BuildStream/buildstream/issues/623
[2]: https://mail.gnome.org/archives/buildstream-list/2018-April/msg00021.html
[3]: https://mail.gnome.org/archives/buildstream-list/2018-April/msg00030.html