[BuildStream] Some bad mistakes, let's understand them, and make this work better



TL;DR: We had a regression which causes the artifact cache to get
       corrupted[1], and fixing it was extremely difficult because
       developers have been making some pretty bad mistakes which are
       landing on master, as outlined in this email. Inviting
       discussion on how we can improve things.


Hi all,

    I'm writing you today because I see evidence that we are all not
doing well enough in review; we are rushing things in too quickly,
regardless of whether it is a new feature, a frantic bug fix, or a
performance improvement for our stable release. We have a lot of
developers on hand now, so I don't think we have an excuse.

In the interest of improving things, I am going to call out a bunch of
examples of things which have landed on master in recent months and
which really should never have made it that far.

Let me be clear: this is not intended as a rant, and I don't intend to
place blame on specific people; I am as much to blame as the rest of
us, as some of my examples involve code which I did not take enough
time to review. I think we all have to slow our roll here and be more
careful.

Also, especially to those of us who are more experienced: we have to
be spending more time reviewing code than writing code; we need to set
a good example, otherwise peer review really doesn't work and ends up
being just a checkmark indicating that somebody else has looked at the
code.

Since we switched to aggressively granting commit access almost 2
months ago[0], we have been allowing some unacceptable things to land,
along with some things which could be better. On the surface, we have
improved velocity, and while this checks a lot of boxes, it is not
worthwhile if our standards of quality go completely off the rails,
leaving us with a dead-ended codebase.

One thing I have to stress: when you perform a review, first ask
yourself these questions:

  * Does this patch make the codebase simpler than it was before ?

  * Is the newly added code going to be easy for the next developer to
    read ?

  * Does this patch circumvent encapsulation of logic in modules ?

  * Does this patch add transient state to object instances instead of
    properly propagating transient state through function calls ?

Before considering even whether the code works, I want to know that it
*makes sense*. Bugs are easy to fix, and regressions easy to test for,
but bandaid patches are not acceptable. If your two line patch breaks
encapsulation and inserts logic at an inappropriate place, it is
probably because the surrounding architecture did not account for what
you intend - this means your patch must *modify* the surrounding code
structure in the same patch. Even if that turns a two line patch into
a 100 line patch, it is the appropriate course of action: the code
base must always move forward as a whole, and never bit rot due to
additions of bandaids over bandaids.


So, I'd like for us to discuss how we can go about doing better, and to
that end I will illustrate some of the problems I encountered this
weekend, while trying to solve the crashes and corrupted artifact
caches which are essentially a result of concurrent commit/prune[1].


A series of ugly stories
========================


  Introduction of local private methods, called from outside
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  The _add_artifact_size() and _set_cache_size() functions were added
  as local private methods to ArtifactCache():

     https://gitlab.com/BuildStream/buildstream/blob/886a836effa2ddaf740a1281487356e775d0c4f0/buildstream/_artifactcache/artifactcache.py#L538

  But these private methods are called *outside* of ArtifactCache !

     https://gitlab.com/BuildStream/buildstream/blob/886a836effa2ddaf740a1281487356e775d0c4f0/buildstream/_scheduler/jobs/cachesizejob.py#L33
     https://gitlab.com/BuildStream/buildstream/blob/886a836effa2ddaf740a1281487356e775d0c4f0/buildstream/_scheduler/jobs/cleanupjob.py#L33
     https://gitlab.com/BuildStream/buildstream/blob/886a836effa2ddaf740a1281487356e775d0c4f0/buildstream/_scheduler/queues/buildqueue.py#L98

  The _check_cache_size_real() private function was added to the
  Scheduler(), without even an API documenting comment:

     https://gitlab.com/BuildStream/buildstream/blob/886a836effa2ddaf740a1281487356e775d0c4f0/buildstream/_scheduler/scheduler.py#L331

  And called from the PullQueue:

     https://gitlab.com/BuildStream/buildstream/blob/886a836effa2ddaf740a1281487356e775d0c4f0/buildstream/_scheduler/queues/pullqueue.py#L65

  Either this was originally added this way, or someone came along
  later, found that they needed to call these methods, and recklessly
  called them without making them public first.
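
  To illustrate the convention (a minimal sketch with a hypothetical
  method body, not the actual BuildStream code): if a job in another
  module needs to record the cache size, then the method should first
  be promoted to a documented public method on ArtifactCache, and only
  then called from outside:

    class ArtifactCache():

        def __init__(self):
            self._cache_size = None   # local private state

        # set_cache_size()
        #
        # Forcefully set the size of the artifact cache, e.g. from
        # a job which has just finished calculating it.
        #
        # Args:
        #    cache_size (int): The new cache size in bytes
        #
        def set_cache_size(self, cache_size):
            self._cache_size = cache_size

  The call sites in cachesizejob.py, cleanupjob.py and buildqueue.py
  would then call something like artifacts.set_cache_size(size),
  touching only public API.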


  Completely broken encapsulation of ArtifactCache
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  Right at the top of ArtifactCache, notice that some things which
  pretty obviously should be local state are declared as public
  variables:

     https://gitlab.com/BuildStream/buildstream/blob/886a836effa2ddaf740a1281487356e775d0c4f0/buildstream/_artifactcache/artifactcache.py#L81

  Especially: estimated_size, cache_size, cache_quota,
  cache_lower_threshold.

  Let's set aside for a moment the fact that, when you take a step
  back and consider the approach of estimating the size and
  recalculating it when you reach the threshold, you never needed to
  have estimated_size and cache_size as separate variables - which,
  believe me, confuses things a lot - and let's concentrate only on
  the fact that there is no encapsulation *at all* here.

  Why is encapsulation important ? 

  Simply put: we want to localize the business logic of given
  "components" in a complex system, so that we can correct defects and
  make enhancements without having to digest the entire complex system
  in one sitting. That is why we have public/private internal API
  contracts, why we create modular components, and why we carefully
  consider "Who is responsible for doing what ?".

  As a developer, you should have the right to fix a defect in the
  ArtifactCache business logic layer and only have to look at the
  ArtifactCache itself - and you would, if its API contract were
  properly defined and its internals properly sealed up.

  Let's take a closer look at the side effects this had:

  CasCache offenses:

    Directly modifies state in its calculate_cache_size()
    implementation:
     https://gitlab.com/BuildStream/buildstream/blob/886a836effa2ddaf740a1281487356e775d0c4f0/buildstream/_artifactcache/cascache.py#L532

    Sets cache_size to None in commit():
     https://gitlab.com/BuildStream/buildstream/blob/886a836effa2ddaf740a1281487356e775d0c4f0/buildstream/_artifactcache/cascache.py#L125

    To be clear: No. Subclasses do not have more rights to intrude on
    their parent class than anything else; everything must go through
    a well defined mutator at least.

    Accessors are usually preferred, but if an instance variable is
    declared publicly, feel free to *READ* it - but *NEVER* write it.

    Note that in the above, the second offense of setting the value to
    None is just crazy: it actually has a meaning and causes an
    intentional side effect, which is to force a recalculation in the
    next round, but there is no documentation or even a heavy duty
    comment anywhere about what this does.

  The Queue base class goes ahead and sets ArtifactCache.cache_size:

     https://gitlab.com/BuildStream/buildstream/blob/886a836effa2ddaf740a1281487356e775d0c4f0/buildstream/_scheduler/queues/queue.py#L305

  Then, there are other instances where ArtifactCache.cache_quota is
  checked directly in the scheduler, which are less offensive but
  still cause brain damage.
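
  To spell the rule out in code (a simplified sketch with hypothetical
  method names, not the real classes): reads may go through an
  accessor, but the one write path lives in the owning class, where
  its special semantics can be documented in a single place:

    class ArtifactCache():

        def __init__(self):
            self._cache_size = None

        # get_cache_size()
        #
        # Returns the size of the artifact cache, or None if it is
        # currently unknown and needs to be recalculated.
        #
        def get_cache_size(self):
            return self._cache_size

        # _invalidate_cache_size()
        #
        # Force a recalculation of the cache size on the next query.
        # This is the *only* place where the "set it to None" side
        # effect lives, and its meaning is documented right here.
        #
        def _invalidate_cache_size(self):
            self._cache_size = None

    class CasCache(ArtifactCache):

        def commit(self, element, content, keys):
            ...  # commit the artifact, then:
            self._invalidate_cache_size()  # not: self.cache_size = None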

  The moral of this story, specifically about encapsulation, is that
  by breaking it you are just making everybody's life difficult. Now
  you have:

    o Two separate variables to store the cache size
    o Various files which read and even *write* the cache size

  Good luck fixing a bug in this situation, or understanding the
  author's intent as to how this was supposed to work in the first
  place. I was unable to fix issue 623 at all without first reducing
  this to something comprehensible; and it took me the whole weekend.

  I should add: Tiago Gomes had done some of the same refactorings I
  did in his branch, and I would like to take this opportunity to
  apologize again in public for having overlooked his refactors; I was
  in a frenzy to fix 623 and didn't take the time.


  Broken / unclear API contract in ArtifactCache
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  Previously, we had multiple artifact cache implementations; now we
  have only one, the CasCache.

  Currently the abstract class is still in place and continues to
  manage higher level business logic unrelated to the actual cache
  routines, which is quite alright. But don't forget that CasCache
  remains an implementation of the ArtifactCache.

  The ArtifactCache offers a series of virtual methods to be
  implemented by its subclass - the delegated routines - and this
  includes ArtifactCache.remove():

     https://gitlab.com/BuildStream/buildstream/blob/886a836effa2ddaf740a1281487356e775d0c4f0/buildstream/_artifactcache/artifactcache.py#L349

  Notice how it takes one argument and is not expected to return
  anything ?

  Well, CasCache goes ahead and violates this:

     https://gitlab.com/BuildStream/buildstream/blob/886a836effa2ddaf740a1281487356e775d0c4f0/buildstream/_artifactcache/cascache.py#L564

  First of all, we do not document abstract methods in more than one
  place; we only document the method where it is initially declared.
  The subclasses which implement these methods group them together in
  a section of their class and document the group as something like
  "ArtifactCache methods", so that any reader of the subclass knows
  where to look for the API documenting comments.

  In this case, CasCache has *redocumented* its version of the
  remove() method which it implements on behalf of ArtifactCache, and
  uses a *different signature*.

  To make things more confusing, the ArtifactCache class which declares
  the delegate remove() function *actually observes the return value*,
  even though the ArtifactCache declaration of remove() clearly states
  that it does not know anything about any return value:

     https://gitlab.com/BuildStream/buildstream/blob/886a836effa2ddaf740a1281487356e775d0c4f0/buildstream/_artifactcache/artifactcache.py#L257

  Obviously, the documentation comment needs to be removed from
  CasCache, and the actual abstract method needs to have its
  documentation updated to reflect the real nature of the delegate
  routine.

  Further, seeing as the CasCache itself adds a "defer_prune" argument
  to the signature, "pruning" becomes a concept known to the
  ArtifactCache. That means that ArtifactCache itself must now offer
  an ArtifactCache.prune() virtual method for its subclasses to
  potentially implement.

  This remains unfixed in master...
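
  For illustration, here is roughly the shape the corrected contract
  could take (a sketch with hypothetical documentation, not a patch):

    class ArtifactCache():

        # remove()
        #
        # Removes the artifact for the specified ref from the local
        # artifact cache.
        #
        # Args:
        #    ref (str): The ref of the artifact to remove
        #
        # Returns:
        #    (int): The amount of disk space recovered, in bytes
        #
        def remove(self, ref):
            raise NotImplementedError()

        # prune()
        #
        # Prunes unreachable objects from the cache, now that pruning
        # is a concept which the ArtifactCache knows about.
        #
        def prune(self):
            raise NotImplementedError()

    class CasCache(ArtifactCache):

        ################################################
        #       Implementation of abstract methods     #
        ################################################

        def remove(self, ref):
            ...  # same signature as the declaration, no redocumentation

        def prune(self):
            ...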


  Transient state added as instance state, "artifact size"
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  Some of the more junior folks on our team might not grasp what I'm
  talking about here yet, but should understand when I illustrate the
  problem.

  Here is the addition of the "artifact size":
     https://gitlab.com/BuildStream/buildstream/blob/886a836effa2ddaf740a1281487356e775d0c4f0/buildstream/element.py#L216

  And its accessor:
     https://gitlab.com/BuildStream/buildstream/blob/886a836effa2ddaf740a1281487356e775d0c4f0/buildstream/element.py#L1911

  So now an element has an "artifact size", but what is it for ?

  Well, it's so that we can observe the size of the artifact before
  committing it to the artifact cache, and it's how we obtain the
  value we use for estimating how big the artifact cache might be
  without constantly scratching the disks, of course !

  But then, *why* is it added to Element ? Doesn't Element have enough
  baggage without carrying this variable ? It is simply *set* in
  `Element._assemble()`:

     https://gitlab.com/BuildStream/buildstream/blob/886a836effa2ddaf740a1281487356e775d0c4f0/buildstream/element.py#L1658

  Just so that it can be later observed after calling
  `Element._assemble()`.

  In this case, we're fetching the value and propagating it through
  every `ElementJob`, back to the main process:

     https://gitlab.com/BuildStream/buildstream/blob/886a836effa2ddaf740a1281487356e775d0c4f0/buildstream/_scheduler/jobs/elementjob.py#L112

  Just so that the BuildQueue can access it and add it to the estimated
  size:

     https://gitlab.com/BuildStream/buildstream/blob/886a836effa2ddaf740a1281487356e775d0c4f0/buildstream/_scheduler/queues/buildqueue.py#L94

  All of this, however, is quite redundant, as we can see by observing
  the tree after having fixed this.

  All we needed to do was make `Element._assemble()` *return* the size
  of the artifact it commits; then we can use the Queue API as it was
  intended to be used, and just return the value from the
  Queue.process() implementation:

     https://gitlab.com/BuildStream/buildstream/blob/958ef8ef61eee2221af68f5e1b77c6976dde749f/buildstream/_scheduler/queues/buildqueue.py#L71

  And then the return value of `Element._assemble()` is available as
  the 'result' parameter in the `Queue.done()` method:

     https://gitlab.com/BuildStream/buildstream/blob/958ef8ef61eee2221af68f5e1b77c6976dde749f/buildstream/_scheduler/queues/buildqueue.py#L107
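
  Put together, the fixed flow looks roughly like this (simplified
  from the commits linked above; the _check_cache_size() helper name
  here is illustrative):

    class BuildQueue(Queue):

        def process(self, element):
            # Runs in the child process; whatever we return here
            # travels back to the main process as the job's result.
            return element._assemble()

        def done(self, job, element, result, success):
            # Runs in the main process; 'result' is the artifact
            # size which _assemble() returned in the child.
            if success:
                self._check_cache_size(job, element, result)

  No state is parked on the Element; the transient value simply rides
  along the path which the Queue API already provides for results.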

  In this story, we've seen that instead of first understanding how
  the code is intended to work, we've worked completely around it,
  introducing some contorted code paths for a purpose that the Queue
  class was already designed to handle. But that is not the ultimate
  point; the point here is that the size of the committed artifact is
  short lived *transient* state, and as such it has no place polluting
  the Element instance.

  Another example of this, which I have not fixed yet, is the
  Element._build_log_path:

     https://gitlab.com/BuildStream/buildstream/blob/886a836effa2ddaf740a1281487356e775d0c4f0/buildstream/element.py#L218

  Its sole purpose is to hold the log file path as deduced in
  Element._assemble(), again so that it can be retrieved by the
  BuildQueue:

     https://gitlab.com/BuildStream/buildstream/blob/886a836effa2ddaf740a1281487356e775d0c4f0/buildstream/_scheduler/queues/buildqueue.py#L52

  To be clear: the Element instance is not a dumping ground for any
  transient state which you couldn't figure out how to propagate to
  its endpoint.

  Adding transient state to the instances is an unacceptable bandaid;
  it just shows that people have been too lazy either to actually
  understand how the surrounding code is already intended to work, or
  to refactor the surrounding code to allow for such propagation in a
  case where it was previously unaccounted for.

  I.e. the code should *always* be made to make sense, and it should
  always move forward as a whole.
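
  A hypothetical sketch of the same medicine applied to the build log
  (this is not an actual fix, none exists yet): Element._assemble()
  could return the log path alongside the artifact size, so that both
  transient values ride through the job result:

    # In Element (hypothetical return value):
    def _assemble(self):
        ...  # build and commit the artifact
        return artifact_size, build_log_path

    # In BuildQueue (hypothetical unpacking of the job result):
    def done(self, job, element, result, success):
        artifact_size, build_log_path = result
        ...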


  Crazy complete bypass of scheduler logic for recording failed builds
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  Since I touched upon the same code fragment in the discussion of the
  build log above, I'll point out the obvious:

     https://gitlab.com/BuildStream/buildstream/blob/886a836effa2ddaf740a1281487356e775d0c4f0/buildstream/_scheduler/queues/buildqueue.py#L49

  This should never happen.

  We go to great lengths to ensure there is only one code path for a
  given activity, and bypassing the scheduler logic entirely for the
  purpose of launching your job is just not acceptable.

  If there is anything special about this job that makes it any
  different from other jobs, then you need to add a feature to the
  Queue, Job, or Scheduler in order to provide what is needed, not
  circumvent the entire mechanism.


  Exclusivity of the CACHE resource
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  I hold myself primarily responsible for having let this issue creep
  in in the first place; it should have been obvious, and I was not
  attentive enough in my review of Tristan Maat's code. Even if time
  constraints were an issue, that is no excuse; it only demonstrates
  that we are rushing things into master way too quickly.

  In order to implement automatic purging of LRU (least recently used)
  artifacts in the local artifact cache, we needed a bit of churn in
  the scheduler.

  I guided Tristan Maat a lot on the features which would be needed,
  and had to clean up a bit of a mess he left behind in the logging,
  which needlessly duplicated code and caused things to be hard to
  follow.

  In a nutshell: tests were added, there were many iterations of
  review, and by the end I did not have the attention span left to
  properly verify that everything was working correctly.

  The underlying implementation details were essentially done
  properly: we added a "read write lock" mechanism in
  _scheduler/resources.py to ensure that we could handle the
  exclusivity of jobs which require the artifact cache resource, and
  to avoid starvation by prioritizing jobs which require exclusive
  access to a resource.

  The result in the end however was that:
    * Cache size calculation jobs became mutually exclusive to cache
      cleanup jobs
    * The pull jobs and build jobs, which write to the artifact cache,
      did not require the artifact cache resource *at all*
    * Only one job was allowed to access the artifact cache at a time,
      regardless of exclusivity.

  The result is that the whole mechanism we designed to ensure that
  pull/build could not run at the same time as cleanup was not
  actually being used for this purpose.
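
  In terms of the scheduler's resource mechanism, the intent amounts
  to declarations along these lines (an illustrative sketch; see the
  commits below for the real change):

    # Jobs which write to the artifact cache hold the CACHE resource
    # as readers; any number of them may run concurrently.
    class BuildQueue(Queue):
        resources = [ResourceType.PROCESS, ResourceType.CACHE]

    class PullQueue(Queue):
        resources = [ResourceType.DOWNLOAD, ResourceType.CACHE]

    # The cleanup job takes the CACHE resource exclusively, like a
    # writer taking a write lock: it waits for the readers to drain
    # and blocks new ones while it prunes artifacts.
    class CleanupJob(Job):
        resources = [ResourceType.CACHE]
        exclusive_resources = [ResourceType.CACHE]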

  It is very stupid that this landed after so much attention was
  spent; it is mostly fixed in the following three simple commits:

    https://gitlab.com/BuildStream/buildstream/commit/e5847077e6493e86ec5de09b2efa1f74a6f2d099
    https://gitlab.com/BuildStream/buildstream/commit/98ab2b8916716d8d6dde6eb778c50addff90400b
    https://gitlab.com/BuildStream/buildstream/commit/deaecdc1ad1c51096cc00d364d17cb8b0183a010


In summary: how can we do better ?
==================================
In closing, I think it's appropriate to review the "Project is under
heavy load" discussion we had back in April[2].

I have to say that I agree with Sander for the most part that the
"land first and fix it later" approach works well in the settings
where I have experienced it. But I should admit that in those
circumstances we did not allow relative novices to commit large
features without heavy review either; and with a handful of 5 to 7
full time developers with a good 10 years of experience under their
belts, you only have to worry about the occasional regression, which
you can indeed fix quickly after the fact.

I am not suggesting that we abandon the approach entirely either; we
have to work on making it work better.

I should also stress that I am not worried about our team members'
ability to add regression tests with their patches and make the
software mostly "work". I am worried about dead-ending the overall
codebase by growing warts on it such that there is no longer any
comprehensible design or control flow, at which point no feature can
be added nor bug fixed without mass rewrites.

What can we do to make things better ?

  * I submit that peer review on its own is not enough; we need to
    have the more experienced developers perform review for the
    entry level developers.

    For more trivial patches, developers should also seek out the
    review of those peers who are more familiar with a given area of
    the code base than the patch submitter.

  * We should accept that we are going to get poked constantly for
    reviews, this means our more experienced developers need to slow
    down their velocity for the sake of the safety of the project as
    a whole.

    If you didn't have time to review a patch, and nobody appropriate
    had the time to review it, it is not acceptable for patches to
    keep landing without the review of someone who knows what they
    are doing - it is MUCH better that a feature be postponed if that
    saves us from the kinds of mistakes I've outlined in the horror
    stories above.

  * All developers (or at least "committers") have to take these
    basic concepts seriously; I cannot be the only one who cares
    about the future of the codebase.

    We are a team, we will succeed or fail as a team. In order for us
    to succeed, I need everyone on the team to care.

  * People need to feel empowered by their commit rights, surely.

    But with that, these people should feel the gravity and
    responsibility which comes with them; commit privileges should be
    understood as a great honor, and no line of code should land
    lightly, without consideration of the gravity of its side effects.

  * Managers need to lower their expectations of velocity.

    We have many people on the project now, but as Fred Brooks
    famously observed:

      "9 women cannot collaborate to create a baby in one month"

    This has been tried many times, and it always fails.

    We all need to be a team and share the burden of responsibility.
    If our developers are inexperienced, under pressure, and given the
    right to commit their work without thorough review from a senior
    developer, then this is a clear recipe for dead-ending our
    precious code base.

    I submit that senior developers are better equipped to handle
    pressure and should know where to draw the line responsibly and
    say "No" at the right times, while juniors need to be given more
    freedom to complete tasks safely.

  * As also discussed in the "Project is under heavy load" thread[3],
    it may go a long way to maintain some architectural diagrams to
    better guide developers through our codebase.

    I personally don't have experience working on any project which
    did this, so I'm not sure how effective it can be, but we
    certainly need to communicate more clearly how the components in
    our code base are intended to interact, and what their respective
    responsibilities are, since we are clearly seeing that developers
    are circumventing the architecture to complete their tasks.

I'm not sure what else we can do here. I think we need to make the
distinction between public and private members, and between single
and double underscores, clearer in HACKING.rst somehow. The concept
is not very difficult, and we only really deviate from the Python
standards because we need to hide some methods on public classes
which our core calls into internally.
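
For instance, HACKING.rst could illustrate the rule with something as
small as this (an illustrative example, not the current text):

    class Element():              # public, plugin facing class

        def assemble(self, sandbox):   # no underscore: public API
            ...                        # which plugins implement

        def _assemble(self):      # single underscore: private to
            ...                   # BuildStream, called by the core

        def __update_artifact_size(self):  # double underscore: local
            ...                            # private, only ever used
                                           # within this class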

We cannot, however, solve everything with policy in HACKING.rst;
people have to care about concepts like encapsulation, and
HACKING.rst cannot become a course on how to write maintainable code.

Any other ideas on how we can implement better pre-merge disaster
prevention ?




[0]: https://mail.gnome.org/archives/buildstream-list/2018-July/msg00017.html
[1]: https://gitlab.com/BuildStream/buildstream/issues/623
[2]: https://mail.gnome.org/archives/buildstream-list/2018-April/msg00021.html
[3]: https://mail.gnome.org/archives/buildstream-list/2018-April/msg00030.html


