Re: [BuildStream] Some bad mistakes, lets understand them, and make this work better



Hi Tristan,

Thanks for posting, and for taking the time to include examples.

On Tue, Sep 11, 2018 at 10:04 AM Tristan Van Berkom via BuildStream-list <buildstream-list@gnome.org> wrote:
TL;DR: We had a regression which causes the artifact cache to get
       corrupted[1], and fixing it was extremely difficult because
       developers are making pretty bad mistakes and they are landing
       on master, outlined in this email. Inviting discussion on how we
       can improve things.


Hi all,

    I'm writing you today because I see evidence that we are all not
doing well enough in review; we are rushing things in too quickly,
regardless of whether it is a new feature, a frantic bug fix, or a
performance improvement for our stable release. We have a lot of
developers on hand now, so I don't think we have an excuse.

I think there is more judgement here than is warranted.
 
In the interest of improving things, I am going to call out a bunch of
examples of things which have landed on master in recent months, which
really should never have gone as far as landing on master.

I think it might have been better to lead with the "how can we improve" part that is now all the way at the bottom, because I'm pretty sure the majority of the audience will tune out halfway through.

I think the emphasis here should be on what is being paid attention to in review.

Let me be clear: this is not intended as a rant, and I don't intend to
place blame on specific people; I am as much to blame as the rest of
us, as some of my examples involve code which I did not take enough
time to review. I think we all have to slow our roll here and be more
careful.

Also, especially to those of us who are more experienced: we have to be
spending more time reviewing code than writing code; we need to set a
good example otherwise peer review really doesn't work and ends up
being just a checkmark that somebody else has looked at the code.

Since we switched to aggressively granting commit access almost 2
months ago[0], we have been allowing some unacceptable things to land,
along with some things which could be better. On the surface, we have
improved velocity, and while this checks a lot of boxes, it is not
worthwhile if our standards of quality go completely off the rails,
leaving us with a dead-ended codebase.

One thing I have to stress is, when you perform a review; first ask
yourself these questions:

  * Does this patch make the codebase simpler than it was before ?

  * Is the newly added code going to be easy for the next developer to
    read ?

  * Does this patch circumvent encapsulation of logic in modules ?

  * Does this patch add transient state to object instances instead of
    properly propagating transient state through function calls ?

Can we add this to HACKING.rst?
 
Before even considering whether the code works, I want to know that it
*makes sense*. Bugs are easy to fix, and regressions easy to test
for, but bandaid patches are not acceptable. If your two line patch
breaks encapsulation and inserts logic at an inappropriate place, it is
probably because the surrounding architecture did not account for what
you intend - this means your patch must *modify* the surrounding code
structure in the same patch. Even if it turns a two line patch into a
100 line patch, this is the appropriate course of action: the code base
must always move forward as a whole, and never bit rot due to additions
of bandaids over bandaids.


So, I'd like for us to discuss how we can go about doing better, and to
that end I will illustrate some of the problems I encountered this
weekend, while trying to solve the crashes and corrupted artifact
caches which are essentially a result of concurrent commit/prune[1].


A series of ugly stories
========================
[....really useful specific examples...]

The examples above are really good material for HACKING.rst.  Even just linking to this thread might be useful.
 
  Exclusivity of the CACHE resource
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  I hold myself primarily responsible for having let this issue creep
  in in the first place, as it should have been obvious and I was not
  attentive enough in my review of Tristan Maat's code. Even if time
  constraints were an issue, that is no excuse; it only demonstrates
  that we are rushing things into master way too quickly.

I don't think that we can jump to that conclusion.  Or maybe it's the way that you've phrased it here.  When I read "quickly", I think about time, and time is not a good measure in this case.  A code change doesn't become better with age.

https://gitlab.com/BuildStream/buildstream/merge_requests/347 was started 5 months ago.  It landed 1 month ago through https://gitlab.com/BuildStream/buildstream/merge_requests/548.

There are 113 "discussions" on there, with several senior developers participating.

  In order to implement automatic purging of LRU (least recently used)
  artifacts in the local artifact cache, we needed a bit of churn in
  the scheduler.

  I guided Tristan Maat a lot in the features which would be needed,
  and had to clean up a bit of a mess he left behind in the logging,
  which needlessly duplicated code and made things hard to
  follow.

  In a nutshell: tests were added, there were many iterations of
  review, and by the end I did not have the attention span left to
  properly verify that everything was working.

To me this also smells of the long-lived-branch problem.  The problems weren't seen because they weren't landing on master.  And when the whole change did land on master, it was too big to properly review again.  And besides, several people had looked at it already, it baked for 4 months, so it must be good, right?
 
  Essentially, the underlying mechanism was implemented properly: we
  added a "read write lock" in _scheduler/resources.py to handle the
  exclusivity of jobs which require the artifact cache resource, and
  to avoid starvation by prioritizing jobs which require exclusive
  access to a resource.

  The result in the end however was that:
    * Cache size calculation jobs became mutually exclusive to cache
      cleanup jobs
    * The pull jobs and build jobs, which write to the artifact cache,
      did not require the artifact cache resource *at all*
    * Only one job was allowed to access the artifact cache at a time,
      regardless of exclusivity.

  The result is that the whole mechanism we designed to ensure that
  pull/build could not run at the same time as cleanup was not
  actually being used for that purpose.

After 4 months it's easy to lose track of the true intention / design of the code change, and to lose excitement.
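
For anyone who hasn't read _scheduler/resources.py, the intent as I
understand it is roughly a read/write lock per resource: jobs that
write into the cache hold it in shared mode, cleanup holds it
exclusively, and a waiting exclusive job blocks new shared holders so
it cannot be starved.  A minimal sketch of that idea, with made-up
names (not the actual implementation):

    # Rough sketch of a prioritized read/write lock guarding one
    # scheduler resource (e.g. the artifact cache).  Hypothetical
    # names, for illustration only.

    class CacheResource():

        def __init__(self):
            # Bumped/cleared by the scheduler when an exclusive job is
            # queued, so that cleanup is never starved by shared jobs.
            self.exclusive_waiting = 0
            self._shared_holders = 0      # e.g. pull/build jobs writing artifacts
            self._exclusive_held = False  # e.g. a cleanup job pruning the cache

        def try_acquire(self, exclusive=False):
            if exclusive:
                # Exclusive access: nobody else may hold the resource.
                if self._exclusive_held or self._shared_holders:
                    return False
                self._exclusive_held = True
            else:
                # Shared access: back off while an exclusive job holds,
                # or is waiting for, the resource.
                if self._exclusive_held or self.exclusive_waiting:
                    return False
                self._shared_holders += 1
            return True

        def release(self, exclusive=False):
            if exclusive:
                self._exclusive_held = False
            else:
                self._shared_holders -= 1

The way I read the story above, the pieces were all there, but pull
and build jobs never acquired the resource in shared mode at all, so
the cleanup job's exclusivity had nothing to exclude.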

[...] 
In summary, how can we do better ?
==================================
In closing, I think it's appropriate to review the "Project is under
heavy load" discussion we had back in April[2].

I have to say that I agree with Sander for the most part that the
"land first and fix it later" approach works well in the settings in
which I have experienced it. But I should admit that in those
circumstances we did not allow relative novices to commit large
features without heavy review either, and with a handful of 5 to 7
full time developers with a good 10 years of experience under their
belts, you only have to worry about the occasional regression, which
you can indeed fix quickly after the fact.

I'll admit that I didn't quite expect us to be as aggressive with our commit access as we have been - we sort of went from one extreme to the other :).  I was expecting us to have existing committers invite others to the set of committers.  Essentially, give commit access to people you feel can be trusted to land a change with confidence, or to call out for help when they need it - in short, people who show good judgement.
I definitely want to be careful with putting people in categories like novice or senior or junior.  I prefer us to think of everyone as peers, each with strengths and weaknesses.  I'm not worried about people overstepping boundaries; all the code is in version control, and we can always go back to a previous state - at a cost.

I am not suggesting that we abandon the approach entirely either; we
have to work on making it work better.

+1.
 
I should also stress that I am not worried about our team members'
ability to add regression tests with their patches and make the
software mostly "work". I am worried about dead-ending the overall
codebase by growing warts on it such that there is no longer any
comprehensible design or control flow, at which point no feature can be
added or bug fixed without mass rewrites.

What can we do to make things better ?

  * I submit that peer review on its own is not enough; we need to
    have the more experienced developers perform review for the
    entry level developers.

I want to be careful phrasing it like this.
 
    For more trivial patches, developers should also seek out the
    review of those peers who are more familiar with a given area of
    the code base than the patch submitter.

When someone _lands_ a change in an area that you are familiar with, you will pay more attention.  When a change sits on a branch, waiting for your review, it's a lot easier to let it sit there than to give it your full attention :)
 
  * We should accept that we are going to get poked constantly for
    reviews, this means our more experienced developers need to slow
    down their velocity for the sake of the safety of the project as
    a whole.

    If you didn't have time to review a patch, and nobody appropriate
    had the time to review a patch, it is not acceptable that patches
    keep landing without the review of someone who knows what they
    are doing - it is MUCH better that the feature is postponed if
    it saves us from the kinds of mistakes I've outlined in the horror
    stories above.

I'm of a slightly different mind here.  Typically people will pay attention to what is landing on master - given enough visibility.  We haven't really gotten our -commits@ list sorted properly in that respect.

If we feel that changes are landed prematurely, and effectively people are showing poor judgement, then that's a different story.  We should go back to reviewing commit privileges in that case.

OT: the buildstream-notifications list is a big mess that isn't really helpful in seeing changes; why are the buildgrid notifications going there?  And what's with the prefixes?

  * All developers (or at least "committers") have to take these basic
    concepts seriously; I cannot be the only one who cares about the
    future of the codebase.

I do find it a bit daunting that you posit you're currently the only one who cares.  That is neither inspiring nor motivating for others who put in time to do reviews.
 
    We are a team, we will succeed or fail as a team. In order for us
    to succeed, I need everyone on the team to care.

+1.

  * People need to feel empowered by their commit rights, surely.

    But with that, these people should feel the gravity and
    responsibility which come with this; commit privileges should be
    understood as a great honor, and no line of code should land
    lightly, without due consideration of its side effects.

:).  I don't know why, but sometimes I feel that the way the message is delivered is so gloomy...

I prefer to think of it this way: people should feel responsible.  You should feel especially responsible when you land changes to master, as others will be impacted by them.
However, that responsibility shouldn't crush you and stop you from doing the right thing.  Mistakes will be made, own up to them, help fix them, learn.

  * Managers need to lower their expectations of velocity.

    We have many people on the project now, but as someone
    famous already said:

      "9 women cannot collaborate to create a baby in one month"

    This has been tried, and it always fails.

    We all need to be a team and share the burden of responsibility.
    If our developers are inexperienced, under pressure, and given the
    right to commit their work without thorough review from a senior
    developer, then that is a clear recipe for dead-ending our precious
    code base.

    I submit that senior developers are better equipped to handle
    pressure and should know where to draw the line responsibly and
    say "No" at the right times, while juniors need to be given more
    freedom to complete tasks safely.

I think this is out of scope for the project, and up to the community members themselves.  One thing to point out is that developers on a project do have individual responsibility.  It is their actions we see, not what is happening in the confines of the office of their employer.

  * As also discussed in the "Project is under heavy load" thread[3],
    it may go a long way to maintain some architectural diagrams to
    better guide developers through our codebase.

    I personally don't have experience working on any project which did
    this so I'm not sure how effective it can be, but we certainly need
    to communicate more clearly how components in our code base are
    intended to interact, and what their respective responsibilities
    are, since we are clearly seeing that developers are circumventing
    the architecture to complete their tasks.

Admittedly, when I finally got around to taking your original diagrams and document and comparing them to the codebase, things were not quite the same anymore.  There is now a Stream, for instance.  I timed out, and apologize for that.
 
I'm not sure what else we can do here. I think that we need to make
public vs private members, and single vs double underscores, more clear
in the HACKING.rst somehow. The concept is not very difficult, and we
only really deviate from the Python standards because we need to hide
some methods on public classes which our core calls into internally.

We cannot, however, solve everything with policy in the HACKING.rst;
people have to care about concepts like encapsulation, and the
HACKING.rst cannot become a course on how to write maintainable code.
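
On the single vs double underscore point, maybe a short worked example
in HACKING.rst would go most of the way.  This is how I understand our
convention - a hypothetical class, correct me if I have it backwards:

    class Widget():
        """A class which is exposed in the public API."""

        def frobnicate(self):
            """Public API: plugins and users may call this."""
            return self._update_state()

        # Single leading underscore: private to BuildStream, not public
        # API, but other parts of the core are allowed to call it.
        def _update_state(self):
            return self.__recalculate()

        # Double leading underscore: a local detail of this class;
        # nothing outside this file should ever touch it.
        def __recalculate(self):
            return 42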

My suggestions above were really helpful in that respect :D
Seriously, HACKING.rst is already at a point where it needs an index.  Maybe we need to split things out.  An example from the Subversion project that grew over time:   
 http://subversion.apache.org/docs/community-guide/

I would recommend reading through the first two sections just to get an idea.
 
Any other ideas on how we can implement better pre-merge disaster
prevention ?

I'm going to say pre- or post-merge :).

The intent of the change needs to be clear when doing a review.  Knowing what changes are expected to be in there is useful.
Continuing with the same example project as above, I used to appreciate the commit log messages a great deal.  Random example: https://github.com/apache/subversion/commit/d29903f400be65a6398bba58d446fd95b69ce019

The commit message outlines the overall change, and then the change per file (or group of files). This allows for two things:
- verifying that the summary makes sense.
- verifying whether the code changes match the summary.
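
For illustration, a log message in that style might look something like
this (entirely made up, not a real BuildStream commit - the file names
are just for flavour):

    Make cache cleanup mutually exclusive with all other cache users.

    * _scheduler/resources.py: Refuse shared access to the cache while
      an exclusive job is waiting, so that cleanup cannot be starved.

    * _scheduler/queues/buildqueue.py,
      _scheduler/queues/pullqueue.py: Declare the artifact cache as a
      required (shared) resource for build and pull jobs.

    * tests/artifactcache/expiry.py: Add a regression test for
      concurrent commit/prune.

Reviewing against a summary like that makes it much easier to notice
when the diff does something the summary doesn't mention, or the other
way around.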

If you made it this far, thanks for reading ;).

Cheers,

Sander

[0]: https://mail.gnome.org/archives/buildstream-list/2018-July/msg00017.html
[1]: https://gitlab.com/BuildStream/buildstream/issues/623
[2]: https://mail.gnome.org/archives/buildstream-list/2018-April/msg00021.html
[3]: https://mail.gnome.org/archives/buildstream-list/2018-April/msg00030.html
