Re: [BuildStream] Some bad mistakes, lets understand them, and make this work better



Hi Sander,

On Tue, 2018-09-11 at 16:54 +0100, Sander Striker wrote:
Hi Tristan,

Thanks for posting, and for taking the time to include examples.

And thank you for actually reading through it, and for your thoughtful
reply.

On Tue, Sep 11, 2018 at 10:04 AM Tristan Van Berkom via BuildStream-list <buildstream-list gnome org> 
wrote:
TL;DR: We had a regression which causes the artifact cache to get
       corrupted[1], and fixing it was extremely difficult because
       developers are making pretty bad mistakes and they are landing
       on master, outlined in this email. Inviting discussion on how we
       can improve things.


Hi all,

    I'm writing you today because I see evidence that we are all not
doing well enough in review: we are rushing things in too quickly,
whether it is a new feature, a frantic bug fix, or a performance
improvement for our stable release. We have a lot of developers on
hand now, so I don't think we have an excuse.

I think there is more judgement here than is warranted.

Perhaps. I tried to clarify that this is not about blame, and if I'm
passing judgment, I think it is on all of us and I want to share in
that as well.

But to take a step back and put things into perspective...

  * Last week this issue came up with the artifact cache, that by
    itself is not a problem.

  * I tried fixing it with a few simple commits, but that was not
    enough to solve it, that is also fine.

  * Finally, when looking more deeply at what is happening related to
    calculation of cache size, I could not in fact even understand what
    is going on in there.

  * Not understanding the implementation is one thing, but more than
    this, I could not even understand what was *intended* to be going
    on in there by a reading of the code.

  * I have to conclude that there is not a single intention, but rather
    a few crossed intentions.

This raised a big red flag for me; or rather, it was quite a shock.

Now to circle back to the intent of my message: I think that if people
were to see the examples I've laid out here, they would empathize and
share in the shock which I experienced.

By making sure that people understand how only a few bad choices here
and there (especially breaking encapsulation, but the other rules of
thumb also apply) can add up to something that is entirely
incomprehensible, I hope to raise awareness such that this kind of
thing doesn't need to happen on master.

In the interest of improving things, I am going to call out a bunch of
examples of things which have landed on master in recent months, which
really should never have gone as far as landing on master.

I think it might have been better to lead with how we can improve all
the way at the bottom, because I'm pretty sure the majority of the
audience will tune out halfway through.

Definitely, I didn't order that in the best way.

I think the emphasis here should be on what is being paid attention
to in review.
Let me be clear: this is not intended as a rant, and I don't intend to
place blame on specific people; I am as much to blame as the rest of
us, as some of my examples involve code which I did not take enough
time to review. I think we all have to slow our roll here and be more
careful.

Also, especially to those of us who are more experienced: we have to be
spending more time reviewing code than writing code; we need to set a
good example, otherwise peer review really doesn't work and ends up
being just a checkmark that somebody else has looked at the code.

Since we switched to aggressively granting commit access almost 2
months ago[0], we have been allowing some unacceptable things to land,
along with some things which could be better. On the surface, we have
improved velocity, and while this checks a lot of boxes, it is not
worthwhile if our standards of quality go completely off the rails,
leaving us with a dead-ended codebase.

One thing I have to stress is, when you perform a review; first ask
yourself these questions:

  * Does this patch make the codebase simpler than it was before ?

  * Is the newly added code going to be easy for the next developer to
    read ?

  * Does this patch circumvent encapsulation of logic in modules ?

  * Does this patch add transient state to object instances instead of
    properly propagating transient state through function calls ?

Can we add this to HACKING.rst?

I worry that the more we throw into HACKING.rst, the more it becomes a
sort of black hole where we put policy statements that nobody is going
to actually read... that said, after looking at your Subversion
example, I am significantly swayed.

We can probably come up with a few more points here based on the rest
of this email... but these points on their own are pretty solid.
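
To make the last two questions concrete, here is the kind of tiny
illustration I have in mind for HACKING.rst (the class and method
names are made up for the example, this is not code from our tree):

    # Hypothetical illustration only, not BuildStream code.
    class Widget():

        # Discouraged: transient state smuggled through an instance
        # attribute; the coupling between prepare() and flush() is
        # invisible at the call sites and easy to break.
        def prepare(self, size):
            self._pending_size = size

        def flush(self):
            self._do_flush(self._pending_size)

        # Preferred: the transient value travels through the call,
        # so the data flow is explicit and self contained.
        def flush_size(self, size):
            self._do_flush(size)

        def _do_flush(self, size):
            pass  # ... the actual work ...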

Before even considering whether the code works, I want to know that it
*makes sense*. Bugs are easy to fix, and regressions are easy to test
for, but bandaid patches are not acceptable. If your two line patch
breaks encapsulation and inserts logic at an inappropriate place, it is
probably because the surrounding architecture did not account for what
you intend - this means your patch must *modify* the surrounding code
structure in the same patch. Even if that turns a two line patch into a
100 line patch, it is the appropriate course of action: the code base
must always move forward as a whole, and never bit rot due to bandaids
added over bandaids.


So, I'd like for us to discuss how we can go about doing better, and to
that end I will illustrate some of the problems I encountered this
weekend, while trying to solve the crashes and corrupted artifact
caches which are essentially a result of concurrent commit/prune[1].


A series of ugly stories
========================

[....really useful specific examples...]

The examples above are really good material for HACKING.rst.  Even
just linking to this thread might be useful.
 

I am happy to make HACKING.rst more elaborate and structured, maybe if
we structure it well enough we can point to specific sections of it,
and then we can hope that at least the parts of it we point at will be
read.

  Exclusivity of the CACHE resource
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  I hold myself primarily responsible for having let this issue creep
  in in the first place, as it should have been obvious and I was not
  attentive enough in my review of Tristan Maat's code. Even if time
  constraints were an issue, that is no excuse; it only demonstrates
  that we are rushing things into master way too quickly.

I don't think that we can jump to that conclusion.  Or maybe it's the
way that you've phrased it here.  When I read "quickly", I am thinking
about time.  And time is not a good measure in this case.  A code
change doesn't become better with age.

https://gitlab.com/BuildStream/buildstream/merge_requests/347 was
started 5 months ago.  It landed 1 month ago through
https://gitlab.com/BuildStream/buildstream/merge_requests/548.

By gitlab's "A month ago" fuzzy logic perhaps; it landed July 18th,
though.

I agree that it took quite some time, but in this particular case, I
really feel that Tristan Maat was out of his depth and this required a
lot of design work before implementation. So I helped a lot with
designing the changes which needed to happen to the scheduler.

Before continuing on this point, I should point out that this example
should not have been in the email.

I wanted to point out that I made the mistake of being careless, and
that thinking "if I reviewed it a bunch of times, it must be okay" was
irresponsible of me.

However, this is not one of those obvious mistakes (or wrong choices)
about which I can really argue that "it shouldn't ever have landed in
master".

Breaking encapsulation, calling private functions, and adding transient
state to instance variables, pretty much all of the other examples I
listed here are things we should be able to avoid off the bat.

There are 113 "discussions" on there, with several senior developers
participating.

  In order to implement automatic purging of LRU (least recently used)
  artifacts in the local artifact cache, we needed a bit of churn in
  the scheduler.

  I guided Tristan Maat a lot in the features which would be needed,
  and had to clean up a bit of a mess he left behind in the logging,
  which needlessly duplicated code and made things hard to follow.

  In a nutshell: tests were added, there were many iterations of
  review, and by the end I did not have the attention span left to
  verify that everything was working properly.

To me this also smells of the long-lived-branch problem.  The
problems weren't seen because they weren't landing on master.  And
when the whole change did land on master, it was too big to properly
review again.  And besides, several people had looked at it already,
it baked for 4 months, so it must be good, right?

So, I quite strongly disagree with this.

Deciding that a patch makes the cut due to the number of review rounds
which previously occurred can only make sense if the branch under
review is really only churning slightly in response to the review
rounds - which really wasn't the case here.

<off topic>
I should add though, this was under development during 1.1.x, which was
a time when landing stuff in master affected users.

I keep trying to get this message across: the 1.1 cycle *was special*.

The first stable 1.0 series was very immature; we couldn't really
expect existing users to use it. The 1.0 branch was still *important*,
however, in order to implement the right processes to handle API
stability correctly, and to have a healthy release cycle in place -
call the 1.0 release a "practice round" if you like, it was still
useful.
</off topic>

With that said, it would be much more acceptable to land the artifact
expiry branch with *less* review, in the first few months of 1.3.x
(probably not something to push in near the end of a cycle).

Still, it was my mistake to group that with the other kinds of clear
mistakes which could have been obviously avoided. Reminiscent of the
"One of these things" game we used to play on Sesame Street.

  Essentially, the underlying implementation details were done
  properly: basically, we added a "read write lock" mechanism in
  _scheduler/resources.py to ensure that we could handle the
  exclusivity of jobs which require the artifact cache resource, and
  to ensure we don't have starvation, by prioritizing jobs which
  require exclusive access to a resource.
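
  (For anyone who has not read _scheduler/resources.py: the general
  shape of such a mechanism is roughly the sketch below. This is only
  an illustration of shared vs exclusive reservation with priority
  for exclusive waiters, not the actual implementation, and the names
  are made up.)

      # Illustrative sketch only, not the actual resources.py code.
      class Resource():

          def __init__(self):
              self._shared_count = 0        # jobs holding shared access
              self._exclusive_held = False  # a job holds exclusive access
              self._exclusive_waiting = 0   # queued exclusive jobs

          def register_exclusive_interest(self):
              # Called once when an exclusive job is queued, so that
              # new shared jobs stop being admitted until it has had
              # its turn (this is the starvation guard).
              self._exclusive_waiting += 1

          def try_acquire(self, exclusive):
              if exclusive:
                  # Exclusive access waits for all other holders to drain.
                  if self._shared_count == 0 and not self._exclusive_held:
                      self._exclusive_held = True
                      self._exclusive_waiting -= 1
                      return True
                  return False
              # Shared access is refused while an exclusive job holds
              # the resource *or* is waiting for it; this is what
              # prevents starvation of jobs such as cache cleanup.
              if self._exclusive_held or self._exclusive_waiting > 0:
                  return False
              self._shared_count += 1
              return True

          def release(self, exclusive):
              if exclusive:
                  self._exclusive_held = False
              else:
                  self._shared_count -= 1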

  The result in the end however was that:
    * Cache size calculation jobs became mutually exclusive to cache
      cleanup jobs
    * The pull jobs and build jobs, which write to the artifact cache,
      did not require the artifact cache resource *at all*
    * Only one job was allowed to access the artifact cache at a time,
      regardless of exclusivity.

  The result is that the whole thing we designed to ensure pull/build
  could not run at the same time as cleanup was not actually being
  used for this purpose.

After 4 months it's easy to lose track of the true intention / design
of the code change, and to lose excitement.

[...] 
In summary, How can we do better ?
==================================
In closing, I think it's appropriate to review the "Project is under
heavy load" discussion we had back in April[2].

I have to say that I agree with Sander for the most part that the
"land first and fix it later" approach works well in the settings in
which I have experienced it. But I should admit that in those
circumstances we did not allow relative novices to commit large
features without heavy review either, and with a handful of 5 to 7
full time developers with a good 10 years of experience under their
belts, you only have to worry about the occasional regression, which
you can indeed fix quickly after the fact.

I'll admit that I didn't quite expect us to be as aggressive with our
commit access as we have been - we sort of went from one extreme to
the other :).  I was expecting us to have existing committers invite
others to the set of committers.  Essentially, give commit access to
people you feel can be trusted to land a change with confidence, or
call out for help when they need it.  In essence, that they show good
judgement.

I very strongly agree with you and would prefer a tighter circle of
committers myself.

I'm not sure how we can really describe that in terms of policy,
though, and it was raised to me that we probably need to be acting on
policy.

I can see that side of the coin, even though I personally have greater
confidence in a completely ad hoc approach where people use critical
thinking to decide to hand out commit access on a case by case basis
(it wouldn't be perceived as "fair", but I'm not sure it needs to be).

I definitely want to be careful with putting people in categories
like novice or senior or junior.  I prefer us to think of everyone as
peers, each with strengths and weaknesses.  I'm not worried about
people overstepping boundaries; all the code is in version control,
and we can always go back to a previous state - at a cost.

Well, I think that what you are saying in the last paragraph shows that
you agree that we have a higher level of trust with some actors, and
those are preferably the ones we would choose as committers.

I might be wrong to use those words, but to me, demonstrated
experience in maintaining software - learning from past mistakes which
led to unmaintainable code and having to aggressively refactor - is
usually an indicator that someone is considering the right things when
landing patches.

That said; this is my explanation of why I chose those words, I am fine
with other vocabulary, or accepting more broadly that some people are
better at some things.

I am not suggesting that we abandon the approach entirely either; we
have to work on making it work better.

+1.
 
I should also stress that I am not worried about our team members'
ability to add regression tests with their patches and make the
software mostly "work"; I am worried about dead-ending the overall
codebase by growing warts on it such that there is no longer any
comprehensible design or control flow, at which point no feature can be
added or bug can be fixed without mass rewrites.

What can we do to make things better ?

  * I submit that peer review on its own is not enough; we need to
    have the more experienced developers perform review for the
    entry level developers.

I want to be careful phrasing it like this.
 
    For more trivial patches, developers should also seek out the
    review of those peers who are more familiar with a given area of
    the code base than the patch submitter.

When someone _lands_ a change in an area that you are familiar with,
you will pay more attention.  When a change sits on a branch, waiting
for your review, it's a lot easier to let it sit there than to give
it your full attention :)

Of course that view is also valid; on the other hand, that branch is
not doing any harm while it is not landed in master - once you land
something which needs a redesign in master, other branches start to
build on top of something broken, and things get crooked quickly.

I am sure that for the category of mistakes, or bad choices, which I've
called out - we need to do better at making sure those never hit master
in the first place.

The reason for this is that I in fact agree with you that we should
land more branches and sometimes have to fix problems afterwards,
rather than stagnating *too* long in review.

Subtle design details can of course be corrected post landing by
people who are familiar with a part of the code, if they didn't get a
chance to review first, or have later changed their mind - but the
cost will be too high if this category of problem is landing in master
on a regular basis, and I think that it is.

  * We should accept that we are going to get poked constantly for
    reviews, this means our more experienced developers need to slow
    down their velocity for the sake of the safety of the project as
    a whole.

    If you didn't have time to review a patch, and nobody appropriate
    had the time to review a patch, it is not acceptable that patches
    keep landing without the review of someone who knows what they
    are doing - it is MUCH better that the feature is postponed if
    it saves us from the kinds of mistakes I've outlined in the horror
    stories above.

I'm of a slightly different mind here.  Typically people will pay
attention to what is landing on master - given enough visibility.  We
haven't really gotten our -commits@ list sorted properly in that
respect.

If we feel that changes are landed prematurely, and effectively
people are showing poor judgement, then that's a different story.  We
should go back to reviewing commit privileges in that case.

I feel that we are.

But more accurately, I feel that we are not making changes in the
right ways in the first place, and would rather place the emphasis on
that than on the timing.

The timing will also adjust itself if we are more critical of what we
deem acceptable to land.

OT: the buildstream-notifications list is a big mess that isn't
really helpful in seeing changes; why are the buildgrid notifications
going there?  And what's with the prefixes?

  * All developers (or at least "committers") have to take these basic
    concepts seriously; I cannot be the only one who cares about the
    future of the codebase.

I do find it a bit daunting that you posit you're currently the only
one who cares.  That is neither inspiring nor motivating for others
who put in time to do reviews.

Yes, bad me, that was drastically phrased.

As I've also mentioned in the same email; I know that Tiago
specifically has demonstrated that he cares enough to submit patches
which refactor some of the things I've called out in this email.

That said, I find myself rather alone on this front, which is sad, and
more seriously: I am not enough people to refactor the code base at the
rate at which others are landing features.

I do want to see more patches which don't actually close issues, but
just fix code that is redundant, misleading, or outright dangerous.

Also, "consideration" is a better word here than "caring".
Consideration of the side effects your patch has to others who will
have to contribute to the same code base, takes a conscious effort, and
we need more of that effort.

I don't believe that "people don't care", but I believe that where
there is a lack of consideration, it is mostly a lack of experience.

We just need to communicate better why we make the technical decisions
we make in review (which I'm trying to illustrate with my examples of
what not to do), so that others who end up being reviewers make the
same considerations we do.

    We are a team, we will succeed or fail as a team. In order for us
    to succeed, I need everyone on the team to care.

+1.

  * People need to feel empowered by their commit rights, surely.

    But with that, these people should feel the gravity and
    responsibility which come with it; commit privileges should be
    understood as a great honor, and no line of code should land
    lightly, without due weight given to its side effects.

:).  I don't know why, but sometimes I feel that the way the message
is delivered is so gloomy...

I prefer to think of it this way: people should feel responsible. 
You should feel especially responsible when you land changes to
master, as others will be impacted by them.
However, that responsibility shouldn't crush you and stop you from
doing the right thing.  Mistakes will be made, own up to them, help
fix them, learn.

I'm going to the extreme a bit with my emphasis but yes I agree with
you indeed.


  * Managers need to lower their expectations of velocity.

    We have many people on the project now, but as someone 
    famous already said:

      "9 women cannot collaborate to create a baby in one month"

    This has been tried before, and it always fails.

    We all need to be a team and share the burden of responsibility.
    If our developers are inexperienced, under pressure, and given the
    right to commit their work without thorough review from a senior
    developer, then that is a clear recipe for dead-ending our precious
    code base.

    I submit that senior developers are better equipped to handle
    pressure and should know where to draw the line responsibly and
    say "No" at the right times, while juniors need to be given more
    freedom to complete tasks safely.

I think this is out of scope for the project, and up to the community
members themselves.  One thing to point out is that developers on a
project do have individual responsibility.  It is their actions we
see, not what is happening in the confines of the office of their
employer.

Maybe this is out of scope, but managers are a part of our team too.

Also, when I see the category of problems I have highlighted, I can
only imagine that some of these things landed because the developers
were rushed, so it's important to consider, and at least call out, all
the aspects here, even if some of them are out of our control.

  * As also discussed in the "Project is under heavy load" thread[3],
    it may go a long way to maintain some architectural diagrams to
    better guide developers through our codebase.

    I personally don't have experience working on any project which did
    this so I'm not sure how effective it can be, but we certainly need
    to communicate more clearly how components in our code base are
    intended to interact, and what their respective responsibilities
    are, since we are clearly seeing that developers are circumventing
    the architecture to complete their tasks.

Admittedly, when I finally got around to taking your original
diagrams and document, and then comparing it to the codebase, things
are not quite the same anymore.  There is now a Stream, for
instance.  I timed out, and apologize for that.

Yes, since that document was created before the first line of code was
written, I am very happy to see just how much it actually resembles the
beast we've created :)

I'm not sure what else we can do here. I think that we need to make
public vs private members, and single vs double underscores, more clear
in the HACKING.rst somehow. The concept is not very difficult, and we
only really deviate from the Python standards because we need to hide
some methods on public classes which our core calls into internally.
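
For instance, HACKING.rst could carry something as small as this (an
illustration of the kind of convention I mean, with made up method
names, not a quote of actual policy):

    # Illustration only; the method names are made up for the example.
    class Element():

        def get_kind(self):
            # No underscore: public, documented API for users/plugins.
            ...

        def _schedule_assemble(self):
            # Single underscore: private to the BuildStream API as a
            # whole, but intentionally callable from other core modules.
            ...

        def __update_state(self):
            # Double underscore: private to this class alone.
            ...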

We cannot however solve everything with policy in the HACKING.rst,
people have to care about concepts like encapsulation; and the
HACKING.rst cannot become a course on how to write maintainable code.

My suggestions above were really helpful in that respect :D
Seriously, HACKING.rst is already at a point where it needs an
index.  Maybe we need to split things out.  An example from the
Subversion project that grew over time:   
 http://subversion.apache.org/docs/community-guide/

I would recommend reading through the first two sections just to get an idea.

This does look well structured and quite elaborate, without necessarily
being too onerous for anyone to read.

I did comment above that it can be a bit of a black hole where we only
tick boxes that make us feel like we've advertised policy, but I agree
that from your example it can actually be useful.

Any other ideas on how we can implement better pre-merge disaster
prevention ?

I'm going to say pre- or post- merge :).

The intent of the change needs to be clear when doing a review. 
Knowing what changes are expected to be in there is useful.
Continuing with the same example project as above, I used to
appreciate the commit log messages a great deal.  Random example:
https://github.com/apache/subversion/commit/d29903f400be65a6398bba58d446fd95b69ce019

The commit message outlines the overall change, and then the change
per file (or group of files).  This allows for two things:
- verify that the summary makes sense.
- verify whether the code changes are matching the summary.

On my part, I usually observe the commit messages when reviewing a
merge request, and only sometimes do I review a merge request commit by
commit, usually to ensure we are not including the history of how the
feature was created and fixed into mainline.

That said, I should probably review commit by commit more often and if
people do that, we should see improvements there.

I have another technical proposal to solve parts of the issues in my
examples, which is an approach I considered when originally writing up
the public facing APIs we have for Element / Source etc., but at the
time I wrote the idea off as over-engineering.

Basically the idea is like this:

  * For every class that is public API, we have a second class that is
    private API; for the sake of using trendy names, let's call these
    classes Controllers.

    E.g. ElementController, SourceController, etc.

  * The controllers hold public functions for BuildStream internals
    to *call*, but all state is still held on the actual public
    instances.

  * Only the controllers can access the private members of their
    public counterparts.

This approach could allow us to actually lint for any private member
accesses, by only excluding the controllers themselves from this
linting.

Then we could use the regular python conventions for public/private
members and their notations, hopefully making things all around more
clear.
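
As a rough sketch of the shape of this (the names and the member being
accessed are made up for the illustration, not a proposed final API):

    # Rough sketch of the proposed split; names are illustrative only.
    class Element():
        """Public class: exposes the documented API and holds the state."""

        def __init__(self):
            self._cache_key = None   # private state, still stored here

        def get_kind(self):
            ...


    class ElementController():
        """Private counterpart: the only code permitted to touch the
        private members of its Element, so that private member access
        can be linted for everywhere else in the core."""

        def __init__(self, element):
            self._element = element

        def update_cache_key(self, key):
            # Core internals call this instead of reaching into the
            # Element directly; the state itself remains on the public
            # instance.
            self._element._cache_key = key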

Of course, it's significant churn and costs some effort, and feature
branches would conflict a lot; but the refactor would be rather
mechanical and straightforward to carry out (albeit still a lot of
work).

Perhaps this change is worth it for the added linting safeguards it
can provide.


If you made it this far, thanks for reading ;).

It was a great reply, thanks for writing it !

Cheers,
    -Tristan


