first kill for the Cooperative Bug Isolation Project

Some of you may already be aware of the Cooperative Bug Isolation Project: <>. This is a research project at UC Berkeley and Stanford University that tries to find bugs by identifying statistically significant differences in program behavior between good and bad runs. We have a public arm, which offers instrumented binaries for popular GNOME packages for anyone to download and use. We have also been using our approach together with scripted runs to generate large numbers of feedback reports quickly.

I am pleased to report that Cooperative Bug Isolation has killed its first GNOME bug. As described in <>, mining data from scripted Rhythmbox runs reveals a mismanaged timeout event source ID that can result in a fairly large number of crashes. These crashes result from memory corruption, so the post-crash stack is essentially useless. Our feedback instrumentation identifies the problem by revealing that crashes are far more likely when a specific g_source_remove() call on a specific line of code returns values greater than zero.

The same problem actually appears twice in Rhythmbox; the second instance may be responsible for previously reported bug <>.

I bring this to the <gnome-bugsquad> and <rhythmbox-devel> lists' attention for two reasons. First, Luis Villa told me that this is cool enough that I should spread the word. :-) Second, if this timeout mismanagement bug appeared twice in Rhythmbox then it may appear in other code too.

The mistake is to keep around a timeout event source ID after the corresponding timeout callback has returned FALSE. When the callback returns FALSE, the timeout event source is implicitly destroyed. That means that this event source's ID number is no longer valid. Keeping it around is the ID number equivalent of having a dangling pointer.
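To make the dangling-ID analogy concrete, here is a minimal sketch in plain C. It does not use GLib: the table and the model_* functions are hypothetical stand-ins invented for illustration. In particular, the "lowest free ID" allocation policy is an assumption chosen so that the collision happens immediately; it is not a description of how GLib actually assigns IDs.

```c
#include <stdbool.h>

#define MAX_SOURCES 8

/* in_use[id] is true while source `id` is alive; ID 0 is never handed out.
 * This table is a toy model, not GLib's real bookkeeping. */
static bool in_use[MAX_SOURCES + 1];

/* Hypothetical allocator: hands out the lowest free ID, so a freed ID is
 * reused right away (an assumption made to keep the demo short). */
static unsigned model_source_add(void)
{
    for (unsigned id = 1; id <= MAX_SOURCES; id++) {
        if (!in_use[id]) {
            in_use[id] = true;
            return id;
        }
    }
    return 0; /* table full */
}

/* Models the implicit destruction that follows a callback returning FALSE. */
static void model_callback_returned_false(unsigned id)
{
    in_use[id] = false;
}

/* Modeled on g_source_remove(): reports whether a live source with this ID
 * was found and destroyed. */
static bool model_source_remove(unsigned id)
{
    if (id == 0 || id > MAX_SOURCES || !in_use[id])
        return false;
    in_use[id] = false;
    return true;
}

/* Returns true when removing the stale ID destroyed an unrelated source. */
bool stale_id_kills_unrelated_source(void)
{
    unsigned stale = model_source_add();       /* our timeout */
    model_callback_returned_false(stale);      /* source is gone, ID is kept */
    unsigned victim = model_source_add();      /* unrelated source reuses the ID */
    bool removed = model_source_remove(stale); /* "cancel" our long-dead timeout */
    return removed && victim == stale && !in_use[victim];
}
```

In this model the final remove call succeeds, and that success is exactly the symptom: the source it destroyed belonged to someone else.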

What we see in Rhythmbox is that the event source ID is being retained in a private field of an object even after the callback has returned FALSE. Later on, that object might decide to destroy the event source by calling g_source_remove() on this stale ID. If the ID number has been reassigned to some other event source, that other source will be prematurely destroyed. A positive (TRUE) return value from g_source_remove() indicates that it did find and destroy some unlucky event source. In the case of Rhythmbox, this shows up as an increased likelihood of crashing when one particular g_source_remove() call returns a positive value.
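One way to avoid the problem, sketched below in plain C, is to clear the stored ID inside the callback before it returns FALSE, and to treat zero as "no timeout pending" when cancelling. This is an illustration, not the actual Rhythmbox fix: the toy_* functions, the player struct, and its timeout_id field are all hypothetical, and the toy dispatcher only imitates the behavior described above rather than calling GLib.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_SOURCES 8

typedef bool (*toy_callback)(void *data);

typedef struct {
    bool in_use;
    toy_callback func;
    void *data;
} toy_source;

/* Toy source table; index 0 is unused so that ID 0 can mean "no source". */
static toy_source sources[MAX_SOURCES + 1];

/* Hypothetical stand-in for g_timeout_add(): returns the lowest free ID. */
static unsigned toy_timeout_add(toy_callback func, void *data)
{
    for (unsigned id = 1; id <= MAX_SOURCES; id++) {
        if (!sources[id].in_use) {
            sources[id] = (toy_source){ true, func, data };
            return id;
        }
    }
    return 0; /* table full */
}

/* Hypothetical stand-in for g_source_remove(). */
static bool toy_source_remove(unsigned id)
{
    if (id == 0 || id > MAX_SOURCES || !sources[id].in_use)
        return false;
    sources[id].in_use = false;
    return true;
}

/* Fire every live source once; a callback that returns false has its
 * source destroyed implicitly, as with a timeout callback returning FALSE. */
static void toy_dispatch_all(void)
{
    for (unsigned id = 1; id <= MAX_SOURCES; id++) {
        if (sources[id].in_use && !sources[id].func(sources[id].data))
            sources[id].in_use = false;
    }
}

/* Hypothetical object that remembers its timeout ID in a private field. */
typedef struct {
    unsigned timeout_id;
} player;

static bool player_on_timeout(void *data)
{
    player *p = data;
    p->timeout_id = 0; /* forget the ID: the source dies once we return false */
    return false;
}

static void player_cancel_timeout(player *p)
{
    if (p->timeout_id != 0) { /* only remove a source we still own */
        toy_source_remove(p->timeout_id);
        p->timeout_id = 0;
    }
}

static bool keep_running(void *data)
{
    (void)data;
    return true;
}

/* Returns true if the unrelated source survives the player's cancel call. */
bool fixed_pattern_spares_other_source(void)
{
    player p = { 0 };
    p.timeout_id = toy_timeout_add(player_on_timeout, &p);
    toy_dispatch_all();                                   /* callback fires, ID cleared */
    unsigned other = toy_timeout_add(keep_running, NULL); /* reuses the freed ID */
    player_cancel_timeout(&p);                            /* no-op: ID is already 0 */
    return other != 0 && sources[other].in_use;
}
```

The same two-part discipline should carry over to real code that stores timeout IDs: reset the field wherever the callback returns FALSE, and check it before calling g_source_remove().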

I'm going to repost this description to <desktop-devel-list>, both to document this easy-to-make error and to suggest that developers audit their own code to see if they are doing the same thing.
