Re: call for collaborators: bug isolation via remote program sampling



Martin H. wrote:
> The problem is that nobody knows what these reports reveal (do you
> even know the *practical* implications?).  So most likely we have to
> assume they reveal any action done by the user and any file read by
> the program.  Is that correct?

I haven't gone into much detail about the reports so far, but that's
just because I wanted to gauge initial interest before burying people
in the particulars.  Since you've raised the question, let me describe
what goes on in a bit more detail.

There are a few variations on what sort of instrumentation we would
insert.  In all cases, there is conceptually a two-step process.
First, we add instrumentation at various program points.  Second, we
change that instrumentation so that it is randomly sampled rather than
running all the time.

Here are three examples of instrumentation we have experimented with
so far:

  - assert() and similar statements

      These might be put in explicitly by programmers or they might be
      added automatically by a tool like CCured
      <http://manju.cs.berkeley.edu/ccured/>.

      Directly reporting the results of each assert() call would yield
      a report that grows larger and larger the longer the program
      runs.  Instead, we would maintain a single counter for each
      assert(), and just count how many times each assert() was true
      versus false.

  - function return values

      Keep a triple of counters for each function call.  Each time the
      function is called, bump one of the three counters depending on
      whether the returned value was negative, zero, or positive.
      (For pointer-returning functions, just use two counters: null
      and non-null.  Calls to void-returning functions are not
      instrumented at all.)

  - scalar pairs

      At each scalar assignment "x = ...", find all of the other
      variables that are simultaneously in scope and have the same
      type as "x".  Keep a triple of counters for each of these.
      After the assignment, bump one of the three counters depending
      on whether x's new value is less than, equal to, or greater than
      each other variable.

      Essentially what we're doing here is making wild guesses about
      relationships among program variables.  What we're looking for
      are relationships that hold on successful runs but which are
      violated when the program crashes.  Most of those guesses will
      be wrong or meaningless, but that's just noise that we can
      filter out given enough runs.  We have successfully used this
      instrumentation scheme to isolate a previously unreported buffer
      overrun bug in "bc".

That's the sort of basic instrumentation we're talking about.  Then we
transform it to be sampled randomly rather than every time.  For
example, instead of bumping one counter after every function call, we
might bump only once per hundred or once per thousand times.  (The
actual sampling is random with some average frequency, rather than
being trivially periodic.)

So what does this reveal about user actions?  Well, it's pretty clear
how one could take a report and turn it into code coverage data.  The
coverage information will be noisy due to sparse sampling, but it
would reveal in broad terms which parts of the code were and were not
used on a given run.

Does this reveal more detailed information, such as passwords or file
contents?  Not directly, and only in a very limited indirect sense.
We're not recording or even sampling any values directly.  At no point
does a report include, for example, the value of some (char *)
variable that might hold a password.

Instrumenting the return value for a call to fgetc() would tell us how
often the file being read contained null bytes versus non-null bytes.
That plus counting the number of negative returns would give us an
idea of the average length of files being read on that line.  So
that's a piece of information that these reports would reveal.

Is that sensitive information?  Usually no, but perhaps occasionally
yes.  Is that information that you're willing to trust me with, if I
tell you that reports will be transmitted over an encrypted channel,
stored on a firewalled server, accessed only by me and my research
collaborators, and used for research purposes only with no personally
identifying information?  Well, I hope so.  If not, then that's why
I plan to make opting in/out an explicit decision by each user.

Martin, I don't know if this adequately addresses your concerns.  If
it doesn't, we should figure out what else would be needed.  My
research depends on aggregating large numbers of runs, and can only
succeed if people feel comfortable opting in.  So I want to be sure
that I can address all reasonable privacy concerns.

(I should say that while *I* feel comfortable with these measures, it
may be rather more difficult to explain all of this to a non-technical
user.  I can talk to you folks about variables and functions and
assertions, and you know just what I mean.  A non-programmer really
won't know what to make of all of that, and would therefore find it
harder to make an informed decision.  That's something I'm still
trying to work out.  It's a social issue rather than a technical one,
but it's just as important if this approach is to succeed.)



