Censored data in Goose



A while ago, Asger was talking about how to handle censored data in a
Goose DataSet.  We now return to that thread, already in progress...

On Wed, Sep 16, 1998 at 09:53:14AM +0200, Asger K. Alstrup Nielsen wrote:
> The problem is that we need to allow for missing data.  In a particular
> dataset, some of the values will be tagged "missing", "erroneous" or 
> something like that.  In Goose, we can settle with two different
> states:  Value is valid, and value is invalid.
> 
> I've considered three ways of implementing this:
> 
> 1) Add a boolean flag to every value.  If the flag is false, the value is
>    to be ignored.  The problem with this is that we affect performance:
>    Every time we want to access a value, we need to check the flag first.

I agree that this is kind of gross.  Inevitably code will get written
that loops over the data without checking the validity flag, so that
code will break on DataSets with censored data.  And checking validity
everywhere (including inside of inner loops) sucks.

> 2) Add an integer index value to every value.  With this, we can change
>    the sorting representation to be another index list, rather than the
>    duplicated values (which might be a good idea if the type of data
>    is abstracted away.  For instance, if multi-precision floats are
>    used, the memory impact of duplicate values could be expensive.)
>    The idea is of course only to insert the valid values.  I.e. the
>    set of indexes is suddenly not contiguous, but this shouldn't be
>    a problem.

This gets a little gross, because suddenly looking up a data value is
no longer a constant-time operation... you have to search for the
value with the matching index value.

Also, you lose the ability to just get at the DataSet as a regular old
C array of doubles, via the data() command.  Or at least, this stops
being constant time as well.  You'd have to build up some sort of
array of doubles with NANs stuck in for all of the censored data, or
something.  It would be gross.  And then all of the code that just
goes straight to the const double* for efficiency would get littered
with isnan() checks.  Yuck.
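
To make the complaint concrete, here is a rough sketch of what a
consumer of such a NAN-padded array would have to look like.  The
function name mean_skipping_nan is made up for illustration and is not
part of Goose; the only assumption is that censored entries show up as
NANs in the array.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>

// Hypothetical helper (not Goose API): mean of the non-NaN entries of a
// raw array in which censored values have been replaced by NaN.
// Returns NaN when there are no valid entries at all.
double mean_skipping_nan(const double* data, std::size_t n)
{
    assert(data != nullptr || n == 0);

    double sum = 0.0;
    std::size_t valid = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (std::isnan(data[i]))   // the check that lands in every inner loop
            continue;
        sum += data[i];
        ++valid;
    }
    return valid ? sum / static_cast<double>(valid)
                 : std::numeric_limits<double>::quiet_NaN();
}
```

Every loop that currently walks the const double* directly would need
the same isnan() branch, and forgetting it quietly turns the result
into a NAN.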

> 3) Use my own datastructure, and just use DataSet as a tool as any
>    other, where I will build the table each time I need it.  The problem 
>    with this approach is that I lose the constant-time statistics.

I agree that this is a drastic step.  It might be the best way to
go... I'm not sure.  What I'd like to see is a solution that satisfies
what you might call the "Stroustrup Doctrine" --- adding the feature
should impose essentially no overhead on people who choose not to use
that feature.

So maybe one could do something like (2), but only have the index be
there for DataSets that have censored data.  In other cases, the index
would just be implicit, and wouldn't be allocated.
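
Something along these lines, perhaps.  This is only a sketch under my
own assumptions: the class IndexedData and all of its members are
hypothetical names, not the real DataSet interface.  The index vector
stays empty until the first censored value arrives, so uncensored data
keeps its constant-time lookup and pays nothing for the feature.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical sketch (not the Goose DataSet): stores only valid values.
// An explicit position index is allocated lazily, on the first censored
// entry; until then position i is implicitly values_[i].
class IndexedData {
public:
    // Append a valid value at the next position.
    void add(double x)
    {
        if (has_index_) index_.push_back(next_);
        values_.push_back(x);
        ++next_;
    }

    // Record a censored entry at the next position; nothing is stored.
    void add_censored()
    {
        if (!has_index_) {
            // Materialize the implicit index 0, 1, ..., n-1.
            index_.resize(values_.size());
            std::iota(index_.begin(), index_.end(), std::size_t{0});
            has_index_ = true;
        }
        ++next_;
    }

    // Fetch the value at position pos.  Returns false if pos is out of
    // range or censored.  O(1) without an index, O(log n) with one.
    bool lookup(std::size_t pos, double& out) const
    {
        if (pos >= next_) return false;
        if (!has_index_) { out = values_[pos]; return true; }
        assert(index_.size() == values_.size());
        auto it = std::lower_bound(index_.begin(), index_.end(), pos);
        if (it == index_.end() || *it != pos) return false;
        out = values_[static_cast<std::size_t>(it - index_.begin())];
        return true;
    }

    // Valid values only, packed contiguously.
    const double* data() const { return values_.data(); }
    std::size_t valid_count() const { return values_.size(); }
    bool has_explicit_index() const { return has_index_; }

private:
    std::vector<double> values_;
    std::vector<std::size_t> index_;   // sorted positions of valid values
    std::size_t next_ = 0;             // next position to be assigned
    bool has_index_ = false;
};
```

The price is that lookup by position turns into a binary search once
the index exists, and data() then hands back only the valid values,
packed together, rather than one slot per position.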

On the other hand, maybe the most logical thing to do would just be to
make DataSet NAN-aware: have it do the right thing if you add() a
NAN, and give it a valid() query.  Things might get a bit gross as we'd
have to pepper the code with calls to valid() and isnan(), but at least
we wouldn't be hopelessly cluttering DataSet's design.
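
Very roughly, and with made-up names (NanAwareData is an illustration,
not the actual DataSet class), the NAN-aware variant could look like
the sketch below.  It keeps a running sum over the valid values only,
so the mean stays constant time; I'm assuming the other cached
statistics could be handled the same way, which this sketch does not
show.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical sketch (not the Goose DataSet): a container that accepts
// NaN as the marker for a censored value and keeps its running
// statistics over the valid entries only.
class NanAwareData {
public:
    // Append a value; NaN is stored but excluded from the statistics.
    void add(double x)
    {
        values_.push_back(x);
        if (!std::isnan(x)) {
            sum_ += x;
            ++n_valid_;
        }
    }

    // True if entry i holds a real (non-censored) value.
    bool valid(std::size_t i) const
    {
        assert(i < values_.size());
        return !std::isnan(values_[i]);
    }

    // Constant-time mean of the valid entries; NaN if there are none.
    double mean() const
    {
        return n_valid_ ? sum_ / static_cast<double>(n_valid_)
                        : std::numeric_limits<double>::quiet_NaN();
    }

    std::size_t size() const { return values_.size(); }
    std::size_t valid_count() const { return n_valid_; }

    // Raw array, one slot per entry, NaN where the value is censored.
    const double* data() const { return values_.data(); }

private:
    std::vector<double> values_;
    double sum_ = 0.0;
    std::size_t n_valid_ = 0;
};
```

Lookup by position and data() both stay constant time here; the cost
moves to callers, who have to ask valid() before trusting an entry.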

What do you think?

-JT


