Re: [guppi-list] Re: use of GSL in guppi

From: "Asger K. Alstrup Nielsen" <alstrup diku dk>
To: guppi-list gnome org
Subject: Re: [guppi-list] Re: use of GSL in guppi
Date: Wed, 16 Sep 1998 09:53:14 +0200 (METDST)
> We're on version 0.0.2.  Nothing is fixed. :-)  Seriously, I guess it
> depends on which data structures we are talking about...

The problem is that we need to allow for missing data.  In a particular
dataset, some of the values will be tagged "missing", "errorneous" or 
something like that.  In Goose, we can settle with two different
states:  Value is valid, and value is invalid.

I've considered three ways of implementing this:

1) Add a boolean flag to every value.  If the flag is false, the value is
   to be ignored.  The problem with this is that we affect performance:
   Everytime we want to access a value, we need to check the flag first.

2) Add an integer index value to every value.  With this, we can change
   the sorting representation to be another index list, rather than the
   duplicated values (which might be a good idea if the type of data
   is abstracted away.  For instance, if multi-precision floats are
   used, the memory impact of duplicates values could be expensive.)
   The idea is of course only to insert the valid values.  I.e. the
   set of indexes is suddenly not continous, but this shouldn't be
   a problem.

3) Use my own datastructure, and just use DataSet as a tool as any
   other, where I will build the table each time I need it.  The problem 
   with this approach is that I loose the constant time statistics.

In case you think that this addition to the data structure is too
dramatic, I'll settle for 3).  In that case, I don't need the
"remove item" method after all.

Personally, I think the option 2) is best.  This approach also fits
nicely with the permutation scheme, you described, for sorting vectors.
Also, it offers a new interpretation of the DataSet:  It is an infinite
list of values, where some of them are missing.  This in turn offers
a nice interpretation of out-of-range errors:  They are simply accessing
missing values.
By using templates, we should be able to hide this extra indexing for
the users that just need the default handling.  We'll use an
"index-trait", similar to the char-traits of the STL.

I don't know whether the index should be usable in performing statistics,
or not.  We could do this in a clean way by offering a 

	DataSet DataSet::indexAsSet() const;

method that returned a new DataSet where the former indexes are converted
to values, and new implicit indexes are returned.

> Since good exception support hasn't really been available under g++ on
> Intel, I've never really use exceptions and am therefore a total
> exceptions idiot.  So if exception-based error handling is added to
> Goose, I most certainly *shouldn't* be the one to do it.

It is simple:  Add a "throw exception" when an error situation occurs.
exception is an instance of a exception class that describe the problem.

It's up to the user to add try-catch constructs to handle it.

> But it sounds like a good idea.  My big approach to error handling up
> until now has been:
>   (1) Silently fail, and reset the data structure in question.  (For
>       example, when applying log_transform() to a DataSet that
>       contains non-positive values.)
>   (2) Return a NAN.  (For example, when asking for the mean() of an
>       empty DataSet.)  But who wants to check every return value w/
>       isnan()?  Not me.
>   (3) Crash.  (No examples leap to mind, but I'm sure that they are in there.)

I agree that none of this is acceptable when we have the option to use 
real exceptions.

I'll have a look at instrumenting the code with exception handling then.

> Should be easy enough to add.  I originally designed the DataSet to be
> a bit opaque and "read-only", rather than just being a glorified
> array.  But commands to remove an item or a range of items should be
> painless to add and run in basically constant time.  (Except you do
> have to sweep the whole array again if one of the items you remove ==
> min_ or max_...)

If this is added, I suddenly remember that we need to be able to add
an item at any place too, before we are completelt there.  This add 
should shift the other values away, such like remove should.

Finally, we need an update() method, which for starters can be implemented
like a remove and add pair (sub-optimal).

I realize we are reimplementing a general array now, but I think it
is worth it.

> Yeah, I've thought about this.  The way I had been planning to do this
> was to introduce a Permutation object and add (at least) two functions
> to DataSet:
>   (1) apply_permutation() would take a permutation and re-order the
>       DataSet accordingly.
>   (2) permutation_sort() would sort the DataSet and then return the
>       Permutation object corresponding to the transformation mapping
>       the original (unsorted) data to the sorted state.  Then if you

That's a nice and clean solution.

> Look away.  I've been coding in a bit of a vacuum, and having other
> people involved and looking at the code always helps.

Ok, for starters, I'll look at the error handling then.

Greets,

Asger Alstrup
Follow-Ups:
- Censored data in Goose
  - From: Jon Trowbridge
References:
- Re: [guppi-list] Re: use of GSL in guppi
  - From: Jon Trowbridge
[Date Prev][Date Next] [Thread Prev][Thread Next] [Thread Index] [Date Index] [Author Index]