Re: [guppi-list] Re: goose datatype system



Lots of issues to hack through...

On Mon, Nov 30, 1998 at 10:50:11PM +0100, Asger Alstrup Nielsen wrote:
> > The DataSet defines a lot of operations that are really not defined
> > for other data types.  After all, it is completely bogus to calculate
> > the skewness of a set of unordered categorical data.
> 
> Yes, this is a valid concern.
> ...
> I had this in mind with the hybrid approach I talked a bit about.
> Each container will present an interface, where some of the methods are shared
> by all containers (thus we simulate polymorphism). 

I'm starting to favor this approach.  Each container would then
provide a meaningful, minimal set of routines for accessing the
contents.

> (Is "ordered categorical data" the same as "ordinal" data?  I'm only
> sure of the Danish terms for these things.)

Yes, they are the same.


> > For dates, what should you be able to do (besides access individual
> > data elements): find the min and max date.  That is the only
> > meaningful operation that I can think of off the top of my head.
> 
> Ah, be a little creative... 
> ... goes on to suggest many statistical tests for sets of dates ...
> Ordinal data:
> Histogram information:  Which is the most common value?
> Lots of tests: 
> Kolmogorov-Smirnov one-sample test
> One-sample run test
> ... lists more tests ...
>
> Categorical data: 
> Histogram information.
> Tests:
> Binomial test.
> Chi^2 goodness-of-fit test.
> ... lists more tests ....

This gets into an issue that has concerned me.  Namely, which
operations should be part of the container and which should stand
alone?

The DataSet has a huge number of member functions, since I've stuffed
all sorts of descriptive statistics into it, from the mundane to the
obscure.  But I don't think that every operation that takes a set of
doubles and gives back a double should be represented by a member
function in DataSet. 

My rule of thumb to date has been that:
* "Basic" operations & descriptive statistics can be member functions.
* More complex or obscure operations should be implemented by external
  functions and classes.

Now I haven't followed this rule very well, as I've put some obscure
stuff into DataSet, and I've also put some very basic stuff outside of
the class.  For example, the normal dist. estimate of the mean is
stuck in normal.{h,cpp}, and that is about as basic and fundamental as
it gets... but it introduces a parametric assumption, and I used that
as an excuse to keep it out of the DataSet.

We just can't let DataSet (and the future DateSet, CategoricalSet,
OrdinalSet, etc.) keep growing and growing.  If I've erred on the side
of inclusion with the DataSet, perhaps we should start out by erring
on the side of exclusion.
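
As a rough sketch of that split (the class layout, the accessor names,
and the choice of skewness as the "obscure" statistic are illustrative
assumptions here, not actual Goose code), a basic statistic stays a
member while a less common one becomes a free function that uses only
the public interface:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for DataSet: only "basic" operations are members.
class DataSet {
public:
  void add(double x) { data_.push_back(x); }
  std::size_t size() const { return data_.size(); }
  double data(std::size_t i) const {
    assert(i < data_.size());  // element access must be in range
    return data_[i];
  }
  // Basic descriptive statistic: kept as a member function.
  double mean() const {
    double sum = 0.0;
    for (double x : data_) sum += x;
    return data_.empty() ? 0.0 : sum / data_.size();
  }
private:
  std::vector<double> data_;
};

// More obscure statistic: a stand-alone function built on the public
// interface. Computes the population skewness m3 / m2^(3/2); returns 0
// for empty or constant data.
double skewness(const DataSet& ds) {
  const std::size_t n = ds.size();
  if (n == 0) return 0.0;
  const double m = ds.mean();
  double m2 = 0.0, m3 = 0.0;
  for (std::size_t i = 0; i < n; ++i) {
    const double d = ds.data(i) - m;
    m2 += d * d;
    m3 += d * d * d;
  }
  m2 /= n;
  m3 /= n;
  if (m2 == 0.0) return 0.0;
  return m3 / std::pow(m2, 1.5);
}
```

Adding another statistic this way leaves the class definition
untouched, which is the point of erring on the side of exclusion.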

Now on the polymorphism issue, I've changed my mind (slightly) on one
point.  I think that it would make sense for us to have all of our
various containers derive from some base class... a GooseSet, if you
will.  But the inheritance should be trivial, and the GooseSet should
only offer a strictly minimalistic set of features.  Maybe even
something as simple as:

class GooseSet {
public:
  virtual ~GooseSet();
  size_t size() const;
  const string& label() const;
  void set_label(const string& label);
};

class DataSet : public GooseSet {
...
};

class DateSet : public GooseSet {
...
};

In fact, the GooseSet interface would be so small as to be almost
useless.  The only reason it would be around would be to allow for
heterogeneous "containers of containers", things like
vector<GooseSet*>.
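
To make the "containers of containers" idea concrete, here is a
minimal sketch.  It assumes size() is made pure virtual so that each
derived container can report its own element count, and the storage
members, add() methods, and the total_size() helper are hypothetical,
invented only for illustration:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Deliberately tiny base class: a size and a label, nothing else.
class GooseSet {
public:
  virtual ~GooseSet() = default;
  virtual std::size_t size() const = 0;  // assumption: virtual, per container
  const std::string& label() const { return label_; }
  void set_label(const std::string& label) { label_ = label; }
private:
  std::string label_;
};

// Hypothetical numeric container.
class DataSet : public GooseSet {
public:
  void add(double x) { values_.push_back(x); }
  std::size_t size() const override { return values_.size(); }
private:
  std::vector<double> values_;
};

// Hypothetical date container; dates stored as day counts for simplicity.
class DateSet : public GooseSet {
public:
  void add(long day_number) { days_.push_back(day_number); }
  std::size_t size() const override { return days_.size(); }
private:
  std::vector<long> days_;
};

// Works on any mix of containers, using only the base-class interface.
std::size_t total_size(const std::vector<GooseSet*>& sets) {
  std::size_t n = 0;
  for (const GooseSet* s : sets) {
    assert(s != nullptr);  // the sketch does not handle null entries
    n += s->size();
  }
  return n;
}
```

Anything type-specific (means, date ranges, and so on) would still
require going back to the concrete class; the base class only lets
unrelated containers sit in the same vector.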

(BTW, DateSet is just too typo-inducingly close to DataSet.  What else
could we call it?)

So between this and the Type/Value system, we'd have both
container-level and element-level polymorphism... which should be
enough for almost any purposes.

> (When I reread this I realize that I'm strongly biased.  Sorry.  To try to
> remedy this, I present a small competition:  If you find some potential
> disadvantages of the distributed approach, please air them, and you'll win a
> virtual beer that can be cashed in if you come to Denmark.)

Oh, I can find potential disadvantages with anything, particularly to
collect beer (virtual or otherwise)... and you didn't say that they
had to be *significant* or *meaningful* disadvantages. :-)

But all kidding aside, I really think that the "distributed" approach
is the way to go.  Let's hammer out some code and see how it looks
to us then...


