Re: [guppi-list] goose datatype system



> The DataSet defines a lot of operations that are really not defined
> for other data types.  After all, it is completely bogus to calculate
> the skewness of a set of unordered categorical data.

Yes, this is a valid concern.
To solve this, it's logical to define a class for each container.  This is what
I had in mind with the hybrid approach that I talked a bit about.
Each container will present an interface, where some of the methods are shared
by all containers (thus we simulate polymorphism). 
The raw calculations can be done in any way the coder sees fit.  More on this
further below.

> For dates, what should you be able to do (besides access individual
> data elements): find the min and max date.  That is the only
> meaningful operation that I can think of off the top of my head.

Ah, be a little creative:  Think of the dates as being collected from the log
of a web server.  Then it makes sense to ask lots of questions:  Which weekdays
are the busiest?  Is there a bottleneck around Christmas?

If the data are birth dates, we can ask for seasonal variations and
statistical "outliers", i.e. find the days when the power had failed ;-)
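
The weekday question can be sketched in a few lines.  This is only an
illustration, not Goose code: the Date struct and both function names are
hypothetical, and the weekday is computed with Sakamoto's method for the
Gregorian calendar.

```cpp
#include <array>
#include <cassert>
#include <vector>

// Hypothetical plain date type; not part of Goose.
struct Date {
    int year;   // e.g. 1998
    int month;  // 1..12
    int day;    // 1..31
};

// Day of the week for a Gregorian date: 0 = Sunday ... 6 = Saturday.
// Uses Sakamoto's method.
int weekday(const Date& d) {
    assert(d.month >= 1 && d.month <= 12);
    static const int offset[12] = {0, 3, 2, 5, 0, 3, 5, 1, 4, 6, 2, 4};
    int y = d.year;
    if (d.month < 3) y -= 1;  // January and February count with the previous year
    return (y + y / 4 - y / 100 + y / 400 + offset[d.month - 1] + d.day) % 7;
}

// Count how many dates fall on each weekday (index 0 = Sunday).
std::array<int, 7> weekday_histogram(const std::vector<Date>& dates) {
    std::array<int, 7> counts{};  // value-initialized to zero
    for (const Date& d : dates) ++counts[weekday(d)];
    return counts;
}
```

Feeding the request dates of a server log to weekday_histogram gives the
"busiest weekday" directly as the index of the largest count.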

> Ordered categorical data: first and last category, and an easy way to
> get a list of all categories.

(Is "ordered categorical data" the same as "ordinal" data?  I'm only sure of
the Danish terms for these things.)

Histogram information:  Which is the most common value?
Lots of tests: 
Kolmogorov-Smirnov one-sample test
One-sample run test
Change points test
Sign test
Wilcoxon signed ranks test
etc.

> Unordered categorical data: nothing, except for the list of
> categories.

(Is "unordered categorical data" the same as "nominal" data?)

Histogram information.
Tests:
Binomial test.
Chi^2 goodness-of-fit test.
McNemar change test
Fisher exact test for 2x2 tables
etc.
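
The histogram information mentioned for both ordinal and nominal data needs
nothing more than equality between categories.  A minimal sketch, assuming
the categories are stored as strings (the function names are hypothetical and
not taken from Goose):

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Count the occurrences of each category.
std::map<std::string, int> histogram(const std::vector<std::string>& data) {
    std::map<std::string, int> counts;
    for (const std::string& x : data) ++counts[x];
    return counts;
}

// Most common category.  Ties go to the lexicographically smallest label,
// because std::map iterates in key order and only a strictly larger count
// replaces the current best.
std::string most_common(const std::vector<std::string>& data) {
    assert(!data.empty());
    const std::map<std::string, int> counts = histogram(data);
    std::string best;
    int best_count = 0;
    for (const auto& entry : counts) {
        if (entry.second > best_count) {
            best = entry.first;
            best_count = entry.second;
        }
    }
    return best;
}
```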

> Another thing to think about is that the "double" approach gives us a
> sort of polymorphism for free, but it might actually be inappropriate.
> For example, most statistical operations are not flexible in this
> regard: either a test requires categorical data, or it doesn't.  We
> don't gain anything by having everything be a DataSet (except for
> avoid the need to write a lot of code), 

Agree.

> just as we don't gain anything
> by defining a DataSetBase virtual class and using it as the base class
> for everything, and then defining our functions to take DataSetBase
> pointers as args...

I'm not sure what you mean by this.

What we do learn is that the problem of polymorphism is not easily solved in
C++.  Still, we have to solve it, and do it in the best way we can think of.

Using templates seems to be the cleanest way in theory, but unfortunately in
practice it's troublesome because of buggy compilers.  The AsciiImport
experiment proved this.  So personally I've abandoned the idea of basing the
solution on templates.  Templates are useful for small corners, but it's too
risky to base the kernel on them at this point in compiler technology.

The other alternatives we have considered so far are these:

"The class hierarchy"
We can use an abstract DataSetBase class, and build a class hierarchy.
This imposes a run-time penalty of an extra dereference for most operations.
On the pro side, this approach opens the possibility of a clean semantic
separation, where we can control exactly which methods are valid, and where the
implementation can benefit from similarities between instances by using
inheritance.
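
A minimal sketch of what such a hierarchy could look like.  Only the name
DataSetBase comes from the discussion; RealSet, NominalSet, total_size and all
method names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Shared interface: only operations valid for every data type live here.
class DataSetBase {
public:
    virtual ~DataSetBase() {}
    virtual std::size_t size() const = 0;
};

// Hypothetical container for real-valued data; numeric statistics are allowed.
class RealSet : public DataSetBase {
public:
    void add(double x) { data_.push_back(x); }
    std::size_t size() const override { return data_.size(); }
    double mean() const {
        assert(!data_.empty());
        double sum = 0.0;
        for (double x : data_) sum += x;
        return sum / static_cast<double>(data_.size());
    }
private:
    std::vector<double> data_;
};

// Hypothetical container for unordered categories: it has no mean() and no
// skewness(), so calling them is a compile-time error.
class NominalSet : public DataSetBase {
public:
    void add(const std::string& x) { data_.push_back(x); }
    std::size_t size() const override { return data_.size(); }
    std::set<std::string> categories() const {
        return std::set<std::string>(data_.begin(), data_.end());
    }
private:
    std::vector<std::string> data_;
};

// Code that needs only the shared interface goes through the base pointer,
// which is where the extra dereference (a virtual call) is paid.
std::size_t total_size(const std::vector<const DataSetBase*>& sets) {
    std::size_t n = 0;
    for (const DataSetBase* s : sets) n += s->size();
    return n;
}
```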

"The double approach"
Use the DataSet as the primary container.
Next to the DataSet, we have supporting classes that help convert types to and
from the double type.
The advantage is that this is very easy to implement.
The main disadvantage is that there is no clean semantic separation.  In the
extreme, we risk implementing strange methods for doubles that only apply to
dates (see the examples mentioned for dates above.)
Another minor disadvantage is that we limit ourselves to one-dimensional data.
I.e. two dimensional datasets have to be handled in special ways (there are
provisions for doing this with the Permutation class, but it's only intended to
be useful for a simulation of one-dimensional tuples, not for instance a
graphical image.  This is only a minor disadvantage, because multi-dimensional
data are arguably beyond the scope of Goose.)
Yet another disadvantage is that handling missing data becomes very difficult.
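
To make the lack of semantic separation concrete, here is a minimal sketch of
a supporting class that converts category labels to and from doubles.  The
class CategoryCodec and the free function mean are hypothetical stand-ins, not
the actual Goose classes.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical helper that converts category labels to and from doubles.
class CategoryCodec {
public:
    // Returns the code for a label; a new label gets the next integer code.
    double encode(const std::string& label) {
        std::map<std::string, double>::const_iterator it = codes_.find(label);
        if (it != codes_.end()) return it->second;
        const double code = static_cast<double>(labels_.size());
        codes_[label] = code;
        labels_.push_back(label);
        return code;
    }
    // Inverse mapping; the code must have been produced by encode().
    const std::string& decode(double code) const {
        assert(code >= 0.0);
        const std::size_t i = static_cast<std::size_t>(code);
        assert(i < labels_.size());
        return labels_[i];
    }
private:
    std::map<std::string, double> codes_;
    std::vector<std::string> labels_;
};

// Stand-in for a numeric operation offered by the double-based container.
double mean(const std::vector<double>& xs) {
    assert(!xs.empty());
    double sum = 0.0;
    for (double x : xs) sum += x;
    return sum / static_cast<double>(xs.size());
}
```

Once the labels are encoded, nothing stops a caller from asking for the mean
of the codes, although the result depends only on the arbitrary order in which
the labels were first seen.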

"The distributed approach"
We define, on paper only, a DataSetConcept that all containers have to conform
to.  This "schema" is similar to the DataSetBase class in the class hierarchy
approach, except that we don't inherit from it, and the polymorphism is moved
out of it.
The polymorphism aspect of things is instead put into a Type/Value class
hierarchy (for which I present a prototype in the CVS code), and this provides
a way for us to be polymorphic without having to resort to inheritance.  I.e.
the implementors of the DataSetConcept provide the methods in the
DataSetConcept, and extend the interface to provide the statistical operations
that make sense on the type they represent.
Notice that this Type/Value system is only meant for communication of single
elements.  In particular, the Ascii Import engine will insert values one at a
time, and with interactive editing, we have to support extracting single
values and inserting new ones in a polymorphic way.  Each container will
probably not represent the data as a vector<Value>, but rather choose whatever
representation is appropriate for the data at hand.
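
A rough sketch of how this could fit together.  It is not the prototype from
the CVS code: to stay short it uses a single tagged Value class instead of a
Type/Value class hierarchy, and every name in it (Value, RealContainer,
NominalContainer, append, get, copy_element) is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical carrier for one element; used only at the container boundary.
class Value {
public:
    enum Kind { Real, Text };
    static Value real(double x) { Value v; v.kind_ = Real; v.real_ = x; return v; }
    static Value text(const std::string& s) { Value v; v.kind_ = Text; v.text_ = s; return v; }
    Kind kind() const { return kind_; }
    double as_real() const { assert(kind_ == Real); return real_; }
    const std::string& as_text() const { assert(kind_ == Text); return text_; }
private:
    Value() : kind_(Real), real_(0.0) {}
    Kind kind_;
    double real_;
    std::string text_;
};

// Both containers follow the same paper-only concept (size, append, get)
// without sharing a base class.  Each keeps its own native representation.
class RealContainer {
public:
    std::size_t size() const { return data_.size(); }
    void append(const Value& v) { data_.push_back(v.as_real()); }
    Value get(std::size_t i) const { return Value::real(data_.at(i)); }
    // Extension beyond the concept: only meaningful for real data.
    double mean() const {
        assert(!data_.empty());
        double sum = 0.0;
        for (double x : data_) sum += x;
        return sum / static_cast<double>(data_.size());
    }
private:
    std::vector<double> data_;
};

class NominalContainer {
public:
    std::size_t size() const { return data_.size(); }
    void append(const Value& v) { data_.push_back(v.as_text()); }
    Value get(std::size_t i) const { return Value::text(data_.at(i)); }
private:
    std::vector<std::string> data_;
};

// Small helper written against the concept only; it compiles for any
// container that provides get() and append().
template <class Container>
void copy_element(const Container& from, std::size_t i, Container& to) {
    to.append(from.get(i));
}
```

The template at the end is only there to exercise the shared interface.  As
noted, the compiler checks conformance only where such code is instantiated,
not when a container is declared.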

The advantage of this approach is similar to the class hierarchy approach: The
semantic separation between types is clean, except that it's a bit looser:  The
compiler will not be able to warn us if we do not conform to the
DataSetConcept.  (I don't think this will be a problem in practice.)

Another advantage, which this design shares with the class hierarchy approach,
is that it opens up for using the double approach where it is advantageous:
Any given container can choose to base its implementation on the DataSet class,
and thus serve as a simple dispatcher to that.
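
As an illustration of such a dispatcher, here is a sketch of a date container
that stores day numbers as doubles and forwards to a double-based class.  The
DataSet below is a tiny stand-in written for this example, not the real Goose
class, and DateSet with its methods is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for the double-based DataSet; the real class has its own API.
class DataSet {
public:
    void add(double x) { data_.push_back(x); }
    std::size_t size() const { return data_.size(); }
    double min() const {
        assert(!data_.empty());
        return *std::min_element(data_.begin(), data_.end());
    }
    double max() const {
        assert(!data_.empty());
        return *std::max_element(data_.begin(), data_.end());
    }
private:
    std::vector<double> data_;
};

// Hypothetical date container: dates are kept as day numbers in a DataSet,
// and only the operations that make sense for dates are exposed.
class DateSet {
public:
    void add(long day_number) { days_.add(static_cast<double>(day_number)); }
    std::size_t size() const { return days_.size(); }
    long earliest() const { return static_cast<long>(days_.min()); }
    long latest() const { return static_cast<long>(days_.max()); }
private:
    DataSet days_;
};
```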

Yet another advantage is that this provides the possibility of fine-tuning all
operations regarding space/time, if the need arises.

A disadvantage of this approach is that it requires some extra coding in every
container to make it conform to the DataSetConcept, and also the Type/Value
system has to be built (i.e. refined and finished).

The main disadvantage is that the design is not finished yet, and thus it's not
certain that it will work in practice.  (I see no reason it shouldn't, but
never say never.)

(When I reread this I realize that I'm strongly biased.  Sorry.  To try to
remedy this, I present a small competition:  If you find some potential
disadvantages of the distributed approach, please air them, and you'll win a
virtual beer that can be cashed in if you come to Denmark.)

Greets,

Asger


