Re: [guppi-list] goose datatype system



Asger was also bitten by the "Reply-to" bug and replied only to me.
Anyway, here is his part of the conversation (which preceded my
response, which was posted a little while ago...)


> I'm confused about what direction the Value/DataType stuff is going
> in... probably because I have a murky idea of how it should be
> done, and I'm no doubt not in sync with your much clearer idea.

My idea is not much clearer than your idea, because your idea is
crystal clear:  Use the DataSet as a universal container, convert
the other types to a double in some appropriate way, and thus
exploit the existing DataSet to get statistics for all the other
types.  For the sake of the discussion, let's call this approach the
"double" approach:  The responsibility of doing the statistics
is placed on the double container's shoulders.
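
To make the "double" approach concrete, here is a rough sketch (the
method names are invented just for illustration; only the DataSet and
DataType names are from our actual code):

    #include <cstddef>
    #include <string>
    #include <vector>

    // All statistics live in the one universal container; a DataType
    // only converts between human-readable form and doubles.
    class DataType {
    public:
        virtual ~DataType() {}
        virtual double to_double(const std::string & human) const = 0;
        virtual std::string to_string(double d) const = 0;
    };

    class DataSet {
        std::vector<double> data_;
    public:
        void add(double d) { data_.push_back(d); }
        // Written once, reused by every type that converts to double:
        double mean() const {
            double sum = 0;
            for (std::size_t i = 0; i < data_.size(); ++i)
                sum += data_[i];
            return data_.empty() ? 0 : sum / data_.size();
        }
        // ... min(), max(), sdev(), etc. in the same style.
    };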

My idea is still evolving, and thus not crystal clear, but stated briefly,
it's something like this:
Provide separate containers for each datatype, but make them compatible
in the sense that they conform to a non-existent abstract base class
(similar to the "concepts" in the SGI STL.  You might know these
as an "interface" or a "schema", but for the discussion, it's appropriate
to think of it as an abstract base class, without the runtime overhead
of such a thing.)  The idea behind this is that different types need
different statistics, but at the same time, there are similarities between
the operations you want to perform on them, so some amount of sharing
is appropriate.  For the sake of the discussion, let's call this approach
the "distributed" approach:  The responsibility of doing the statistics
is distributed out to each individual container.
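
A rough sketch of what I mean (the names are invented): each container
is a separate class, and the "abstract base class" exists only on paper,
as a set of members every container agrees to provide, so templates can
use any of them without virtual-call overhead:

    #include <cstddef>
    #include <map>
    #include <vector>

    // The implicit "concept": value_type, size(), operator[].
    class DoubleSet {
        std::vector<double> v_;
    public:
        typedef double value_type;
        std::size_t size() const { return v_.size(); }
        double operator[](std::size_t i) const { return v_[i]; }
        void add(double d) { v_.push_back(d); }
        // Statistics natural for doubles:
        double mean() const {
            double sum = 0;
            for (std::size_t i = 0; i < v_.size(); ++i) sum += v_[i];
            return v_.empty() ? 0 : sum / v_.size();
        }
    };

    class CategorySet {
        std::vector<int> codes_;       // each code names a category
    public:
        typedef int value_type;
        std::size_t size() const { return codes_.size(); }
        int operator[](std::size_t i) const { return codes_[i]; }
        void add(int c) { codes_.push_back(c); }
        // Statistics natural for categories (no mean() here):
        int mode() const {
            std::map<int, std::size_t> freq;
            for (std::size_t i = 0; i < codes_.size(); ++i)
                ++freq[codes_[i]];
            int best = 0;
            std::size_t best_n = 0;
            for (std::map<int, std::size_t>::const_iterator it = freq.begin();
                 it != freq.end(); ++it)
                if (it->second > best_n) {
                    best = it->first;
                    best_n = it->second;
                }
            return best;
        }
    };

    // Shared operations become templates over the concept:
    template <class Container>
    std::size_t count(const Container & c) { return c.size(); }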

So, there are two discussions involved:

1) Is the double data-type universal enough to perform/simulate any
statistical routine we are interested in for the other types?

2) What technical merits does each approach have?

Regarding 1):

Taken literally, this is a theoretical question which I don't have
the answer to, but for the discussion at hand, let's assume that both
approaches are similar in strength, regarding what statistical things
they can do.  I.e. let's just say that the answer to this question is 
"yes".  (I don't have any examples that show the contrary.)

So, the discussion is basically reduced to a technical discussion instead:

Which approach is the best to solve the problem at hand?

I don't have the answer at this point.

The run-time considerations involved are efficiency concerns regarding
time and space.
At this level, both methods are arguably similar for the majority of
operations, except for constant factors.
The double approach requires converting each element to a double,
which the distributed approach can avoid; but this conversion can be
done fast (i.e. in constant time), at least for the most relevant
data-types, so the difference is only a constant factor.

The distributed approach has the potential to customize any given
calculation to the exact data-type involved, at the cost of code complexity
and size.  Arguably, such detailed fine-tuning is never needed.

The discussion about space concerns is similar:  The double approach is
space-efficient, while a distributed approach can be made equally efficient
at the cost of increased code complexity.

The primary benefit of the double approach is that we can reuse the
existing statistical routines for all other types, just by defining
a conversion from an instance of a type to a double, and arguably
back again.

So the coding involved is limited.

However, there is a hidden cost involved.  Take strings, for instance:
we will never be able to perform the conversion correctly without some
other data structure which can serve as a mapping.
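
For example (a sketch, with invented names), any honest string-to-double
conversion has to drag a table like this around next to the DataSet:

    #include <cstddef>
    #include <map>
    #include <string>
    #include <vector>

    // The double alone cannot recover the string; this side table can.
    class StringCodec {
        std::map<std::string, double> to_code_;
        std::vector<std::string> from_code_;
    public:
        double encode(const std::string & s) {
            std::map<std::string, double>::const_iterator it = to_code_.find(s);
            if (it != to_code_.end())
                return it->second;
            double code = from_code_.size();   // next unused code
            to_code_[s] = code;
            from_code_.push_back(s);
            return code;
        }
        const std::string & decode(double code) const {
            return from_code_[static_cast<std::size_t>(code)];
        }
    };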

The distributed approach has the benefit that we can define exactly
the statistical routines we need, and do it in a natural way, because
we are working in the domain of the problem, rather than an artificial
world.

I'm running out of time for this discussion, but my initial conclusion
regarding which approach is best, is this:

It depends on which (statistical) operations are needed for the other types.

If what is needed is mostly similar to what already exists in the DataSet,
the double approach is best.
If it is mostly different, and requires the addition of some "unnatural"
methods to the DataSet, the distributed approach is best.

Luckily, we don't need to choose exclusively:  We can have both at
the same time, by incorporating the double approach in distributed
clothes.

By defining the double-conversions in the DataType class, we automatically
open up for using the double approach, and alternatively for a variant
of it in the distributed setting:  Let each specific container determine
by itself which approach is best.  If it wants to, it can use the DataSet
to do all the dirty work, and just serve as a dispatcher.  If not, it can
do the dirty work by itself.  And any hybrid thereof is possible.
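
As a sketch of the hybrid (invented method names again, with DataSet as
in the earlier sketch): a date container can forward the numerical work
and keep anything date-specific to itself:

    #include <cstddef>
    #include <vector>

    class DataSet {                        // as in the earlier sketch
        std::vector<double> data_;
    public:
        void add(double d) { data_.push_back(d); }
        double mean() const {
            double sum = 0;
            for (std::size_t i = 0; i < data_.size(); ++i)
                sum += data_[i];
            return data_.empty() ? 0 : sum / data_.size();
        }
    };

    class DateSet {
        DataSet numbers_;                  // days since some epoch
    public:
        void add(double days_since_epoch) { numbers_.add(days_since_epoch); }
        // Dispatcher: anything that reduces to double arithmetic is
        // simply forwarded to the DataSet ...
        double mean_date() const { return numbers_.mean(); }
        // ... while date-specific operations could be done right here.
    };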

Also, it might turn out that some of the routines for time and date are
so similar that it makes sense to let the time-container use the date-
container as a slave.

> My thinking is that the DataSet should be the "Universal Container"
> for statistical data.  Different types of data include:
> 
> (1) Numerical data 
> (2) Time/Date data
> (3) Categorical Data (strings)

This is what I call "enumerations".  Essentially, I have to support two
different kinds:  Ordered and unordered categorical data.
In addition, I have to support string containers, but I do not have
to implement any relevant statistical operations on them.
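
The ordered/unordered difference matters for the statistics (a small
sketch, invented names): ordered categories admit order-based statistics
such as the median, unordered ones only frequency-based ones:

    #include <algorithm>
    #include <vector>

    // Median of *ordered* categorical codes ("low" < "medium" < "high"):
    // sort the codes and take the middle one.  Assumes a non-empty input.
    int ordered_median(std::vector<int> codes)
    {
        std::sort(codes.begin(), codes.end());
        return codes[codes.size() / 2];
    }
    // For *unordered* categories ("red", "green", "blue") the sort is
    // meaningless, so only things like the mode apply.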

> (4) Strange numerical data (i.e. circular data)
> 
> My proposed solution is that we always convert everything to a double
> and store it in the DataSet.  The DataType class would be the vehicle
> for converting elements from human readable format to doubles and back
> again.

This explains that previous set-up you had.  I nuked this interface
because I hadn't realized what you were up to :-)

This interface should arguably go back in, because it does make sense.

> For example, (2): Dates and times are often stored internally as the
> number of days since a certain day, number of seconds since a certain
> time, etc.  We would just store these integer quantities in doubles
> instead of integers (which is maybe a bit wasteful, but otherwise
> fine).  The DataSet class would handle all of the conversions from,
> for example, MM/DD/YYYY format to a number, and back again.  Thus a
> plotting program that was plotting X vs. Y, where X and Y are two
> DataSets, would pass the data through some DataType functions before
> using it to label axes, identify the coordinates of specific points,
> etc.

The insight that emerges is that each container should provide
vector<double> access.
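
I.e. something like this (a sketch, names invented):

    #include <vector>

    class TimeSet {
        std::vector<double> seconds_;      // seconds since some epoch
    public:
        void add(double s) { seconds_.push_back(s); }
        // Every container exposes its data as doubles, so generic
        // plotting and statistics code can consume any of them:
        const std::vector<double> & as_doubles() const { return seconds_; }
    };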


> Well, those are my random musings for now.  How does that fit in with
> your (Asger's) ideas?  Or, for that matter, does anyone else on the
> list have comments?

Earlier in my own design considerations, I dismissed the double approach
because I was too hasty.
Now I see that it has some very nice properties, and at this point,
I think we might be best off combining the two approaches (the additional
code is limited, and arguably, we can even save code).

But this needs more thought and discussion.

Greets,

Asger



