Goose Design Problem & a Proposed Solution (long)



(Forgive me if this is a bit incoherent, I've been feverishly hacking
for the last several hours, and I'm a bit tired as I write this...)

I've had very little time to devote to Goose recently, but I've been thinking
a lot about it and I've come to the conclusion that some of the design
needs some tweaking.

My concern is that we've gone too far in "OO-ifying" the interface.  In
particular, I think that there is too much functionality in the
DataSet base class, and that it is introducing awkwardness in my
efforts to extend the paradigm and build new derived classes.  I'm
especially unhappy with my (never committed) efforts to
squeeze categorical data sets into this class hierarchy.

The problem is that our DataSet objects are basically very complex containers
with a lot of special functionality.  In particular, they can contain lots
of different types --- doubles, enums/ints, and strings being the main ones
right now, but more types are to follow.

This means that the set of "shared functions" that can live in the
base class is really quite small.  Every DataSet can carry around a
string which we designate the label().  Every DataSet is assumed to be
some sort of linear array/vector thing, so it has a size() and can be
permuted and reversed.  But you can't necessarily sort() it, since its
elements might not have a natural ordering.
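
As a rough illustration of how small that shared interface is, here is a
minimal sketch in modern C++.  It is not the actual Goose code: the
member names set_label(), at() and the RealSet layout are assumptions
made up for the example, and permuting is left out for brevity.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical base class: only operations that make sense for every
// kind of set live here.  Note what is missing: sort(), add(), lookup.
class DataSet {
public:
  virtual ~DataSet() = default;

  const std::string& label() const { return label_; }
  void set_label(std::string s) { label_ = std::move(s); }

  virtual std::size_t size() const = 0;
  virtual void reverse() = 0;

private:
  std::string label_;
};

// Hypothetical derived class for real-valued data.  The element type is
// known here, so add(), at() and sort() can be ordinary typed members.
class RealSet : public DataSet {
public:
  void add(double x) { data_.push_back(x); }
  double at(std::size_t i) const { return data_.at(i); }

  std::size_t size() const override { return data_.size(); }
  void reverse() override { std::reverse(data_.begin(), data_.end()); }

  // Doubles have a natural ordering, so sorting belongs to this class.
  void sort() { std::sort(data_.begin(), data_.end()); }

private:
  std::vector<double> data_;
};
```

Code holding only a DataSet reference can ask for label() and size() or
call reverse(), and nothing more; anything type-specific needs the
derived class.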

Worst of all, the derived DataSets are containers for very different
types, so you can't really have functions in the base class for adding
a new element to the set, or inspecting item number 37, etc.  Thus
the add() and lookup-type actions you can perform depend on the type
of the container.  So for certain types of problems, you find yourself
doing something like this:

  if (my_set->is_a_realset())
    // do one thing
  else if (my_set->is_a_categoricalset())
    // do another thing
  else if ...

This kind of "type test & switch statement" logic is condemned on page
one of every book on OO programming.  So the solution that we've
gravitated to is to introduce the Value class, which lets us simulate
typeless-object semantics in C++.  Then we can have functions like
virtual void add(Value), virtual Value lookup(index), etc. in the
base class.

But all that we've really done is added an extra level of indirection,
and pushed the type-test-and-switch down to another class... because
adding the wrong Value-type to the wrong DataSet-type will produce
some sort of error, probably a thrown exception.  So the programmer
ends up doing something like:

  if (my_value->is_a_real())
    // do one thing
  else if (my_value->is_a_category())
    // do another thing
  else if ...

The same "impure" code... but we've now also heaped a bunch of extra
complexity onto the problem.  This is bad.
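
To show where the type test ends up, here is a stripped-down sketch of
the Value approach.  It is a stand-in, not the real Value or DataSet
code: the two-type Value, as_real(), and the choice of
std::runtime_error are all assumptions made for the example.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for Value: a tagged value that holds either a
// double or a string.
class Value {
public:
  explicit Value(double d) : is_real_(true), real_(d) {}
  explicit Value(std::string s)
      : is_real_(false), real_(0.0), text_(std::move(s)) {}

  bool is_a_real() const { return is_real_; }

  // Throws if the Value does not actually hold a real number.
  double as_real() const {
    if (!is_real_) throw std::runtime_error("Value is not a real");
    return real_;
  }

private:
  bool is_real_;
  double real_;
  std::string text_;
};

// The base class can now declare add(), but only in terms of Value.
class DataSet {
public:
  virtual ~DataSet() = default;
  virtual void add(const Value& v) = 0;
  virtual std::size_t size() const = 0;
};

class RealSet : public DataSet {
public:
  // The type check has not gone away; it now happens at run time,
  // inside as_real(), and surfaces as an exception.
  void add(const Value& v) override { data_.push_back(v.as_real()); }
  std::size_t size() const override { return data_.size(); }

private:
  std::vector<double> data_;
};
```

A caller that wants to avoid the exception has to check is_a_real()
before calling add(), which is the same test-and-switch as before.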

It seems to me that the problem is simply this:
The derived classes of DataSet, in many cases, have only a very
minimal set of properties in common with one another.  We are trying
to "force" sets of #s, sets of categories, sets of strings, etc., to
all be very like each other.  But they aren't... both in their
implementation details and in how they are used in the narrow realm
that these containers are designed for: statistical programming.

The origin of all of this type-system business really came from a much
simpler problem: we needed a way to map container entries to string
representations of those entries, as well as the other way around.

So I propose we do the following:
(1) Rip out all of the Value stuff and the DataType stuff.
(2) Design a simple class to encapsulate stuff->string and
    string->stuff conversions.
(3) Mix our converter class into our DataSets in some nice way.

This should give us simpler code that is free of the excessive
generality that is currently bogging us down.  (Actually, we don't
have excessive generality --- we have the *wrong* generality, which
doesn't suit other parts of the problem domain well.)
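
One possible shape for (2) and (3), as a rough sketch only.  The names
StreamConverter, TypedSet, add_string() and as_string() are invented for
this example, and mixing the converter in as a template parameter is
just one of several ways it could be done.

```cpp
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical converter: stuff->string and string->stuff using the
// standard stream operators for T.
template <typename T>
struct StreamConverter {
  static std::string to_string(const T& x) {
    std::ostringstream out;
    out << x;
    return out.str();
  }

  static T from_string(const std::string& s) {
    std::istringstream in(s);
    T x;
    if (!(in >> x)) throw std::invalid_argument("cannot parse: " + s);
    return x;
  }
};

// Hypothetical typed container with the converter mixed in as a
// template parameter.  No Value class and no run-time type tags.
template <typename T, typename Converter = StreamConverter<T> >
class TypedSet {
public:
  void add(const T& x) { data_.push_back(x); }
  const T& at(std::size_t i) const { return data_.at(i); }
  std::size_t size() const { return data_.size(); }

  // String-based entry points, e.g. for an importer reading text.
  void add_string(const std::string& s) {
    data_.push_back(Converter::from_string(s));
  }
  std::string as_string(std::size_t i) const {
    return Converter::to_string(data_.at(i));
  }

private:
  std::vector<T> data_;
};
```

With something like this, a TypedSet<double> takes doubles directly via
add() and text via add_string(), and a categorical set could supply its
own converter that maps category names to integer codes.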

Now the big problem with this is that it will break Asger's data
importing stuff.  I hate doing this, but it seems unavoidable.  The
good news is that I'm sure that the importer code will bounce back
better than ever once we get our foundations in order...

I'll try to work on these changes in the next few days.  I've got a
very big commit pending here already (I've added a ton of special
functions stuff, moving all that code into a separate shared library),
so I might as well just make it bigger.

-JT



 



