Re: [guppi-list] Re: Patches for Goose



On Wed, 31 Mar 1999, Jon Trowbridge wrote:

> On Tue, Mar 30, 1999 at 08:39:40PM -0500, Bradford Hovinen wrote:
> > 
> > Greetings!
> > 
> > I'm submitting to the list several additions to Goose that I hope you will 
> > find useful. 
> 
> Excellent!
> 
> > 
> > First, I looked through the source code for a while for statistical tests,
> > and while my knowledge of statistics hardly qualifies me as an expert, I
> > was somewhat hard-pressed to find anything that performed the tests
> > for sample proportions and sample means.
> 
> You didn't find them because they weren't there.  In Goose, we do the
> obscure first... the commonplace comes later. :-)

I suppose that since I am only a very elementary student of statistics,
I'm a pretty good person to take care of the commonplace :-)

> > `hypo-test.patch.gz' implements several common tests, including 1- and 2-
> > proportion z-tests, 1- and 2- sample t-tests, a 1-sample z-test, and a
> > paired t-test.
> 
> I actually had many of these written already, but hadn't checked them
> in.  I'll go through your patch and extract what I can from it.

Good. What do you think of the overall architecture I chose?

> > It also has a rather rudimentary chi square test based
> > on CategoricalSet, but it is currently #ifdef'ed out since CategoricalSet
> > is apparently not functioning.
> 
> Yeah, I still haven't decided how to best represent categorical data.
> What is in there now is a (fairly broken) early rough draft.  The
> question of how to do this right needs to be addressed eventually.

Based on what I've studied in the realm of chi square tests, the test
itself should have support for general categories, N-way tables, and
discrete data (or continuous data divided into intervals) that can be
modelled with some kind of function, either continuous or discrete. Other
uses of categorical data are mostly descriptive, dealing with segmented
bar graphs and so forth, so that's pretty easy to model with a simple
associative structure like the C++ map object.

If you agree that the categorical set and the chi square test should be
closely related, we might want to organize the former so that the latter
can be done with the greatest facility. Perhaps a base class representing
any kind of categorical data could be defined, and each of the above cases
could inherit this base class. Some kind of iterator should allow the chi
square test to go through the categorical data in a manner independent of
the specific type of data to calculate the chi square statistic.
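
To make the idea concrete, here is a minimal sketch of what such an
arrangement could look like. None of these names exist in Goose:
CategoricalData, OneWayTable and chi_square_statistic are hypothetical,
and the sketch assumes the expected counts are supplied by the caller
rather than derived from a model.

```cpp
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical base class: any kind of categorical data exposes its
// observed counts keyed by category label.
class CategoricalData {
public:
    virtual ~CategoricalData() {}
    virtual const std::map<std::string, double>& counts() const = 0;
};

// Hypothetical 1-way layout backed by a plain std::map.
class OneWayTable : public CategoricalData {
public:
    void add(const std::string& category, double n = 1.0) {
        counts_[category] += n;
    }
    const std::map<std::string, double>& counts() const { return counts_; }

private:
    std::map<std::string, double> counts_;
};

// Pearson chi square statistic: sum over categories of (O - E)^2 / E.
// Iterates over the expected counts; a category missing from the
// observed data counts as zero observations. Categories that appear
// only in the observed data are ignored.
double chi_square_statistic(const CategoricalData& observed,
                            const std::map<std::string, double>& expected) {
    const std::map<std::string, double>& obs = observed.counts();
    double stat = 0.0;
    std::map<std::string, double>::const_iterator e;
    for (e = expected.begin(); e != expected.end(); ++e) {
        if (e->second <= 0.0)
            throw std::invalid_argument("expected count must be positive");
        std::map<std::string, double>::const_iterator o = obs.find(e->first);
        const double observed_count = (o == obs.end()) ? 0.0 : o->second;
        const double diff = observed_count - e->second;
        stat += diff * diff / e->second;
    }
    return stat;
}
```

The test only sees the base class, so an N-way table or a binned
continuous variable could be added later as further subclasses without
touching chi_square_statistic.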

> > Each test is implemented as a class inheriting the base HypothesisTest...
> 
> Good.  I've been focusing on confidence interval methods so, to avoid
> code duplication, we might want to define the hypothesis tests in
> terms of the (more general) confidence intervals.

Ok, provided that the confidence level is properly adjusted for 1-tailed
tests.
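
By "properly adjusted" I mean the usual correspondence: a 1-tailed test
at significance level alpha matches one endpoint of a 2-sided interval
at confidence 1 - 2*alpha, while a 2-tailed test matches the interval at
confidence 1 - alpha. A small sketch, with hypothetical names that are
not part of Goose:

```cpp
#include <stdexcept>

// Hypothetical enumeration of the alternative hypothesis shape.
enum Tail { TWO_TAILED, ONE_TAILED };

// Confidence level of the two-sided interval whose endpoint(s)
// reproduce a test at significance level alpha.
double equivalent_confidence_level(double alpha, Tail tail) {
    if (alpha <= 0.0 || alpha >= 1.0)
        throw std::invalid_argument("alpha must lie in (0, 1)");
    if (tail == ONE_TAILED) {
        // Each tail of the interval holds alpha, so 2*alpha is excluded.
        if (alpha >= 0.5)
            throw std::invalid_argument("one-tailed alpha must be < 0.5");
        return 1.0 - 2.0 * alpha;
    }
    // Two-tailed: the interval excludes alpha/2 in each tail.
    return 1.0 - alpha;
}
```

So a 1-tailed test at alpha = 0.05 would check the null value against
the appropriate endpoint of a 90% interval, not a 95% one.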

> > The proportion z-tests' constructors throw an exception if the basic
> > assumptions to combat skewness in the sampling distribution are not met.
> 
> Now this raises an interesting philosophical issue: what do you do if
> someone runs a test on data that doesn't match the underlying
> assumptions of the test?  I don't mean pathological data, but
> situations where the test statistic (or whatever) can full well be
> calculated, but just won't necessarily be meaningful.
> 
> I think that throwing an exception is not The Right Thing to do here.
> Exceptions should be for unrecoverable errors, not a tool to stop
> people from performing well-defined but ill-advised operations.  If
> nothing else, this would make it impossible to write programs that
> analyze how common tests *fail* when various assumptions are
> violated.  (This isn't exactly an everyday thing to do, but it
> certainly isn't something that we should implicitly disallow.)

I can understand your logic here. Perhaps a higher-level entity could
produce a warning when that occurs, but still allow the test to proceed.
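
As a rough sketch of that approach (hypothetical names, not the code in
my patch): the test always computes the statistic when it is
well-defined, records a warning when a rule of thumb is violated, and
reserves exceptions for inputs where nothing can be computed at all. The
threshold of 10 expected successes and failures is one common rule of
thumb and is an assumption here.

```cpp
#include <cmath>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical result type: the statistic plus any advisory warnings.
struct ZTestResult {
    double z;
    std::vector<std::string> warnings;
};

// One-proportion z statistic for H0: p = p0, given the number of
// successes observed in n trials.
ZTestResult one_proportion_z(double successes, double n, double p0) {
    // Genuinely undefined inputs: no statistic can be calculated.
    if (n <= 0.0 || successes < 0.0 || successes > n)
        throw std::invalid_argument("need 0 <= successes <= n and n > 0");
    if (p0 <= 0.0 || p0 >= 1.0)
        throw std::invalid_argument("p0 must lie strictly between 0 and 1");

    ZTestResult result;
    const double p_hat = successes / n;
    result.z = (p_hat - p0) / std::sqrt(p0 * (1.0 - p0) / n);

    // Well-defined but ill-advised: warn, do not refuse.
    if (n * p0 < 10.0 || n * (1.0 - p0) < 10.0)
        result.warnings.push_back(
            "normal approximation may be poor: expected successes or "
            "failures under H0 is below 10");
    return result;
}
```

A front end could print the warnings, while a program studying how the
test fails under violated assumptions could simply ignore them.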

<snip>
> 
> I'll apply your patches, tweak things to eliminate duplication, and
> check stuff into CVS in the next day or two.

Thanks.

> If you (or anyone else reading this) are interested in projects to
> work on, here are some ideas.  (Of course, what anyone does should be
> driven by their own interests.)
> 
> * Confidence intervals and inference on estimated variances.
> 
> * Multiple Regression, following the model of how things are done
>   in the case of simple Regression.
> 
> * Confidence intervals and inference on parameters of other
>   distributions, such as exponential, Poisson, etc.
> 
> * Figure out a good interface and internal representation for
>   categorical data sets.  A good solution will be general enough to
>   work well for N-way tables.  However, 1-way and 2-way layouts should
>   still be very easy to deal with.
> 
> * Anything for dealing with time series.  A way to fit time series
>   data to various models (AR, ARIMA, ARCH, GARCH, etc.) would be
>   excellent.

Most of those things I'll have to research a bit more before I contribute 
much of anything, but if I have some time in the next few weeks I'll
consider it.

> -JT

-Bradford Hovinen


