Re: data analysis bug




Andreas J. Guelzow wrote:

I guess I should really file an official bug report but here is the short form:

The pooled variance calculated for the "T-test: Two Sample Assuming Equal Variances" is not always correct:

sample 1: 1,2,3
sample 2: 1, 2.2, 3.4

The individual variances are 1 and 1.44. The pooled variance is a weighted average of those two, i.e. it must lie between 1 and 1.44 (in fact it is 1.22), but gnumeric calculates 0.988.

This will normally also affect the t statistic etc. I didn't check whether they were incorrect but since they should use the pooled variance....

This is today's CVS version of gnumeric. (But I observed the same problem in an earlier version while using gnumeric in an introductory statistics class this May.)


I didn't think I could find this in the code that fast (but the code is very nicely written). In analysis-tools.c, lines 1436ff are:

  var = (set_one.sqrsum + set_two.sqrsum - (set_one.sum + set_two.sum) *
         (set_one.sum + set_two.sum)/ (set_one.n + set_two.n)) /
      (set_one.n + set_two.n - 1);  /* TODO: Correct??? */

This calculation is incorrect. It should be:

  var = ((set_one.sqrsum - set_one.sum2 / set_one.n) +
         (set_two.sqrsum - set_two.sum2 / set_two.n)) /
      (set_one.n + set_two.n - 2);
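
The two formulas can be checked against the sample data above with a minimal sketch. It is not the gnumeric code: the struct `sample_sums` and the functions `summarize`, `combined_variance` and `pooled_variance` are hypothetical names made up for illustration, and the sketch assumes that `sum2` stands for the square of the sample sum.

```c
/* Hypothetical stand-in for the per-sample sums used in the snippets above. */
typedef struct {
    double sum;     /* sum of the values */
    double sqrsum;  /* sum of the squared values */
    int n;          /* number of values */
} sample_sums;

/* Accumulate the sums for one sample. */
sample_sums summarize(const double *x, int n)
{
    sample_sums s = { 0.0, 0.0, n };
    for (int i = 0; i < n; i++) {
        s.sum += x[i];
        s.sqrsum += x[i] * x[i];
    }
    return s;
}

/* The formula as quoted from analysis-tools.c: it treats both samples
 * as one combined sample and returns the variance of that sample. */
double combined_variance(sample_sums a, sample_sums b)
{
    double sum = a.sum + b.sum;
    int n = a.n + b.n;
    return (a.sqrsum + b.sqrsum - sum * sum / n) / (n - 1);
}

/* The proposed formula: add the two within-sample sums of squared
 * deviations and divide by the pooled degrees of freedom. */
double pooled_variance(sample_sums a, sample_sums b)
{
    double ss_a = a.sqrsum - a.sum * a.sum / a.n;  /* assumes sum2 == sum * sum */
    double ss_b = b.sqrsum - b.sum * b.sum / b.n;
    return (ss_a + ss_b) / (a.n + b.n - 2);
}
```

Called with the sums of 1, 2, 3 and 1, 2.2, 3.4, `combined_variance` gives 0.988 (the value gnumeric reports) and `pooled_variance` gives 1.22.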

Interestingly enough, the calculation of the t value that follows does not even use this pooled variance. Instead it computes the t value under the assumption of unequal variances:

t = fabs (mean1 - mean2 - mean_diff) /
      sqrt (var1 / set_one.n + var2 / set_two.n);

It should really be:

t = fabs (mean1 - mean2 - mean_diff) /
       sqrt (var / set_one.n + var / set_two.n);

Andreas




--
Prof. Dr. Andreas J. Guelzow Assoc. Prof of Mathematics
Concordia University College of Alberta
http://www.math.concordia.ab.ca/aguelzow




