# Re: Median and other quantiles

*From*: Prof J C Nash <nashjc uottawa ca>
*To*: gnumeric-list gnome org
*Subject*: Re: Median and other quantiles
*Date*: Sat, 10 Feb 2007 15:15:15 -0500

`The rather heated discussion on the median has a historic quality. It is
``not quite as old as the arguments about how many angels can sit on the
``head of a pin, but not too far off. Perhaps I can cool things down by
`providing some perspective.

`There are, to the eyes of someone who has been teaching and working with
``this stuff for nearly 4 decades, three issues that are causing conflict:
`
1) The question of what we actually want "MEDIAN" to compute.
2) The question of how to express this clearly.
3) The question of how to actually do it.

`Question (1) concerns the allowed values for the quantity returned by a
``function of type "quantile", the general term for a median, percentile,
``or any other quantity that returns the "value in the scale of the data
``with rank k from the minimum (or, alternatively, from the maximum".
`

`We can insist on returning values from within the original data set, in
``which case we must sometimes choose between two values, or we can
``specify a rule for interpolation.
`

`The most common rule for the median is to interpolate when the number of
``elements, n, is even, leading to the "definition" that the median is the
``average of the middle ranked values for this case, while it takes the
``value of the middle rank datum when n is odd. I believe all the
``spreadsheets use this rule. Percentiles are usually (but not always)
``computed using interpolation, and I have seen some variety in the values
``given for quartiles in different statistical packages. Rob Hyndman
``published a quite good paper in The American Statistician about 8-10
``years ago on the subject.
`

`John Tukey (stem and leaf graphs, box and whisker plots, five number
``summaries, Exploratory Data Analysis, etc.) liked to have measures that
``belonged to the data set. So he defined Depth as the rank of an
``observation from the nearest "end" (highest or lowest) and Tukey
``boxplots use Hinges rather than Quartiles (which are generally
``interpolated, but not always by consistent formulas).
`

`See http://macnash.admin.uottawa.ca/files/stembox.txt for a bit of a
``discussion on Depths etc.
`

`Ultimately, this question boils down to where to cut to divide 4 candies
``among 5 children. No matter what you do, things get ugly. For large n,
``whether you use interpolation, and what formula for interpolating,
``becomes less important.
`

`When we come to (2), I am surprised nobody is using RANK rather than
``SORT. To compute quantiles, one does need to rank the data in some way,
``but we don't actually need to sort it. There were a number of articles
``about 30 years ago on ranking vs sorting, and if anyone is interested I
``can try to dig some up. If I were writing the MEDIAN definition for the
``TC, I would change it in one word to:
`

`MEDIAN logically _ranks_ the numbers (lowest to highest). If given an
``odd number of values, MEDIAN returns the middle value. If given an even
`number of values, MEDIAN returns the arithmetic average of the two
middle values.

`In other words, I'll live with the interpolation. Actually, I believe
``Tukey (someone correct me if I'm wrong, my copy of EDA is in another
``office) defined median as the value with maximum depth from either end
``or the arithmetic average of the two values having maximal depth if the
``value is not unique.
`

`Question (3) is one for the programmers and algorithm performance
``people. I think a fast ranking algorithm would be very useful to improve
``performance in a lot of spreadsheet features, not just the median. But
``it should be an algorithm that is quite "clean" so that application to
``more than just a column or row range doesn't end up causing all kinds of
``potential for bugs. For the sake of development and testing, it may be
``useful to use compile or even execution switches based on some stored
``control parameter, so that informed users can try out different ranking
``algorithms. Note that this may be helpful for increasing sort
``performance on very large sheets.
`

`Note that RANK() gives the rank of a number in a set. We actually want
``the number that is ranked at some level.
`
No doubt the discussion will continue.
JN

[

Date Prev][

Date Next] [

Thread Prev][

Thread Next]
[

Thread Index]
[

Date Index]
[

Author Index]