# Median: Oasis and Fast Sorting Algorithm

*From*: Leonard Mada <discoleo gmx net>
*To*: dev sc openoffice org, gnumeric-list gnome org
*Subject*: Median: Oasis and Fast Sorting Algorithm
*Date*: Sat, 10 Feb 2007 00:38:41 +0200

`This post deals with algorithms to calculate the median and the
``repercussions of such algorithms on the definition of the median.
`

`To calculate the median, many textbooks indicate that the array must be
``*sorted first*. This is both unnecessary as well as very *expensive* as
``a full sort is very time-consuming.
`

`1. I will therefore try to introduce some methods that do NOT need full
``sort to calculate the median. These are significantly faster than the
``classic algorithm. Because the sorted array is NOT further used in a
``spreadsheet after calculating the median, the sort is in most
``circumstances superfluous.
`

`2. The *OASIS open formula document* puts the sorting of the array as a
``prerequisite to defining the median. This is misleading and should
``therefore be replaced with a more algorithm neutral definition.
``Unfortunately, the subscription to OASIS is too expensive for me,
``therefore I hope that other persons that do have access to the
``development board point this out.
`

`I do not know, which algorithms do gnumeric and OOo Calc use when
``calculating the median, BUT, because these alternative algorithms offer
``many advantages, they should be strongly considered (IF the current
``algorithm uses the usual full sort).
`
1. FAST ALGORITHMS
==================

`Some fast algorithms are presented on the following page:
``http://ndevilla.free.fr/median/median/node20.html
`

`I have imagined a different algorithm, too, although I never tested it.
``I will briefly describe the case with 2*n elements (2n+1 should be similar):
`1. take first (n+1) elements

`2. sort these elements using e.g. Quick Sort (if enough memory) or a
``different O(n log (n)) algorithm => this sorted array is x[0] to x[n]
``(has n+1 elements)
`3. there are still (n-1) elements to process
4. WHILE (still unsorted elements) {

`5. IF (next element > x[n]) => drop this element and drop x[0]; => we
``have (n-2) elements left to process; and (n) elements in the sorted
``array: x[1] to x[n]; => renaming array to x[0] to x[n-1];
``6. ELSE IF (next element < x[0]) => drop this element; drop x[n], too;
``=> (n-2) elements left to process; (n) elements left in sorted array:
``x[0] to x[n-1];
``7. ELSE: drop x[0] and x[n]; new array is x[1] to x[n-1] (n-2 elements);
``put new element in this array and sort this new array (because this
``array is already sorted, adding one value should proceed with log(n)
``speed) => will have now (n-1) elements => renaming (new x[0]) to (new
``x[n-1]);
`8. REPEAT (instead of 'n' use last element in sorted array)
Initially: (n+1) values in sorted array; (n-1) still to process
1st iteration: (n) values; (n-2) values;
2nd iteration: (n-1) values; (n-3) values;
...
(n-2) iterations: 3 values; 1 value left to process;

`last iteration: in the end, the sorted array will consist of only 2
``values => median = (x[0]+x[1])/2
`

`This algorithm should have at most O(n/2 * log(n) -1 + log[(n/2 -1)!])
``(where '!' is factorial), but it is probably faster. [I am not sure,
``that my calculation is actually accurate, so take this result carefully.]
`

`It is NOT recursive, it computes actually the median accurate even for
``an even number of elements and does NOT consume as much memory as other
``algorithms.
`
2. OASIS Open Formula Format
=======================
The OASIS open formula document (2007-02-08) describes the median as:

MEDIAN
Summary: Returns the median (middle) value in the list.
Syntax: MEDIAN( { NumberSequence X}+ )
Returns: Number
Semantics:

`MEDIAN logically sorts the numbers (lowest to highest). If given an
``odd number of values, MEDIAN returns the middle value. If given an
``even number of values, MEDIAN returns the arithmetic average of the
``two middle values.
`

`This is a bit misleading. *MEDIAN logically sorts the numbers* is NOT
``really necessary to calculate the median. Actually, such an algorithm is
``notoriously slow. There are far better alternatives to get the median,
``and therefore a more neutral definition for the median seems warranted.
``Also, this definition is ambiguous for sequences like 1,1,2,3, where the
``two middle values are in effect 3 elements (two ones and the element 2).
`
DEFINITION
==========

`This is an algorithmic definition, too, BUT leaves the option open which
``algorithm is chosen.
``1. Odd numbers: The median is that particular element from the list that
``bisects the list into 2 halves, so that one can select 2 non-overlapping
``sets with an equal number of elements, so that all elements are greater
``or equal to the median in one set and smaller or equal in the 2nd set
``(the element that is the median is excluded).
``2. Even numbers: The median is the mean of those 2 values that bisect
``the list in two equal halves, so that we can find 2 non-overlapping sets
``with an equal number of elements, one set in which all elements are
``larger or equal to the median and the second where all elements are
``smaller or equal to the median (the 2 elements from which the median was
``calculated are excluded).
`
Test Case:
1,2,3 => median is 2 => 2 equal lists: a.) element 1; b.) element 3;

`1,2,3,4 => 2 and 3 bisect the list; median is 2.5 => 2 equal lists: a.)
``element 1; b.) element 4;
``1,1,3 => median is 1 => 2 equal lists: a.) element 1 (only one); b.)
``element 3;
``1,1,2,2 => 1 and 2 bisect list; median is 1.5 => 2 equal lists: a.)
``element 1; b.) element 2 (both only once)
``1,1,1 => median is 1 => 2 equal lists: a.) element 1; b.) element 1
``(both only once; 3rd element 1 is excluded as it is median)
``1,1,1,3 => 1 and 1 bisect list => median is 1 => 2 equal lists: a.)
``element 1 (only once; the 2 other elements 1 have been excluded); b.)
``element 3;
``1,1,1,1 => 1 and 1 bisect list => median is 1; 2 equal lists, each
``consisting of one element 1 (the other 2 elements of 1 have been
``excluded per definition);
`

`Although this definition is somehow more complex, I believe it is more
``accurate and more algorithm neutral.
`
Kind regards,
Leonard Mada

[

Date Prev][

Date Next] [

Thread Prev][

Thread Next]
[

Thread Index]
[

Date Index]
[

Author Index]