Median: Oasis and Fast Sorting Algorithm
- From: Leonard Mada <discoleo gmx net>
- To: dev sc openoffice org, gnumeric-list gnome org
- Subject: Median: Oasis and Fast Sorting Algorithm
- Date: Sat, 10 Feb 2007 00:38:41 +0200
This post deals with algorithms to calculate the median and the
repercussions of such algorithms on the definition of the median.
To calculate the median, many textbooks indicate that the array must be
*sorted first*. This is both unnecessary as well as very *expensive* as
a full sort is very time-consuming.
1. I will therefore try to introduce some methods that do NOT need full
sort to calculate the median. These are significantly faster than the
classic algorithm. Because the sorted array is NOT further used in a
spreadsheet after calculating the median, the sort is in most
circumstances superfluous.
2. The *OASIS open formula document* puts the sorting of the array as a
prerequisite to defining the median. This is misleading and should
therefore be replaced with a more algorithm neutral definition.
Unfortunately, the subscription to OASIS is too expensive for me,
therefore I hope that other persons that do have access to the
development board point this out.
I do not know, which algorithms do gnumeric and OOo Calc use when
calculating the median, BUT, because these alternative algorithms offer
many advantages, they should be strongly considered (IF the current
algorithm uses the usual full sort).
1. FAST ALGORITHMS
==================
Some fast algorithms are presented on the following page:
http://ndevilla.free.fr/median/median/node20.html
I have imagined a different algorithm, too, although I never tested it.
I will briefly describe the case with 2*n elements (2n+1 should be similar):
1. take first (n+1) elements
2. sort these elements using e.g. Quick Sort (if enough memory) or a
different O(n log (n)) algorithm => this sorted array is x[0] to x[n]
(has n+1 elements)
3. there are still (n-1) elements to process
4. WHILE (still unsorted elements) {
5. IF (next element > x[n]) => drop this element and drop x[0]; => we
have (n-2) elements left to process; and (n) elements in the sorted
array: x[1] to x[n]; => renaming array to x[0] to x[n-1];
6. ELSE IF (next element < x[0]) => drop this element; drop x[n], too;
=> (n-2) elements left to process; (n) elements left in sorted array:
x[0] to x[n-1];
7. ELSE: drop x[0] and x[n]; new array is x[1] to x[n-1] (n-2 elements);
put new element in this array and sort this new array (because this
array is already sorted, adding one value should proceed with log(n)
speed) => will have now (n-1) elements => renaming (new x[0]) to (new
x[n-1]);
8. REPEAT (instead of 'n' use last element in sorted array)
Initially: (n+1) values in sorted array; (n-1) still to process
1st iteration: (n) values; (n-2) values;
2nd iteration: (n-1) values; (n-3) values;
...
(n-2) iterations: 3 values; 1 value left to process;
last iteration: in the end, the sorted array will consist of only 2
values => median = (x[0]+x[1])/2
This algorithm should have at most O(n/2 * log(n) -1 + log[(n/2 -1)!])
(where '!' is factorial), but it is probably faster. [I am not sure,
that my calculation is actually accurate, so take this result carefully.]
It is NOT recursive, it computes actually the median accurate even for
an even number of elements and does NOT consume as much memory as other
algorithms.
2. OASIS Open Formula Format
=======================
The OASIS open formula document (2007-02-08) describes the median as:
MEDIAN
Summary: Returns the median (middle) value in the list.
Syntax: MEDIAN( { NumberSequence X}+ )
Returns: Number
Semantics:
MEDIAN logically sorts the numbers (lowest to highest). If given an
odd number of values, MEDIAN returns the middle value. If given an
even number of values, MEDIAN returns the arithmetic average of the
two middle values.
This is a bit misleading. *MEDIAN logically sorts the numbers* is NOT
really necessary to calculate the median. Actually, such an algorithm is
notoriously slow. There are far better alternatives to get the median,
and therefore a more neutral definition for the median seems warranted.
Also, this definition is ambiguous for sequences like 1,1,2,3, where the
two middle values are in effect 3 elements (two ones and the element 2).
DEFINITION
==========
This is an algorithmic definition, too, BUT leaves the option open which
algorithm is chosen.
1. Odd numbers: The median is that particular element from the list that
bisects the list into 2 halves, so that one can select 2 non-overlapping
sets with an equal number of elements, so that all elements are greater
or equal to the median in one set and smaller or equal in the 2nd set
(the element that is the median is excluded).
2. Even numbers: The median is the mean of those 2 values that bisect
the list in two equal halves, so that we can find 2 non-overlapping sets
with an equal number of elements, one set in which all elements are
larger or equal to the median and the second where all elements are
smaller or equal to the median (the 2 elements from which the median was
calculated are excluded).
Test Case:
1,2,3 => median is 2 => 2 equal lists: a.) element 1; b.) element 3;
1,2,3,4 => 2 and 3 bisect the list; median is 2.5 => 2 equal lists: a.)
element 1; b.) element 4;
1,1,3 => median is 1 => 2 equal lists: a.) element 1 (only one); b.)
element 3;
1,1,2,2 => 1 and 2 bisect list; median is 1.5 => 2 equal lists: a.)
element 1; b.) element 2 (both only once)
1,1,1 => median is 1 => 2 equal lists: a.) element 1; b.) element 1
(both only once; 3rd element 1 is excluded as it is median)
1,1,1,3 => 1 and 1 bisect list => median is 1 => 2 equal lists: a.)
element 1 (only once; the 2 other elements 1 have been excluded); b.)
element 3;
1,1,1,1 => 1 and 1 bisect list => median is 1; 2 equal lists, each
consisting of one element 1 (the other 2 elements of 1 have been
excluded per definition);
Although this definition is somehow more complex, I believe it is more
accurate and more algorithm neutral.
Kind regards,
Leonard Mada
[
Date Prev][
Date Next] [
Thread Prev][
Thread Next]
[
Thread Index]
[
Date Index]
[
Author Index]