Median: Oasis and Fast Sorting Algorithm

This post deals with algorithms to calculate the median and the repercussions of such algorithms on the definition of the median.

To calculate the median, many textbooks indicate that the array must be *sorted first*. This is both unnecessary as well as very *expensive* as a full sort is very time-consuming.

1. I will therefore try to introduce some methods that do NOT need full sort to calculate the median. These are significantly faster than the classic algorithm. Because the sorted array is NOT further used in a spreadsheet after calculating the median, the sort is in most circumstances superfluous.

2. The *OASIS open formula document* puts the sorting of the array as a prerequisite to defining the median. This is misleading and should therefore be replaced with a more algorithm neutral definition. Unfortunately, the subscription to OASIS is too expensive for me, therefore I hope that other persons that do have access to the development board point this out.

I do not know, which algorithms do gnumeric and OOo Calc use when calculating the median, BUT, because these alternative algorithms offer many advantages, they should be strongly considered (IF the current algorithm uses the usual full sort).

Some fast algorithms are presented on the following page:

I have imagined a different algorithm, too, although I never tested it. I will briefly describe the case with 2*n elements (2n+1 should be similar):
1. take first (n+1) elements
2. sort these elements using e.g. Quick Sort (if enough memory) or a different O(n log (n)) algorithm => this sorted array is x[0] to x[n] (has n+1 elements)
3. there are still (n-1) elements to process
4. WHILE (still unsorted elements) {
5. IF (next element > x[n]) => drop this element and drop x[0]; => we have (n-2) elements left to process; and (n) elements in the sorted array: x[1] to x[n]; => renaming array to x[0] to x[n-1]; 6. ELSE IF (next element < x[0]) => drop this element; drop x[n], too; => (n-2) elements left to process; (n) elements left in sorted array: x[0] to x[n-1]; 7. ELSE: drop x[0] and x[n]; new array is x[1] to x[n-1] (n-2 elements); put new element in this array and sort this new array (because this array is already sorted, adding one value should proceed with log(n) speed) => will have now (n-1) elements => renaming (new x[0]) to (new x[n-1]);
8. REPEAT (instead of 'n' use last element in sorted array)

Initially: (n+1) values in sorted array; (n-1) still to process
1st iteration: (n) values; (n-2) values;
2nd iteration: (n-1) values; (n-3) values;
(n-2) iterations: 3 values; 1 value left to process;
last iteration: in the end, the sorted array will consist of only 2 values => median = (x[0]+x[1])/2

This algorithm should have at most O(n/2 * log(n) -1 + log[(n/2 -1)!]) (where '!' is factorial), but it is probably faster. [I am not sure, that my calculation is actually accurate, so take this result carefully.]

It is NOT recursive, it computes actually the median accurate even for an even number of elements and does NOT consume as much memory as other algorithms.

2. OASIS Open Formula Format
The OASIS open formula document (2007-02-08) describes the median as:
Summary: Returns the median (middle) value in the list.
Syntax: MEDIAN( { NumberSequence X}+ )
Returns: Number
MEDIAN logically sorts the numbers (lowest to highest). If given an odd number of values, MEDIAN returns the middle value. If given an even number of values, MEDIAN returns the arithmetic average of the two middle values.

This is a bit misleading. *MEDIAN logically sorts the numbers* is NOT really necessary to calculate the median. Actually, such an algorithm is notoriously slow. There are far better alternatives to get the median, and therefore a more neutral definition for the median seems warranted. Also, this definition is ambiguous for sequences like 1,1,2,3, where the two middle values are in effect 3 elements (two ones and the element 2).

This is an algorithmic definition, too, BUT leaves the option open which algorithm is chosen. 1. Odd numbers: The median is that particular element from the list that bisects the list into 2 halves, so that one can select 2 non-overlapping sets with an equal number of elements, so that all elements are greater or equal to the median in one set and smaller or equal in the 2nd set (the element that is the median is excluded). 2. Even numbers: The median is the mean of those 2 values that bisect the list in two equal halves, so that we can find 2 non-overlapping sets with an equal number of elements, one set in which all elements are larger or equal to the median and the second where all elements are smaller or equal to the median (the 2 elements from which the median was calculated are excluded).

Test Case:
1,2,3 => median is 2 => 2 equal lists: a.) element 1; b.) element 3;
1,2,3,4 => 2 and 3 bisect the list; median is 2.5 => 2 equal lists: a.) element 1; b.) element 4; 1,1,3 => median is 1 => 2 equal lists: a.) element 1 (only one); b.) element 3; 1,1,2,2 => 1 and 2 bisect list; median is 1.5 => 2 equal lists: a.) element 1; b.) element 2 (both only once) 1,1,1 => median is 1 => 2 equal lists: a.) element 1; b.) element 1 (both only once; 3rd element 1 is excluded as it is median) 1,1,1,3 => 1 and 1 bisect list => median is 1 => 2 equal lists: a.) element 1 (only once; the 2 other elements 1 have been excluded); b.) element 3; 1,1,1,1 => 1 and 1 bisect list => median is 1; 2 equal lists, each consisting of one element 1 (the other 2 elements of 1 have been excluded per definition);

Although this definition is somehow more complex, I believe it is more accurate and more algorithm neutral.

Kind regards,

Leonard Mada

[Date Prev][Date Next]   [Thread Prev][Thread Next]   [Thread Index] [Date Index] [Author Index]