Re: strings in gnumeric / awk / etc.



Hello,

Prof J C Nash wrote:
Some of the issues being raised suggest that a spreadsheet is not the right analytic tool. How about a data frame in R?

Well, this is difficult, too. When a bunch of diagnoses (or symptoms) are lumped together in one single column, that won't be easy to work with in R either.

A much more difficult case is when a patient stays for longer than one day (which is usually the case) and I need a specific string (say a diagnosis, a symptom, ...) that may occur on any of the days, BUT I need either the first occurrence, or the number of days with this diagnosis, or some more complex search. I do work extensively with R (that is why I posted this OOo issue, http://qa.openoffice.org/issues/show_bug.cgi?id=66589), but it is NO substitute for a spreadsheet.
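To make the kind of search concrete, here is a minimal sketch of how gawk could handle it. It assumes a hypothetical tab-separated layout (patient ID, day of stay, then all diagnoses lumped into one cell) with rows sorted by day within each patient; the sample rows and the search term "pneumonia" are made up for illustration only.

```shell
#!/bin/sh
# Hypothetical input: patient <TAB> day <TAB> diagnoses lumped in one cell.
# Rows are assumed to be sorted by day within each patient.
report=$(printf 'P1\t1\tfever; cough\nP1\t2\tfever; pneumonia\nP1\t3\tpneumonia\nP2\t1\theadache\nP2\t2\tpneumonia; fever\n' |
awk -F'\t' -v term='pneumonia' '
    # Case-insensitive search of the diagnoses column for the term
    index(tolower($3), term) {
        if (!($1 in first)) {       # remember the first day it occurs
            first[$1] = $2
            order[++n] = $1         # keep patients in order of appearance
        }
        days[$1]++                  # count the days that mention it
    }
    END {
        for (i = 1; i <= n; i++) {
            p = order[i]
            printf "%s\tfirst day %s\tdays %d\n", p, first[p], days[p]
        }
    }
')
printf '%s\n' "$report"
```

For the sample rows this reports day 2 as the first occurrence for both patients, with 2 matching days for P1 and 1 for P2. A plain substring match is used here; a real search would probably need a regular expression to avoid partial-word hits.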

Actually, spreadsheets are still the most used application in the life sciences. I find even Epi-Info NOT as good (it has better analysis possibilities than a spreadsheet, BUT, of course, it cannot compete with R). Almost every doctor uses Excel, and it is the de facto standard when doctors do research (I refuse to use it, and some epidemiologists use Epi-Info, but I believe these are mere exceptions).

I posted another use for gawk, see the OOo issue http://qa.openoffice.org/issues/show_bug.cgi?id=66816, where I wanted to create some dummy variables for the medical department:

GAWK SCRIPT
# $1 contains the input - the hospital unit
{
    $0 = tolower($0)   # lower-case the record so matching ignores case
    unit = $1          # match on the unit only, not on the dummy columns

    $2 = 0 # neurosurgery vs non-neurosurgery
    $3 = 0 # neurology vs non-neurology
    $4 = 0 # general surgery vs non-surgery
    $5 = 0 # internal medicine vs non-im
    $6 = 1 # ERROR var, if unknown abbreviation

    # NEUROSURGERY
    if (unit ~ /nch/) { $2 = 1; $6 = 0 }
    # NEUROLOGY
    if (unit ~ /^n[ \t]*$|^ne/) { $3 = 1; $6 = 0 }
    # GENERAL SURGERY
    if (unit ~ /^ch/) { $4 = 1; $6 = 0 }
    # INTERNAL MEDICINE
    if (unit ~ /mi|end|nut/) { $5 = 1; $6 = 0 }

    print $0 >> "out-file"
}
### END SCRIPT

Try to do this with spreadsheet functions, and it will turn into a nightmare.

gawk has many advantages, and I may point out another two:
- it is simple and very fast, both to write and to execute, even on huge datasets
- the code is structured and visible, so it is easy to understand what it does (this is NOT always the case when you write complex formulas in a spreadsheet)

I hope these are enough reasons to implement a simple menu-entry in gnumeric that runs awk/gawk scripts.

Specifically:
- the user selects some cells
- chooses Menu-Entry: RUN gawk-script (a dialog box opens allowing the user to select the proper script)
- gnumeric should then open a bidirectional pipeline to gawk
- gnumeric should provide some default values for the field separator (FS) and record separator (RS), which should also be used to join the cells and rows of the selection when piping the data stream into gawk
- gawk's output should be split back into cells using the same FS and RS (probably into a new sheet, like ANOVA)
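The round trip in the list above can be mimicked outside gnumeric as a rough sketch. Nothing here is existing gnumeric code: the shell merely stands in for the spreadsheet, the sample cells are invented, and the separators (tab between cells, newline between rows) are assumed defaults.

```shell
#!/bin/sh
# Hypothetical round trip: selected cells -> text -> awk -> text -> cells.
# Nothing here is an existing gnumeric API; the shell stands in for gnumeric.
FS=$(printf '\t')        # assumed default field separator: joins cells in a row
# The record separator (RS) is assumed to be a newline: one row per line.

# 1. The "selection": two rows of two cells, joined with FS and RS.
selection=$(printf 'nch%s12\nne%s7\n' "$FS" "$FS")

# 2. Pipe the text through a user-chosen script (here: double column 2).
result=$(printf '%s\n' "$selection" |
    awk -F"$FS" -v OFS="$FS" '{ $2 = $2 * 2; print }')

# 3. Split the output back into cells with the same separators.
cells=$(printf '%s\n' "$result" |
    awk -F"$FS" '{ for (i = 1; i <= NF; i++) printf "R%dC%d=%s\n", NR, i, $i }')
printf '%s\n' "$cells"
```

Each output line names a target cell (row, column) and its value, which is roughly what gnumeric would need in order to fill the new sheet. Cells that themselves contain a tab or a newline would break this naive scheme, so the separators probably have to be user-selectable.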

I believe this is easy to code and quite useful.

Many thanks in advance,

Leonard Mada


