Re: Re: to csv or not to csv

Fundamental. For example, to export data to R. Ahhh, the days when the gnumeric R interface was still there...

Sent from a Sony Xperia smartphone

---- Tim Chase wrote ----

On 2020-10-04 19:02, John Denker via gnumeric-list wrote:
> > The first rule of csv files is "don't use csv files".
>
> That scares me.  In just one of my directories, I just now counted
> two dozen .csv files created in the last 24 hours.  A total of 12
> megabytes today, just in this one directory.  There are others.
>
> My professional life depends on .csv files that I get from various
> sources. Data is available to me in that format, and often no other.
>
> Very often I need to do calculations that can't be done in a
> spreadsheet, so I export the data, krunch it using thousands of
> lines of C++ and/or perl, and then import it again.

CSV files come with lots of potential issues, mostly revolving around
a lack of standardization:

- encoding may or may not be specified (is this UTF8? UTF16? UTF32?
  Latin1? Windows-1252? any of a gazillion other encodings?)

- how do you quote the quote character? (doubling it, escaping it
  with a backslash, encoding it with some other escape method, ...)

- does it distinguish between an empty value and an empty quoted
  value? (sometimes the former means Null while the latter means an
  empty string; other times they're the same)

- should one expect headers? If so, does case matter?  Does order
  matter? (I often have columns move around but if accessed by
  header, they're adequately consistent)

- can more than one column have the same header?

- what should happen if a row has fewer entries than the header row?

- what should happen if a row has *more* entries than the header row?

- what should happen if there's no header row, but rows don't have
  the same number of columns?

- parsing with some tools like awk(1) can become tedious when the
  comma-delimiter can appear within the data (so you have to
  special-case the quoting)

- is the end-of-line character a Unix "LF", a DOS "CR/LF", an old Mac
  "CR", or the largely-unused Record Separator (RS=0x1E)?

- what happens if data contains newlines in it?  does odd quoting
  mean that the row is continued on the next line?

- sometimes things are called CSV when they use alternate delimiters
  such as tab (though often called TSV files), pipe, colon, or
  whatever other delimiter character comes up on a whim

- the data is largely 2d only, so there's no mechanism for including
  multiple sheets of data other than multiple files
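
To make the quoting items above concrete, here is a minimal sketch,
assuming Python's standard csv module and a made-up two-row sample
(the function names and the data are illustrative only).  It contrasts
a naive split on commas and newlines with a parser that assumes the
"doubled quote" convention:

```python
import csv
import io

# Made-up sample: header row, then one record containing an embedded
# comma, a doubled quote, and an embedded newline inside quotes.
SAMPLE = 'name,comment\r\n"Smith, Jane","said ""hi""\nthen left"\r\n'


def naive_rows(text):
    """Split on line breaks and commas, ignoring quoting entirely."""
    return [line.split(",") for line in text.splitlines()]


def parsed_rows(text):
    """Parse with the csv module, assuming doubled-quote escaping."""
    # newline="" leaves line endings untouched so the csv module can
    # decide which ones sit inside a quoted field.
    return list(csv.reader(io.StringIO(text, newline="")))


if __name__ == "__main__":
    print(naive_rows(SAMPLE))
    print(parsed_rows(SAMPLE))
```

On this sample the naive split produces three broken rows, while the
csv module returns the header plus one two-field record.  If the file
escapes quotes with a backslash instead, the defaults assumed here
would be wrong for it.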

None of these is necessarily a deal-breaker.  I deal with processing
hundreds of MB (maybe even GB) of CSV files each month using Python &
awk, but the road is paved with the above perils.
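
As a rough illustration of the header and row-width questions, the
following sketch (again assuming Python's csv module; the audit()
helper and its case-insensitive header comparison are assumptions made
for the example, not a standard) reports duplicate headers and rows
that are shorter or longer than the header row:

```python
import csv
import io


def audit(text):
    """Return (line_number, problem) pairs for a CSV text with a header row."""
    reader = csv.reader(io.StringIO(text, newline=""))
    header = next(reader)  # raises StopIteration on empty input
    problems = []

    seen = set()
    for name in header:
        key = name.strip().lower()  # assumption: header case does not matter
        if key in seen:
            problems.append((reader.line_num, f"duplicate header {name!r}"))
        seen.add(key)

    for row in reader:
        if len(row) != len(header):
            kind = "short" if len(row) < len(header) else "long"
            problems.append(
                (reader.line_num,
                 f"{kind} row: {len(row)} fields, expected {len(header)}")
            )
    return problems


if __name__ == "__main__":
    for line_num, problem in audit("id,name,Name\n1,a,b\n2,c\n3,d,e,f\n"):
        print(line_num, problem)
```

What to do with a short or long row (pad, truncate, reject) is still a
decision for whoever owns the data; the sketch only makes the problem
visible.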

If you know the answers to the questions above for the data in
question, or haven't hit any of those issues, and you know that the
file format is predictable, then I would treat "don't use CSV files"
as more of an admonition to know what you're doing, and a reminder
that if something breaks, you get to keep all the pieces.  It's an
unfortunately underdefined (but common) means for transmitting data.
There are better ways, but <opinion class=controversial>like PHP,
JavaScript, and MySQL, CSV files are used because they're popular,
not because they're particularly good; I use PHP, JavaScript,
MySQL, and CSV files for their ubiquity, not their
excellence.</opinion>  So use guilt-free, but use with caution.
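
One way to "know what you're doing" is to write the answers down in
code instead of relying on defaults.  A minimal sketch, assuming
Python's csv module, UTF-8 input, comma delimiters and doubled-quote
escaping (all of which are assumptions to be replaced with whatever is
actually true of your files; read_known() is a hypothetical helper):

```python
import csv
import io


def read_known(raw, encoding="utf-8"):
    """Decode and parse CSV bytes with every format choice spelled out."""
    # Decoding fails loudly (UnicodeDecodeError) instead of guessing.
    text = raw.decode(encoding)
    reader = csv.DictReader(
        io.StringIO(text, newline=""),
        delimiter=",",
        quotechar='"',
        doublequote=True,   # a quote inside a field is written as ""
        strict=True,        # raise csv.Error on malformed quoting
    )
    return list(reader)


if __name__ == "__main__":
    print(read_known(b"x,y\r\n1,2\r\n"))
```

With these settings, a file in the wrong encoding or with stray quote
characters is rejected up front rather than silently mis-parsed.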


gnumeric-list mailing list
gnumeric-list gnome org
