[gnumeric-list gnome org: Re: to csv or not to csv]

From: User Hayden <bob statland org>
To: gnumeric-list gnome org
Subject: [gnumeric-list gnome org: Re: to csv or not to csv]
Date: Mon, 5 Oct 2020 12:09:19 -0400


I'm in statistics and grab CSV files from all over the world.  Like
John, I have experienced very few problems.  I usually open these in a
spreadsheet, clean as needed, then save as CSV and open in R.  Any
problems are usually because the creator did not follow the rules,
vague tho they may be.  I also find it a huge advantage that most CSV
files can be opened in a plain text editor and are not full of cryptic
markup or, worse, are binary files.  To me CSV files seem much less
troublesome than any spreadsheet file format.



----- Forwarded message from John Denker via gnumeric-list <gnumeric-list gnome org> -----

Date: Mon, 5 Oct 2020 08:19:11 -0700
From: John Denker via gnumeric-list <gnumeric-list gnome org>
To: gnumeric-list <gnumeric-list gnome org>
Subject: Re: to csv or not to csv
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101
        Thunderbird/68.10.0

On 10/4/20 8:34 PM, aguelzow wrote:


The issue is that csv is not very well defined.


That has never been a problem in the work that I do.
I can send and receive .csv with no problems.

So define your own text file format


1) As previously stated, literally every day I receive CSV files
from outside sources. The format is their choice, not mine.

2) So I hereby define my own personal text file format to be
comma separated values, as defined in RFC 4180 with a couple
of well-known relaxations(*).

  https://tools.ietf.org/html/rfc4180

I do not see any advantage to choosing any other text-file format.
Any other choice would be *less* standardized and *less* portable.

*) The relaxations are:
-- <CR> at end of line is optional. That is, unquoted <LF> is 
 accepted in lieu of the RFC <CR><LF>.  Every CSV parser I've
 ever seen accepts this.
-- The RFC requirement that all lines have the same number of fields
 is not enforced.

==========================================


On 10/4/20 8:26 PM, Tim Chase wrote:


CSV files come with lots of potential issues, mostly revolving around
a lack of standardization:

- encoding may or may not be specified (is this UTF8? UTF16? UTF32?
  Latin1? Windows-1252? any of a gazillion other encodings?)


That has never been a problem for me.  I doubt it is much of a problem
for anybody else.  CSV is transparent and agnostic to any of the byte-
based encodings you mention, because the only characters it cares about
are <comma>, <doubleQuote>, and unquoted <CR><LF> ... which are the same
in UTF-8, Latin1, Windows-1252, and a great many other encodings.

- how do you quote the quote character (doubling it, escaping with a
  backslash, encoded with some other escape method, ...)


You double it, as specified in the RFC.
This has never been a problem in the work that I do.

- does it distinguish between an empty value and an empty quoted
  value? (sometimes the former means Null while the latter means an
  empty string; other times they're the same)


a) In accordance with the RFC, CSV does not distinguish.

b) FWIW I have a C++ version of the parser that is templated.  You can
ask it to return strings *or* a derived class (derived from string) that
has an extra bit that says whether the value was quoted.  This allows me
to write a program that performs the identity transformation on CSV files,
not just semantically identical but verbatim and literatim identical.  I
have never used this feature except for the exceedingly narrow purpose of
testing the parser.  So again I say, this has never been a problem in the
work that I do.

- should one expect headers? If so, does case matter?  Does order
  matter? (I often have columns move around but if accessed by
  header, they're adequately consistent)


Sometimes there are headers. Sometimes not. Sometimes there are headers
that can be ignored. This has never been a problem in the work that I do.

- can more than one column have the same header?


The RFC doesn't say. This is a higher-level issue, above the CSV layer.

- what should happen if a row has fewer entries than the header row?


The RFC says the array should be rectangular. However, world does not end
if the array is non-rectangular:

        An,old,silent,pond
        A,frog,jumps,into,the,pond???
        Splash!,Silence,again.

The lack of headers and the non-rectangularness is not an issue for the CSV
parser.

- what should happen if a row has *more* entries than the header row?


See previous answer.

Also note that I have a utility that operates at the aforementioned
higher level, taking a rectangular CSV file with headers and converts
it to SQL. This higher level requires the headers to be well behaved,
but again this is not within the remit of the CSV parser.

- what should happen if there's no header row, but rows don't have
  the same number of columns?


See above.

- parsing with some tools like awk(1) can become tedious when the
  comma-delimiter can appear within the data (so you have to
  special-case the quoting)


That is one of the eleventeen reasons why I don't use awk to parse CSV
files. I use perl and C++. I have libraries that parse CSV properly.

- is the end-of-line character a Unix "LF", a DOS "CR/LF", an old Mac
  "CR", or the largely-unused Record Separator (RS=0x30)


The RFC mandates <CR><LF> but plain <LF> is AFAICT universally accepted
also.

- what happens if data contains newlines in it?  does odd quoting
  mean that the row is continued on the next line?


The RFC makes it clear that newlines can appear in quoted strings.
This is not an issue in the work that I do. I have never encountered
a quoted newline except when testing my parser.

- sometimes things are called CSV when they use alternate delimiters
  such as tab (though often called TSV files), pipe, colon, or
  whatever other delimiter character that comes up on a whim


This is not a problem in the work that I do.
It is not a problem with CSV.
Garbage in, garbage out.

- the data is largely 2d only, so there's no mechanism for including
  multiple sheets of data other than multiple files


That brings us back to my original question:  Suppose I have multi-
dimensional data.  What format should I use for exporting it from
gnumeric?

If you know the answers to those questions above for your data in
question or haven't hit any of those issues, and you know that the
file-format is predictable, then I would treat the "don't use CSV
files" as more of an admonition to know what you're doing.


OK.

Please consider the possibility that I sometimes know what I'm doing.

if something breaks, you get to keep all the pieces.  It's an
unfortunately underdefined (but common) means for transmitting data.
There are better ways, but <opinion class=controversial>like PHP,
JavaScript, and MySQL, they are used because they're popular, not
because they're particularly good; I use PHP, JavaScript,
MySQL, and CSV files for their ubiquity, not their
excellence.</opinion>  So use guilt-free, but use with caution.


Is there a way to export SQL from gnumeric?  I don't see it mentioned
in the documentation.

By javascript I assume you mean JSON. Have you had good luck parsing
that using awk?  Is there a way to export JSON from gnumeric?
_______________________________________________
gnumeric-list mailing list
gnumeric-list gnome org
https://mail.gnome.org/mailman/listinfo/gnumeric-list

----- End forwarded message -----

-- 

      _
     | |          Robert W. Hayden
     | |          5 Howard Street, Apartment 206
    /  |          Wilton, New Hampshire 03086  USA
   |   |          
   |   |          email: bob@ the site below
  /    |          website: http://statland.org
 |   x /          
 ''''''

Follow-Ups:
- Re: [gnumeric-list gnome org: Re: to csv or not to csv]
  - From: Emmanuel Pacaud

[Date Prev][Date Next] [Thread Prev][Thread Next] [Thread Index] [Date Index] [Author Index]