Re: [gnumeric-list gnome org: Re: to csv or not to csv]



Hi,

While reading this message exchange, I was a bit on the side "CSV is helpfull, but troubles are not far away". As a french locale user, my favorite issue is number internationalization.

And I have stumbled upon this article: https://arstechnica.com/tech-policy/2020/10/excel-glitch-may-have-caused-uk-to-underreport-covid-19-cases-by-15841/

Cheers,

Emmanuel.

Le lun. 5 oct. 2020 à 12:09, User Hayden via gnumeric-list <gnumeric-list gnome org> a écrit :
I'm in statistics and grab CSV files from all over the world. Like John, I have experienced very few problems. I usually open these in a spreadsheet, clean as needed, then save as CSV and open in R. Any problems are usually because the creator did not follow the rules, vague tho they may be. I also find it a huge advantage that most CSV files can be opened in a plain text editor and are not full of cryptic markup or, worse, are binary files. To me CSV files seem much less troublesome than any spreadsheet file format. ----- Forwarded message from John Denker via gnumeric-list <gnumeric-list gnome org> ----- Date: Mon, 5 Oct 2020 08:19:11 -0700 From: John Denker via gnumeric-list <gnumeric-list gnome org> To: gnumeric-list <gnumeric-list gnome org> Subject: Re: to csv or not to csv User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.10.0 On 10/4/20 8:34 PM, aguelzow wrote:
The issue is that csv is not very well defined.
That has never been a problem in the work that I do. I can send and receive .csv with no problems.
So define your own text file format
1) As previously stated, literally every day I receive CSV files from outside sources. The format is their choice, not mine. 2) So I hereby define my own personal text file format to be comma separated values, as defined in RFC 4180 with a couple of well-known relaxations(*). https://tools.ietf.org/html/rfc4180 I do not see any advantage to choosing any other text-file format. Any other choice would be *less* standardized and *less* portable. *) The relaxations are: -- <CR> at end of line is optional. That is, unquoted <LF> is accepted in lieu of the RFC <CR><LF>. Every CSV parser I've ever seen accepts this. -- The RFC requirement that all lines have the same number of fields is not enforced. ========================================== On 10/4/20 8:26 PM, Tim Chase wrote:
CSV files come with lots of potential issues, mostly revolving around a lack of standardization: - encoding may or may not be specified (is this UTF8? UTF16? UTF32? Latin1? Windows-1252? any of a gazillion other encodings?)
That has never been a problem for me. I doubt it is much of a problem for anybody else. CSV is transparent and agnostic to any of the byte- based encodings you mention, because the only characters it cares about are <comma>, <doubleQuote>, and unquoted <CR><LF> .. which are the same in UTF-8, Latin1, Windows-1252, and a great many other encodings.
- how do you quote the quote character (doubling it, escaping with a backslash, encoded with some other escape method, ...)
You double it, as specified in the RFC. This has never been a problem in the work that I do.
- does it distinguish between an empty value and an empty quoted value? (sometimes the former means Null while the latter means an empty string; other times they're the same)
a) In accordance with the RFC, CSV does not distinguish. b) FWIW I have a C++ version of the parser that is templated. You can ask it to return strings *or* a derived class (derived from string) that has an extra bit that says whether the value was quoted. This allows me to write a program that performs the identity transformation on CSV files, not just semantically identical but verbatim and literatim identical. I have never used this feature except for the exceedingly narrow purpose of testing the parser. So again I say, this has never been a problem in the work that I do.
- should one expect headers? If so, does case matter? Does order matter? (I often have columns move around but if accessed by header, they're adequately consistent)
Sometimes there are headers. Sometimes not. Sometimes there are headers that can be ignored. This has never been a problem in the work that I do.
- can more than one column have the same header?
The RFC doesn't say. This is a higher-level issue, above the CSV layer.
- what should happen if a row has fewer entries than the header row?
The RFC says the array should be rectangular. However, world does not end if the array is non-rectangular: An,old,silent,pond A,frog,jumps,into,the,pond??? Splash!,Silence,again. The lack of headers and the non-rectangularness is not an issue for the CSV parser.
- what should happen if a row has *more* entries than the header row?
See previous answer. Also note that I have a utility that operates at the aforementioned higher level, taking a rectangular CSV file with headers and converts it to SQL. This higher level requires the headers to be well behaved, but again this is not within the remit of the CSV parser.
- what should happen if there's no header row, but rows don't have the same number of columns?
See above.
- parsing with some tools like awk(1) can become tedious when the comma-delimiter can appear within the data (so you have to special-case the quoting)
That is one of the eleventeen reasons why I don't use awk to parse CSV files. I use perl and C++. I have libraries that parse CSV properly.
- is the end-of-line character a Unix "LF", a DOS "CR/LF", an old Mac "CR", or the largely-unused Record Separator (RS=0x30)
The RFC mandates <CR><LF> but plain <LF> is AFAICT universally accepted also.
- what happens if data contains newlines in it? does odd quoting mean that the row is continued on the next line?
The RFC makes it clear that newlines can appear in quoted strings. This is not an issue in the work that I do. I have never encountered a quoted newline except when testing my parser.
- sometimes things are called CSV when they use alternate delimiters such as tab (though often called TSV files), pipe, colon, or whatever other delimiter character that comes up on a whim
This is not a problem in the work that I do. It is not a problem with CSV. Garbage in, garbage out.
- the data is largely 2d only, so there's no mechanism for including multiple sheets of data other than multiple files
That brings us back to my original question: Suppose I have multi- dimensional data. What format should I use for exporting it from gnumeric?
If you know the answers to those questions above for your data in question or haven't hit any of those issues, and you know that the file-format is predictable, then I would treat the "don't use CSV files" as more of an admonition to know what you're doing.
OK. Please consider the possibility that I sometimes know what I'm doing.
if something breaks, you get to keep all the pieces. It's an unfortunately underdefined (but common) means for transmitting data. There are better ways, but <opinion class=controversial>like PHP, _javascript_, and MySQL, they are used because they're popular, not because they're particularly good; I use PHP, _javascript_, MySQL, and CSV files for their ubiquity, not their excellence.</opinion> So use guilt-free, but use with caution.
Is there a way to export SQL from gnumeric? I don't see it mentioned in the documentation. By _javascript_ I assume you mean JSON. Have you had good luck parsing that using awk? Is there a way to export JSON from gnumeric? _______________________________________________ gnumeric-list mailing list gnumeric-list gnome org https://mail.gnome.org/mailman/listinfo/gnumeric-list ----- End forwarded message -----
--
_ | | Robert W. Hayden | | 5 Howard Street, Apartment 206 / | Wilton, New Hampshire 03086 USA | | | | email: bob@ the site below / | website: http://statland.org | x / '''''' _______________________________________________ gnumeric-list mailing list gnumeric-list gnome org https://mail.gnome.org/mailman/listinfo/gnumeric-list


[Date Prev][Date Next]   [Thread Prev][Thread Next]   [Thread Index] [Date Index] [Author Index]