Re: More than you ever wanted to know about csv files (Re: to csv or not to csv)



On 10/11/20 1:02 PM, Morten Welinder wrote:

TL;DR: Avoid csv files if you can.

This note described what Gnumeric does with csv files and why csv files are
a bad idea to begin with.

After reading the note I came to a somewhat different conclusion.

The overwhelming #1 conclusion is wow, this really needs to be
documented.  Clearly and prominently documented.

Input behavior should be documented separately from output behavior.
For example, AFAICT gnumeric never emits a formula to a .csv file,
but I learned today, to my amazement, that it (sometimes!) accepts
formulas on input from a .csv file.  This is a very valuable feature.

Again, documentation is key.

Very often, I write .csv files with the intention of importing
them into gnumeric half a second later. Rather than configuring
a custom gnumeric importer, it would be easier for me and in all
ways better for me to configure my source to produce an unambiguous
.csv file.


Conclusion #2, at a more detailed technical level:

It appears are three layers:
— The character-encoding layer.
— The strings layer.
— The semantic layer.

To my way of thinking, CSV applies to the middle layer.  It has nothing to
say about the lower or higher layers, which is where all the problems are.

I use utf-8 for the character encoding, so that's not a problem, either.
AFAICT all the tricky issues, including locale issues, are above the CSV
layer.

CSV transports strings.  Above the CSV layer, at the the semantic layer,
gnumeric says we should interpret some strings as strings, some as numbers,
some as dates, some as formulas (!), et cetera.

3. Dates in 2020-Oct-11 or 2020-10-11 format.

Desired behavior: We agree that 2020-Oct-11 is highly desirable as an
input data format.  I would dearly love to use this format.

Observed behavior: I observe that it is *NOT* accepted as a date.  I tried
it using the following .csv file:
2020-Oct-11,,=0+A1
2020-10-31,,=0+A2

Should I report this as a bug?

Gnumeric looks at entire columns in order to guess what date format is
being used.

Wow, I never would have guessed that.  This really needs to be documented.

2. An initial "=" can be used to force interpretation as a formula.

This sounds treeeemendously valuable.

Alas it seems to be exceedingly unreliable.
I tried the following .csv file which behaved as expected:
0000
"=""0000"""

Then I tried a seemingly similar file which created a total mess:
0000,"=""0000"""

Should I report this as a bug?




[Date Prev][Date Next]   [Thread Prev][Thread Next]   [Thread Index] [Date Index] [Author Index]