More than you ever wanted to know about csv files (Re: to csv or not to csv)

From: Morten Welinder <mortenw gnome org>
To: Gnumeric Mailing List <gnumeric-list gnome org>
Subject: More than you ever wanted to know about csv files (Re: to csv or not to csv)
Date: Sun, 11 Oct 2020 16:02:33 -0400

TL;DR: Avoid csv files if you can.

This note described what Gnumeric does with csv files and why csv files are
a bad idea to begin with.

Note: this is about unattended import of csv files.  With the configurable
text importer, Gnumeric can be convinced to import just about any reasonable
text-in-columns file.

There are three phases to csv file import:

1. Character set resolution.
2. The syntactic level -- chopping the file into lines and lines into fields.
3. The semantic level -- making sense of fields.


Phase 1: Character Set Resolution

Most people aren't aware, nor do they need to be aware, of this level.
It governs
how the bytes in the file are turned into characters.  Most text you find on a
Linux system will be UTF-8 encoded -- including a very large fraction that is
just ASCII -- but Windows originated files will occasionally have
UTF-16 encoding,
either little or big endian, with or without a BOM marker.  The
Windows files will
use 2 bytes per character while UTF-8 files use a variable length, mostly 1 byte
per character for western text.

The reason you don't have to think about this is that detection is
mostly automatic
and that the automatic system rarely fails.  The only reason to
intervene would be
if you got ahold of some 1990s non-ASCII file with, typically, Eastern-European
text in it.

The Gnumeric solution here is to map everything to UTF-8, but for the
purposes here
you might as well think "ASCII".


Phase 2: The syntactic level

This level handles chopping a file into lines and the resulting lines
into fields.
This is the (only) level that RFC-4180 has an opinion on.

Let's start with lines.  RFC-4180 specifies that lines are terminated with \r\n.
In particular, if a file uses unix style \n only, then as far as RFC-4180 is
concerned the whole file is only a very long line.  Since \n files
occur often in
the real world, following RFC-4180 strictly is clearly not a viable option.

The Gnumeric solution is to accept \r\n, \n, and even the old Mac \r as line
terminators.  Moreover, the last line may omit the terminator.  The line part of
the syntactic level rarely causes problems.

The fields part of the syntactic level is where the wild west starts.  Here
are some samples to think of, with comments that aren't part of the files.

   1,2,3                 # Easy
   4,5,6

   100.22,222.34,-123    # Still easy
   444,,123.45

   "111,22",222,"22,33"  # Not too bad, typical in Europe
   "1,22",,"222,11"

   111,22;222;22,33      # Same as above other than terminators
   1,22;;222,11          # (And, yes, still called csv even though the separator
                         # isn't comma!)

   "Quote ""this""",text # Double quote in quoted text for embedded quote

   "Quote \"this\"",text # Escaped quote in quoted text (not handled
by Gnumeric)

   "Multi                # Embedded line breaks in text
   line
   field,text

   Foo,Bar"Baz,Bof       # Quote inside unquoted field

I don't recall if we have seen files with single quotes.  I certainly
wouldn't rule
it out offhand.

It should be fairly clear that if you get the quoting style and/or the field
separator wrong, then you are stuck with a pile of garbage.

The Gnumeric solution here is to assume the RFC-4180 quoting style,
and then look at
the text in the file, notably on a line starting with a quote if there
is any, and
make a guess and the field separator.  In practice it seems to work.

Let's assume we got it right, and move on.  Note, that we are now
leaving anything
that RFC-4180 can help us with.


Phase 3: The semantic level

This is the phase that takes a, roughly, rectangular array of text
bits and tries
to make sense of it.

VERY IMPORTANT NOTE: Gnumeric (and other spreadsheets) interpret data when they
are entered.  If we determine that something is a number, then it will be
stored as a number and it does not matter whether it was entered as "100",
"1e2", "00100" or "100.000".  We do not carry the string it came from around
(and depending on context such a string may not even exist in the first place).
This mean that we know the meaning of the data and that it does not change if,
say, the spreadsheet is later loaded somewhere in Europe where "100.000" could
be a hundred thousand.  On don't get me started on dates!

If this is not the behaviour you want, then you must arrange for that particular
piece of data to be interpreted as a string, not a number or anything
else.  This
is one of the major pitfalls of csv files and the place falls squarely
on the csv
format because it has no way of saying "that's a string!"

The use of quotes in csv files is *not* an indication that a piece of data is a
string.  Several reasons: (1) in decimal-comma locales most numbers have to be
quoted; (2) most csv producers quote everything; (3) I know of no csv producing
programs that actually try to use quotes for that purpose and I know of no csv
consuming programs that try to interpret files that way.

The Gnumeric solution here is to interpret the text as-if it was
entered in a cell.
This has both good and bad aspects:

1. A single initial quote can be used to interpret the rest as text.
   Occasionally useful; rarely a problem.

2. An initial "=" can be used to force interpretation as a formula.
   Occasionally useful; rarely a problem.

3. Numbers in all kinds of formats are understood.  ("100", "100.22",
"100,100.22",
   "$100.22", "1e+09", ...)
   See below for decimal-comma versus decimal-point.

4. Dates get interpreted.  ("21-Jan-2020", "2002-09-09", "Jan22", "1/2/22", ...)
   This is a source of problems, especially with short formats that do
not contain
   the year.  See below for m/d/y versus d/m/y considerations.

5. Anything that doesn't get interpreted otherwise becomes a string.


Gnumeric looks at entire columns in order to guess what date format is
being used.
If you have "1/2/2000" and later "31/3/2000" in the same column, then we deduce
that the format is d/m/y and the former gets interpreted at
2000-02-01.  If you have
instead had "3/31/2000" in the same column, you would get 2000-01-02.

Similarly, Gnumeric looks at whole columns in order to tell whether "100,200" is
a little more than one hundred thousand (i.e., the comma is a
thousands separator)
or a little more than one hundred (i.e., the comma is a decimal separator).

If a column's actual text doesn't resolve which format is being used,
the locale's
convention is used.  That can mean data corruption if your 1/2/2000
gets interpreted
the wrong way.  If the column is inconsistent,  you have my sympathy.


Conclusions:

You should read all of the above as "Gnumeric makes a guess as to what the data
in the csv file means".  The need to make a guess is the fault of the csv file
format.  Other programs have to make guesses too, and the results may be
different from Gnumeric's.

There are certainly corner cases where one could argue that Gnumeric makes the
wrong guess.  There are also files for which two different reasonable guesses
are possible, i.e., no matter what guess you make you are wrong!

When Gnumeric interprets csv or other text import the wrong way (i.e.,
differently from what you want), it is generally a mistake to try to fix the
data in the sheet using search-and- replace or hand editing.  The right way is
to invoke the configurable gui text importer and set the right format.
A possible exception to this is US postal codes ("zip codes") for which you
can probably get away with using a format of "00000" to restore.

I always advise people not to use csv files if possible.  When that is not
possible, my advice is to use as simple csv files as possible.  That means:

1. RFC-4180 syntax.  Avoid multi-line fields.
2. Simple numbers only, no thousands separator and no currency.
3. Dates in 2020-Oct-11 or 2020-10-11 format.
4. No strings that can be interpreted as numbers -- watch out for the gene name
   "MARCH1", zip code "00142", and "1e2".
5. Don't start strings with an equal sign or a single quote.

(4) is tricky because it is vague.

If you can't do this, you're on thin ice and should identify the problem
correctly -- csv is deficient -- and move to another format.

If you have placed "MARCH1" in a csv file and the information "it's a gene
name" elsewhere (in your head or some other file) then you really should not
blame Excel, Gnumeric, or anyone else when your data gets mangled.

Morten

Follow-Ups:
- Re: More than you ever wanted to know about csv files (Re: to csv or not to csv)
  - From: Britton Kerin
- Re: More than you ever wanted to know about csv files (Re: to csv or not to csv)
  - From: John Denker
- Re: More than you ever wanted to know about csv files (Re: to csv or not to csv)
  - From: John Denker
- Re: More than you ever wanted to know about csv files (Re: to csv or not to csv)
  - From: William Park
- Re: More than you ever wanted to know about csv files (Re: to csv or not to csv)
  - From: Morten Welinder

References:
- to csv or not to csv
  - From: John Denker

[Date Prev][Date Next] [Thread Prev][Thread Next] [Thread Index] [Date Index] [Author Index]