Re: More than you ever wanted to know about csv files (Re: to csv or not to csv)



We have been repeatedly told to implement what we want using the
configurable text importer.  Here's what I want; how do I implement it?

*** At the character-encoding layer:
 1) The encoding is UTF-8.  A leading BOM, if present, is discarded.
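
For concreteness, here is roughly what rule 1 amounts to in Python. This is only a sketch: the function name decode_layer is made up, and rejecting malformed bytes outright is my assumption (the rule above only covers the encoding and the BOM).

```python
def decode_layer(data: bytes) -> str:
    """Turn the raw bytes of a file into text, per rule 1."""
    # The "utf-8-sig" codec decodes UTF-8 and drops one leading BOM
    # if there is one.  errors="strict" (an assumption, not part of
    # rule 1) raises UnicodeDecodeError on malformed input.
    return data.decode("utf-8-sig", errors="strict")
```

The same bytes with or without the BOM come out as the same text, which is the whole point of the rule.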

*** At the syntactic layer (the CSV layer):
   *) Small departures from RFC 4180 are marked [!].
 2.1) The record terminator is ([\r]\n) [!], i.e. LF optionally preceded by CR.
      The last record may be terminated by EOF instead, provided it has
      nonzero length; otherwise there is no record there at all.
 2.2) The field separator is comma.  You know, as in comma-separated values.
 2.3) The last field in the record does not have a trailing comma.
 2.4) Any field containing a comma, quote, or newline must be quoted.
      The field as a whole gets quoted.
      Otherwise quoting is optional.
      Quoting does not change the meaning of the field.
 2.5) The quote character is (").
 2.6) A quoted quote must be doubled.
      "He said ""duck"" but it was too late."
 2.7) Spaces are part of the field and are not discarded.
 2.8) The number of fields may [!] vary from record to record.
 2.9) At this layer, violations throw an error.
      For example: A stray quote (foo"bar) is illegal.
   *) The output of this layer is called a "raw string".
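
To show that rules 2.1 through 2.9 are easy to state precisely, here is a small strict parser in Python. It is a sketch, not anybody's actual importer: the names parse_csv and CsvSyntaxError are invented, and two choices are my own assumptions because the rules above do not settle them: a bare \r outside quotes is rejected, and an empty line in the middle of the file is one record holding one zero-length field.

```python
class CsvSyntaxError(ValueError):
    """Raised on any violation at the CSV layer (rule 2.9)."""


def parse_csv(text: str) -> list[list[str]]:
    """Split decoded text into records of raw strings."""
    records, record = [], []
    i, n = 0, len(text)
    while i < n:
        if text[i] == '"':                       # quoted field (2.4, 2.5)
            i += 1
            chars = []
            while True:
                if i >= n:
                    raise CsvSyntaxError("unterminated quoted field")
                if text[i] == '"':
                    if i + 1 < n and text[i + 1] == '"':
                        chars.append('"')        # doubled quote (2.6)
                        i += 2
                    else:
                        i += 1                   # closing quote
                        break
                else:
                    chars.append(text[i])        # commas, newlines, etc.
                    i += 1
            field = "".join(chars)
        else:                                    # unquoted field
            start = i
            while i < n and text[i] not in ',"\r\n':
                i += 1
            field = text[start:i]                # spaces kept (2.7)
        record.append(field)

        if i >= n:                               # EOF ends the record (2.1)
            records.append(record)
            break
        if text[i] == ",":                       # field separator (2.2)
            i += 1
            if i >= n:                           # comma then EOF: empty last field
                record.append("")
                records.append(record)
            continue
        if text.startswith("\r\n", i):           # record terminator (2.1)
            i += 2
        elif text[i] == "\n":
            i += 1
        else:                                    # stray quote, bare \r, ...
            raise CsvSyntaxError(f"unexpected {text[i]!r} at offset {i}")
        records.append(record)                   # any field count is fine (2.8)
        record = []
    return records
```

With this, parse_csv('foo"bar') raises CsvSyntaxError, while ' x , y ' comes back as one record whose two fields keep their spaces.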

*** At the semantic layer, in the context of gnumeric:
   *) Except for rule 3.3, which is stricter, the raw string gets interpreted
      as if it were typed into the gnumeric formula bar. In particular:
 3.1) An initial single quote marks the field as a gnumeric string.
 3.2) An initial equal-sign marks the field as a formula.
 3.3) If it looks like an unambiguous [!] date, it gets treated as a date.
      Examples include
          2020-oct-1
          2020-10-01
          2020-10-01T11:16:15Z
          1/oct/2020
          oct/01/2020
      The year must [!] be four digits.
      If the year comes last, the month must [!] be alphabetic.
 3.4) If it looks like a number it's a number.
      The decimal point is period.
      Grouping with commas is optional (but see 2.4).
      Leading and/or trailing spaces are optional for numbers.
      Leading + sign is optional.
 3.5) Anything else is interpreted as a gnumeric string.
      This includes bogus dates such as 10-11-12.
      This includes bogus numbers such as (123 456).
      This includes zero-length strings.
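
Rules 3.1 through 3.5 can be written down as a classifier over raw strings. The Python below is a sketch of the behavior I am asking for, not gnumeric's code. Assumptions of mine: month names are three-letter English abbreviations, the date separator is - or / used consistently, the leading single quote is dropped from the resulting string, formulas are returned unevaluated, and scientific notation is left out. The names classify and MONTHS are invented.

```python
import re
from datetime import date, datetime, timezone

MONTHS = {m: i for i, m in enumerate(
    "jan feb mar apr may jun jul aug sep oct nov dec".split(), start=1)}

# Year first: the month may be numeric or alphabetic (rule 3.3).
_YEAR_FIRST = re.compile(
    r"(\d{4})([-/])([A-Za-z]{3}|\d{1,2})\2(\d{1,2})"
    r"(?:T(\d{2}):(\d{2}):(\d{2})Z)?")
# Year last: the month must be alphabetic, the year four digits.
_DAY_MON_YEAR = re.compile(r"(\d{1,2})([-/])([A-Za-z]{3})\2(\d{4})")
_MON_DAY_YEAR = re.compile(r"([A-Za-z]{3})([-/])(\d{1,2})\2(\d{4})")
# Optional sign, optional comma grouping, period as decimal point (3.4).
_NUMBER = re.compile(r"[+-]?(?:(?:\d{1,3}(?:,\d{3})+|\d+)(?:\.\d*)?|\.\d+)")


def _month(token):
    return int(token) if token.isdigit() else MONTHS.get(token.lower())


def _parse_date(raw):
    hms = (None, None, None)
    if m := _YEAR_FIRST.fullmatch(raw):
        y, mon, d = int(m[1]), _month(m[3]), int(m[4])
        hms = m.group(5, 6, 7)
    elif m := _DAY_MON_YEAR.fullmatch(raw):
        d, mon, y = int(m[1]), _month(m[3]), int(m[4])
    elif m := _MON_DAY_YEAR.fullmatch(raw):
        mon, d, y = _month(m[1]), int(m[3]), int(m[4])
    else:
        return None
    if mon is None:
        return None                      # unknown month name
    try:
        if hms[0] is not None:
            return datetime(y, mon, d, *map(int, hms), tzinfo=timezone.utc)
        return date(y, mon, d)
    except ValueError:                   # e.g. 2020-02-31
        return None


def classify(raw):
    """Map one raw string to a (kind, value) pair."""
    if raw.startswith("'"):              # 3.1
        return ("string", raw[1:])
    if raw.startswith("="):              # 3.2
        return ("formula", raw)
    when = _parse_date(raw)              # 3.3
    if when is not None:
        return ("date", when)
    stripped = raw.strip(" ")            # 3.4: surrounding spaces allowed
    if _NUMBER.fullmatch(stripped):
        return ("number", float(stripped.replace(",", "")))
    return ("string", raw)               # 3.5
```

Note that classify looks at one field and nothing else, so 10-11-12 and 123 456 fall through to strings no matter what the neighboring fields contain.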

*** In general, at all layers:
 4.1) All fields and all records play by the same rules.
      This means header rows and header columns, if any, are not special.
      Each field is independent of all others. Context is not examined.
      Heuristics are not employed.

