Re: Comparing 2 Columns

From: Tim Chase <gnumeric tim thechases com>
To: bill d <billd post com>
Cc: gnumeric-list gnome org
Subject: Re: Comparing 2 Columns
Date: Wed, 19 Aug 2020 08:25:40 -0500

On 2020-08-19 07:57, bill d wrote:

They are text. Specifically, 1 column contains names of moth
species known from my state. The second column has names of moth
species known from an adjoining state. Most will be present in both
states, which doesn't interest me. Some will be listed in one state
or the other, but not both. This is the data i am trying to
isolate. While i can do this manually, we are talking some 1,900
and 1,700 names. Further, this will surely come up again in future
so I'd like a quicker solution. Many thanks for any suggestions!


I usually use command-line tools on *nix for this.  If you have one
file of moth-names in each state:

  $ cat local.txt
  george
  jane
  elroy
  judy

  $ cat adjacent.txt
  david
  judy
  jane
  michael

To find the ones where you only have one instance, you might do

  $ sort local.txt adjacent.txt | uniq -c | awk '$1 == 1{print $2}'
  david
  elroy
  george
  michael

This counts the unique instances of each moth name and then only
prints where the count is 1 ("$1 == 1") meaning the moth-name only
appeared once between the two files.

This falls over if either file can have the same moth more than once.
For that, I'd use a bit of pure awk (which should be available on any
Unix-like, whether Linux, a BSD, or MacOS X)

 $ awk '{++a[$1];if (NR == FNR)++loc[$1]; else ++adj[$1]} END {for (n in a) if (!(n in loc && n in adj)) 
print n}' local.txt adjacent.txt 

That puts all the names in the "a" array, the local ones in the "loc"
array, and the adjacent-state's in the "adj" array, then after
loading them all, goes through all of them to ask which isn't in
both.  If you want to run it regularly, you can put it in a script:

  $ cat > disjoint.awk
  #!/usr/bin/awk -f
  {
    ++a[$1]
    if (NR == FNR)
      ++loc[$1]
    else
      ++adj[$1]
  }
  END {
    for (n in a)
      if (!(n in loc && n in adj))
        print n
  }

then make it executible:

  $ chmod +x disjoint.awk

and run it as needed:

  $ ./disjoint.awk local.txt adjacent.txt

You can tweak that last condition for whatever stats you want
to run, so if you just want local moths that aren't in the adjacent
state

  if (n in loc && ! n in adj)

or just adjacent moths that aren't local

  if (n in adj && ! n in loc)

The nice thing about this is that if you keep your historical data in
different files, you don't have to recreate the logic:

  $ ./disjoint.awk local2019.txt adjacent2019.txt
  $ ./disjoint.awk local2020.txt adjacent2020.txt

Sorry this is a Gnumeric list and I'm providing command-line-utility
solutions, but for doing repeatable tasks with varying input, I find
it easier to script the solution that I can reuse, rather than hoping
I got my expressions right in a spreadsheet. :-)

-tim

References:
- Re: Re: Comparing 2 Columns
  - From: bill d

[Date Prev][Date Next] [Thread Prev][Thread Next] [Thread Index] [Date Index] [Author Index]