Re: Comparing 2 Columns

On 2020-08-19 07:57, bill d wrote:
They are text. Specifically, 1 column contains names of moth
species known from my state. The second column has names of moth
species known from an adjoining state. Most will be present in both
states, which doesn't interest me. Some will be listed in one state
or the other, but not both. This is the data i am trying to
isolate. While i can do this manually, we are talking some 1,900
and 1,700 names. Further, this will surely come up again in future
so I'd like a quicker solution. Many thanks for any suggestions!

I usually use command-line tools on *nix for this.  If you have one
file of moth-names in each state:

  $ cat local.txt

  $ cat adjacent.txt

To find the ones where you only have one instance, you might do

  $ sort local.txt adjacent.txt | uniq -c | awk '$1 == 1{print $2}'

This counts the unique instances of each moth name and then only
prints where the count is 1 ("$1 == 1") meaning the moth-name only
appeared once between the two files.

This falls over if either file can have the same moth more than once.
For that, I'd use a bit of pure awk (which should be available on any
Unix-like, whether Linux, a BSD, or MacOS X)

 $ awk '{++a[$1];if (NR == FNR)++loc[$1]; else ++adj[$1]} END {for (n in a) if (!(n in loc && n in adj)) 
print n}' local.txt adjacent.txt 

That puts all the names in the "a" array, the local ones in the "loc"
array, and the adjacent-state's in the "adj" array, then after
loading them all, goes through all of them to ask which isn't in
both.  If you want to run it regularly, you can put it in a script:

  $ cat > disjoint.awk
  #!/usr/bin/awk -f
    if (NR == FNR)
  END {
    for (n in a)
      if (!(n in loc && n in adj))
        print n

then make it executible:

  $ chmod +x disjoint.awk

and run it as needed:

  $ ./disjoint.awk local.txt adjacent.txt

You can tweak that last condition for whatever stats you want
to run, so if you just want local moths that aren't in the adjacent

  if (n in loc && ! n in adj)

or just adjacent moths that aren't local

  if (n in adj && ! n in loc)

The nice thing about this is that if you keep your historical data in
different files, you don't have to recreate the logic:

  $ ./disjoint.awk local2019.txt adjacent2019.txt
  $ ./disjoint.awk local2020.txt adjacent2020.txt

Sorry this is a Gnumeric list and I'm providing command-line-utility
solutions, but for doing repeatable tasks with varying input, I find
it easier to script the solution that I can reuse, rather than hoping
I got my expressions right in a spreadsheet. :-)


[Date Prev][Date Next]   [Thread Prev][Thread Next]   [Thread Index] [Date Index] [Author Index]