GLib file magics



hi all,

some while back, i was looking for a solution to maintain
soundfile->loader_module lookups in BSE.
i decided to go for magic matches, like the gimp had done
before, as introduced with system V as file(1) and magic(5).

for that, i took a long look at the gimp code and while
it covers most of what's needed for an image loader, its
code has issues with endianness and implements some
extensions of which some obscure ones are still unused.
also its code is inherently tied to the PDB and other
code portions, so for BSE i went for a reimplementation.

gimp could benefit from the new version, and after talking
to hp and jrb, it's probably also suitable for pixbuf
loader matches, so i'm herewith proposing the addition
of magic match facilities to glib.
for those that want to look at the current code:
in cvs module beast, bse/bsemagic.[hc] cover the implementation
and bsemagictest.c gets compiled into a small test program
./bsemagic that works like file(1) on its command line args.

for GLib we'll need something a bit more generic than the
BSE code, so i imagine the new API as:

struct _GMagic
{
  gpointer data;
  GQuark   qextension;

  /*< private >*/
  gint     priority;
  gpointer match_list;
};

GMagic*       g_magic_create                  (gpointer     data,
                                               gint         priority,
                                               GQuark       qextension,
                                               const gchar *magic_spec);
void          g_magic_free                    (GMagic      *magic);
GMagic*       g_magic_list_match_file         (GSList      *magic_list,
                                               const gchar *file_name);
GSList*	      g_magic_list_sort               (GSList      *magic_list);

since pixbufs will probably be constructable from selection data
in the future, we might also want a variant:

GMagic*       g_magic_list_match_data         (GSList       *magic_list,
                                               guint         length,
                                               const guint8 *data);

GQuark qextension is the file extension, e.g. ".gif", ".wav" or ".out",
it's passed in as a GQuark because internally i'm using integer
comparisons for faster matches. some might find that it's a bad idea
to expose that in the API, if so, we can change that to const
gchar *extension there.
i'm pretty indifferent on that issue, since once a string
is used as a quark somewhere, its char[]<->GQuark association is
invariant and therefore they are losslessly interchangeable.

const gchar *magic_spec, is a magic specification, featuring a
good subset of the magic(5) specification, i.e. GMagic

- supports byte, short, leshort, beshort, long, lelong, belong
  and string match types
- supports <type>&NUMBER syntax for ANDing the file's contents with
  NUMBER, before any comparisons are attempted. NUMBER can be decimal,
  octal (preceding '0'), or hexadecimal (preceding "0x")
- supports u<type> to make comparisons of numeric types unsigned
- supports gimp's "size" extension, a numeric type to match the file size

the overall syntax of magic matches is:

DELIM		::= ' ' | '\t' | ','
MAGIC_LINE	::= OFFSET DELIM TYPE DELIM TEST

where OFFSET can be a decimal, hexadecimal, or octal number,
used as an offset into the file's data.
TYPE is any of the above subset, and TEST is the value the file's
content at offset is to be compared against. this can be:

- a number (decimal, hexadecimal or octal), optionally
  prefixed by
  =	for is equal to checks (default if no prefix is given)
  >	for file value is greater than checks
  <	for file value is smaller than checks
  &	to check for 1 bits in the file
  ^	to check for 0 bits in the file
  x	to always succeed
- a string, including C language style escapes ('\\', '\t', '\n', '\r',
  '\b', '\f', '\s', '\e', '\000'), optionally prefixed by '=', '>' or
  '<' with the same meaning as for the number tests
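
as an illustration of the grammar, a rough sketch of parsing one
MAGIC_LINE with a numeric TEST and evaluating it. this is not the
bsemagic.c parser; MagicLine, magic_line_parse() and
magic_test_number() are hypothetical names. accepting runs of
delimiters, comparing unsigned only, and the exact reading of '&' (all
given bits set) and '^' (not all given bits set, following the magic(5)
wording) are assumptions:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
  unsigned long offset;
  char          type[16];
  char          op;        /* '=', '>', '<', '&', '^' or 'x' */
  unsigned long number;    /* numeric TEST operand, unused for 'x' */
} MagicLine;                /* hypothetical, for illustration only */

static int
is_delim (char c)
{
  return c == ' ' || c == '\t' || c == ',';
}

/* Parse "OFFSET DELIM TYPE DELIM TEST"; returns 1 on success, 0 on error. */
static int
magic_line_parse (const char *line, MagicLine *ml)
{
  const char *p = line;
  char *end;
  size_t n = 0;

  assert (line != NULL && ml != NULL);

  /* base 0 lets strtoul() pick decimal, octal ("0...") or hex ("0x...") */
  ml->offset = strtoul (p, &end, 0);
  if (end == p || !is_delim (*end))
    return 0;
  for (p = end; is_delim (*p); p++)
    ;

  /* TYPE runs up to the next delimiter */
  while (p[n] && !is_delim (p[n]))
    n++;
  if (n == 0 || n >= sizeof ml->type || !is_delim (p[n]))
    return 0;
  memcpy (ml->type, p, n);
  ml->type[n] = 0;
  for (p += n; is_delim (*p); p++)
    ;

  /* TEST: optional operator prefix, '=' if there is none */
  ml->op = '=';
  if (*p && strchr ("=><&^x", *p))
    ml->op = *p++;
  ml->number = 0;
  if (ml->op == 'x')
    return 1;
  ml->number = strtoul (p, &end, 0);
  return end != p;
}

/* Evaluate a numeric TEST against the value read from the file. */
static int
magic_test_number (char op, unsigned long file_value, unsigned long number)
{
  switch (op)
    {
    case '=': return file_value == number;
    case '>': return file_value > number;
    case '<': return file_value < number;
    case '&': return (file_value & number) == number;
    case '^': return (file_value & number) != number;
    case 'x': return 1;
    default:  return 0;
    }
}
```

for example, "0x04,ubelong,>010" parses to offset 4, type "ubelong",
operator '>' and the number 8.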


the magic(5) specification serves a slightly different purpose
than what we need for glib, it's intended to match files with
increasing messaging verbosity, which is why it works like this:

#offset	type	test	message
0	string	=FOO	This is a Foo file
>3	string	=BAR	with Bar frobnication

which results in:
"FOO"		->	This is a Foo file
"FOOBAR"	-> 	This is a Foo file with Bar frobnication

that's not exactly what we want, so GMagic skips the message
part, and depends on exact matches for successive lines.
thus, GMagic lists match like:

GSList *list = NULL;
const gchar *content;
GMagic *magic;

list = g_slist_prepend (list,
                        g_magic_create ("Foo file", 0, 0,
                                        "0,string,FOO"));
list = g_slist_prepend (list,
                        g_magic_create ("FooBar file", 0, 0,
                                        "0 string	=FOO\n"
                                        "# we only match FOOBAR files\n"
                                        "3 string	=BAR\n"));
content = "FOOBAR";
magic = g_magic_list_match_data (list, strlen (content),
                                 (const guint8*) content);
g_print ("%s: %s\n", content, (const gchar*) magic->data);
/* FOOBAR: FooBar file */

content = "FOO";
magic = g_magic_list_match_data (list, strlen (content),
                                 (const guint8*) content);
g_print ("%s: %s\n", content, (const gchar*) magic->data);
/* FOO: Foo file */


so multiple match lines per GMagic are all required to succeed
for a GMagic to get returned from g_magic_list_match_*().
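
the "all lines must succeed, first magic that does so wins" rule can be
sketched without glib as follows. StringTest, StringMagic and the two
functions are hypothetical stand-ins that only know string tests, they
are not the proposed API:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct {
  size_t      offset;
  const char *string;
} StringTest;               /* one match line, string type only */

typedef struct {
  const char       *data;   /* what the caller gets back */
  const StringTest *tests;
  size_t            n_tests;
} StringMagic;              /* hypothetical stand-in for GMagic */

/* A magic matches only if every one of its match lines succeeds. */
static int
string_magic_matches (const StringMagic *magic,
                      const char *buf, size_t length)
{
  size_t i;

  assert (magic != NULL && buf != NULL);
  for (i = 0; i < magic->n_tests; i++)
    {
      const StringTest *t = &magic->tests[i];
      size_t n = strlen (t->string);

      if (t->offset > length || length - t->offset < n ||
          memcmp (buf + t->offset, t->string, n) != 0)
        return 0;
    }
  return 1;
}

/* Walk the list in order and return the first magic that matches. */
static const StringMagic*
string_magic_list_match (const StringMagic *magics, size_t n_magics,
                         const char *buf, size_t length)
{
  size_t i;

  for (i = 0; i < n_magics; i++)
    if (string_magic_matches (&magics[i], buf, length))
      return &magics[i];
  return NULL;
}

/* same order as the prepended list above: FooBar first, then Foo */
static const StringTest foobar_tests[] = { { 0, "FOO" }, { 3, "BAR" } };
static const StringTest foo_tests[]    = { { 0, "FOO" } };
static const StringMagic example_magics[] = {
  { "FooBar file", foobar_tests, 2 },
  { "Foo file",    foo_tests,    1 },
};
```

matching "FOO" against example_magics fails the second line of the
FooBar magic (there is no data at offset 3) and falls through to the
Foo magic.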

the priority member of a GMagic is in place to support order of
matches in a list. e.g. for the above example, if g_slist_prepend is
replaced by g_slist_append, the result would be:

FOO: Foo file
FOO: Foo file

because "0 string =FOO" would be checked prior to
"0,string,=FOO\n3,string,BAR" and succeed, thus ending the match
attempts.
using priorities, e.g.

#define G_MAGIC_PRIORITY_HIGHEST  -500
#define G_MAGIC_PRIORITY_HIGH     -250
#define	G_MAGIC_PRIORITY_DEFAULT   0
#define	G_MAGIC_PRIORITY_LOW	   250
#define	G_MAGIC_PRIORITY_FALLBACK  500

matches can be attempted in order of priority, so giving
"0,string,=FOO\n3,string,BAR" a higher priority than
"0 string =FOO" and sorting the list with g_magic_list_sort(),
will always produce correct results.
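
a small sketch of what sorting by priority amounts to, using a plain
array and qsort() instead of a GSList. PrioMagic and prio_magic_sort()
are hypothetical names; note that qsort() is not guaranteed to be
stable, so the relative order of entries with equal priority is
unspecified here, whether g_magic_list_sort() should preserve it is an
open question:

```c
#include <assert.h>
#include <stdlib.h>

#define G_MAGIC_PRIORITY_HIGH     (-250)  /* values as proposed above */
#define G_MAGIC_PRIORITY_DEFAULT  0

typedef struct {
  const char *data;
  int         priority;   /* smaller value: checked earlier */
} PrioMagic;              /* hypothetical stand-in for GMagic */

static int
prio_magic_cmp (const void *a, const void *b)
{
  const PrioMagic *ma = a;
  const PrioMagic *mb = b;

  /* avoid "ma->priority - mb->priority", which can overflow */
  return (ma->priority > mb->priority) - (ma->priority < mb->priority);
}

static void
prio_magic_sort (PrioMagic *magics, size_t n_magics)
{
  assert (magics != NULL || n_magics == 0);
  qsort (magics, n_magics, sizeof magics[0], prio_magic_cmp);
}
```

with { "Foo file", G_MAGIC_PRIORITY_DEFAULT } appended before
{ "FooBar file", G_MAGIC_PRIORITY_HIGH }, sorting moves the FooBar
entry to the front, so it is tried first.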

another note on the file extensions. since file extensions
are not necessarily reliable, GQuark qextension is only
used to further affect match order, resulting in major
speedups for the common case. i.e. from a list of different
magics with different file extensions, the magics with
matching file extension are first checked in order of priority,
and only if all of them failed, the remaining magics with
non-matching file extensions are checked in order of priority.
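
the two-pass order described here could look roughly like this. the
sketch assumes the array is already sorted by priority, compares
extensions as strings rather than quarks, and reduces each magic to a
single string test at offset 0; ExtMagic and its functions are
hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct {
  const char *data;
  const char *extension;  /* e.g. ".wav" */
  const char *prefix;     /* single string test at offset 0 */
} ExtMagic;               /* hypothetical stand-in for GMagic */

static int
ext_magic_matches (const ExtMagic *magic, const char *buf, size_t length)
{
  size_t n = strlen (magic->prefix);

  return length >= n && memcmp (buf, magic->prefix, n) == 0;
}

/* Pass 0 checks only magics whose extension equals file_ext, pass 1
 * checks the remaining ones; both passes keep the array order, which
 * is assumed to be the priority order.
 */
static const ExtMagic*
ext_magic_list_match (const ExtMagic *magics, size_t n_magics,
                      const char *file_ext,
                      const char *buf, size_t length)
{
  int pass;
  size_t i;

  assert (magics != NULL && file_ext != NULL && buf != NULL);
  for (pass = 0; pass < 2; pass++)
    for (i = 0; i < n_magics; i++)
      {
        int same_ext = strcmp (magics[i].extension, file_ext) == 0;

        if (same_ext == (pass == 0) &&
            ext_magic_matches (&magics[i], buf, length))
          return &magics[i];
      }
  return NULL;
}

static const ExtMagic example_ext_magics[] = {
  { "RIFF data",      ".riff", "RIFF" },
  { "RIFF wave data", ".wav",  "RIFF" },
  { "GIF image data", ".gif",  "GIF8" },
};
```

a file named foo.wav that starts with "RIFF" hits the ".wav" entry in
the first pass, while a mislabeled foo.wav that starts with "GIF8" is
still recognized as a GIF in the second pass.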

on the implementation of g_magic_list_match_file(), internally
it uses some semi-clever heuristics of keeping around two
distinct file data buffers, with the second one being used
for offsets >768 (that value is adjustable of course).
that's more than sufficient for most matches (in my current
/usr/share/misc/magic, only 20 out of 904 file entries
require file contents beyond an offset of 768).
for that, an extra open/read/stat/close layer is used, that'll
also make the implementation of g_magic_list_match_data() a
no-brainer (for those concerned about bloat, we're talking about
~700 lines total code size for gmagic.c here).
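
to give an idea of the two-buffer scheme, here is a simplified sketch
on top of stdio. it is not the actual layer from bsemagic.c:
MagicReader, TAIL_SIZE and the refill policy for the second buffer are
assumptions made for illustration, only the 768 byte threshold comes
from the description above:

```c
#include <assert.h>
#include <stdio.h>

#define HEAD_SIZE 768   /* threshold mentioned above, adjustable */
#define TAIL_SIZE 64    /* hypothetical size of the second buffer */

typedef struct {
  FILE          *file;
  unsigned char  head[HEAD_SIZE];
  size_t         head_length;   /* bytes actually read into head */
  unsigned char  tail[TAIL_SIZE];
  long           tail_offset;   /* file offset of tail[0], -1 if empty */
  size_t         tail_length;
} MagicReader;                  /* hypothetical */

/* Read the start of the file once; most matches never need more. */
static void
magic_reader_init (MagicReader *r, FILE *file)
{
  assert (r != NULL && file != NULL);
  r->file = file;
  rewind (file);
  r->head_length = fread (r->head, 1, HEAD_SIZE, file);
  r->tail_offset = -1;
  r->tail_length = 0;
}

/* Return a pointer to n bytes at offset, or NULL if the file is too
 * short. Requests inside the head buffer cost no I/O; anything else
 * goes through the second buffer, which is refilled on demand.
 */
static const unsigned char*
magic_reader_peek (MagicReader *r, long offset, size_t n)
{
  assert (r != NULL && offset >= 0 && n <= TAIL_SIZE);

  if ((size_t) offset + n <= r->head_length)
    return r->head + offset;

  if (r->tail_offset < 0 || offset < r->tail_offset ||
      (size_t) (offset - r->tail_offset) + n > r->tail_length)
    {
      /* second buffer does not cover the request: seek and refill */
      if (fseek (r->file, offset, SEEK_SET) != 0)
        return NULL;
      r->tail_offset = offset;
      r->tail_length = fread (r->tail, 1, TAIL_SIZE, r->file);
    }
  if ((size_t) (offset - r->tail_offset) + n > r->tail_length)
    return NULL;
  return r->tail + (offset - r->tail_offset);
}
```

a match_data() variant would simply skip the reader and index the
caller's buffer directly, with the same bounds checks.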


what's left is maybe short example magics, a quote from
bsemagictest.c, to familiarize the inclined reader with
the syntax:

  static const gchar *magic_presets[][2] = {
    /* some test entries, order is important for some cases,
     * until we store priorities here as well
     */
    { "Berkeley DB 2.X Hash/Little Endian",     "12 lelong 0x061561", },
    { "MS-DOS executable (EXE)",                "0 string MZ", },
    { "ELF object file",                        "0 string \177ELF", },
    { "Bourne shell script text",               "0,string,#!/bin/sh", },
    { "Bourne shell script text",               "0,string,#!\\ /bin/sh", },
    { "Bourne shell script text",               "0,string,#!\\t/bin/sh", },
    { "GIF image data",                         "0,string,GIF8", },
    { "X window image dump (v7)",               ("# .xwd files\n"
                                                 "0x04,ubelong,0x0000007"), },
    { "RIFF (little-endian), WAVE audio",       ("0 string RIFF\n"
                                                 "8 string WAVE"), },
    { "RIFF (little-endian) data",              "0 string RIFF", },
  };

---
ciaoTJ




