Re: [Tracker] Extracting Embedded Licenses

From: jamie <jamiemcc blueyonder co uk>
To: Jason Kivlighn <jkivlighn gmail com>
Cc: CC Developer Mailing List <cc-devel lists ibiblio org>, tracker-list gnome org
Subject: Re: [Tracker] Extracting Embedded Licenses
Date: Sat, 30 Jun 2007 19:21:14 +0100

On Mon, 2007-06-18 at 12:49 -0700, Jason Kivlighn wrote:

Hi,

imagemagick: Uses 'convert filename xmp:-' to output an image's embedded
XMP.  This works for at least JPEG and TIFF files.  For JPEGs, however,
ImageMagick outputs the namespace and XMP, separated by \0.  I'm not
sure how I can handle this, without simply assuming that 'convert'
returned two null-terminated strings.  Nevertheless, this extracts the
XMP from TIFF files.


That's a bit crappy of ImageMagick - it's too dangerous to look past the
\0, as we don't know the length of the entire string that's returned.

Could you use libexif for JPEGs here?
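
On the \0 problem above: the danger comes from treating the output as a
C string.  If the pipe is read into a buffer whose byte count is known,
splitting at the first \0 is safe.  A minimal sketch in Python, assuming
the JPEG output really is "namespace \0 XMP" as described; the function
names are made up for illustration, and only the 'convert filename xmp:-'
command comes from this thread:

```python
import subprocess


def split_namespace_and_xmp(raw: bytes):
    """Split length-counted output into (namespace, xmp).

    namespace is None when the output holds no embedded NUL separator,
    which is the TIFF case described in the thread.
    """
    # Drop trailing NUL terminators first so they are not taken for the separator.
    raw = raw.rstrip(b"\0")
    namespace, sep, xmp = raw.partition(b"\0")
    if not sep:
        # No separator: the whole buffer is the XMP packet.
        return None, raw.decode("utf-8", errors="replace")
    return (namespace.decode("utf-8", errors="replace"),
            xmp.decode("utf-8", errors="replace"))


def read_xmp_with_convert(filename: str):
    """Run 'convert filename xmp:-' and split its stdout (hypothetical helper)."""
    # stdout comes back as a bytes object, so its full length is known.
    result = subprocess.run(["convert", filename, "xmp:-"],
                            capture_output=True, check=True)
    return split_namespace_and_xmp(result.stdout)
```

The same idea carries over to C as long as the read call reports how
many bytes it returned, rather than relying on strlen().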


msoffice: Extends the msoffice extractor to also parse the
DocumentSummaryInformation infile, which contains user-defined metadata,
along with license metadata embedded by the MSOffice Creative Commons Add-in.

pdf: Extends the pdf extractor to read a PDF's metadata stream and parse
it as XMP.  I'm still awaiting poppler extending the glib bindings to
allow reading the metadata stream.  Until then, it will simply never
find the metadata stream and go on without error.

png: Adds a check for the XML:com.adobe.xmp iTXt field, and parses it as
XMP.
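
For illustration only, the iTXt check might look roughly like this.  It
is a sketch in Python rather than the patch's C code, it walks the PNG
chunks by hand instead of using libpng, and it assumes the keyword is
"XML:com.adobe.xmp" as the XMP specification gives it.  The function
names are hypothetical.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
XMP_KEYWORD = b"XML:com.adobe.xmp"


def iter_chunks(data: bytes):
    """Yield (type, body) for each chunk of a PNG held in memory."""
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    pos = len(PNG_SIGNATURE)
    while pos + 8 <= len(data):
        # Each chunk: 4-byte big-endian length, 4-byte type, body, 4-byte CRC.
        length, ctype = struct.unpack(">I4s", data[pos:pos + 8])
        body = data[pos + 8:pos + 8 + length]
        if len(body) < length:
            break  # truncated file, stop quietly
        yield ctype, body
        pos += 12 + length


def extract_xmp(data: bytes):
    """Return the XMP packet stored in an iTXt chunk, or None if absent."""
    for ctype, body in iter_chunks(data):
        if ctype != b"iTXt":
            continue
        keyword, sep, rest = body.partition(b"\0")
        if not sep or keyword != XMP_KEYWORD or len(rest) < 2:
            continue
        compressed = rest[0] == 1  # rest[1] is the compression method
        # Remaining fields: language tag, translated keyword, text.
        fields = rest[2:].split(b"\0", 2)
        if len(fields) != 3:
            continue
        text = zlib.decompress(fields[2]) if compressed else fields[2]
        return text.decode("utf-8")
    return None
```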

html: Adds a new html parser using libxml2.  Parses the document,
checking for RDFa licenses.  It also checks for other basic HTML
properties like title and author.
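
As a rough picture of the RDFa check described for the html extractor:
the sketch below uses Python's standard html.parser module, not libxml2
as in the patch, and assumes a license is marked by a rel attribute
containing "license" (optionally prefixed, e.g. "cc:license") on an
<a> or <link> element.  The class name is hypothetical.

```python
from html.parser import HTMLParser


class LicenseFinder(HTMLParser):
    """Collect license links and the document title from an HTML page."""

    def __init__(self):
        super().__init__()
        self.licenses = []
        self.title = None
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # rel may hold several space-separated tokens.
        rel_tokens = (attrs.get("rel") or "").lower().split()
        is_license = any(t == "license" or t.endswith(":license")
                         for t in rel_tokens)
        if tag in ("a", "link") and is_license and attrs.get("href"):
            self.licenses.append(attrs["href"])
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Accumulate text only while inside <title>.
        if self._in_title:
            self.title = (self.title or "") + data
```

Usage is to create a LicenseFinder, call feed() with the page text, and
then read the licenses and title attributes.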

There are also several XML formats I'd like to parse for license data,
particularly SVG and SMIL.  Would this be do-able, and if so, how should
I go about it?  Write new extractors for each format, or is this too much
overhead?  These could use GMarkupParser, rather than bringing in libxml2
like the HTML parser.



These look OK - will apply as soon as the other stuff is tidied up (need
tracker_read_xmp before I can apply these).

Is the reading of XMP limited to license info, or can we extract
additional stuff? I believe the XMP format defines expected metadata for
images, so it would be nice if we could fully exploit it.
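
To make the question concrete: because XMP is RDF/XML, a reader does not
have to stop at the license.  The sketch below (Python, standard library
only, and not how tracker_read_xmp works) collects every property found
on the rdf:Description elements, keyed by its namespace-qualified name.
It handles only the common simple shapes: attribute shorthand,
rdf:resource links, plain text, and rdf:Alt/Seq/Bag lists.  The function
name is hypothetical.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def xmp_properties(xmp_text: str) -> dict:
    """Return {'{namespace}name': value} for the properties in an XMP packet."""
    root = ET.fromstring(xmp_text)
    props = {}
    for desc in root.iter(f"{{{RDF}}}Description"):
        # Properties written as attributes (skip rdf:about and friends).
        for name, value in desc.attrib.items():
            if not name.startswith(f"{{{RDF}}}"):
                props[name] = value
        # Properties written as child elements.
        for child in desc:
            resource = child.get(f"{{{RDF}}}resource")
            items = [li.text or "" for li in child.iter(f"{{{RDF}}}li")]
            if resource is not None:
                props[child.tag] = resource
            elif items:
                props[child.tag] = items
            else:
                props[child.tag] = (child.text or "").strip()
    return props
```

An extractor could then map whichever of these keys it recognises onto
its own metadata fields and ignore the rest.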

If you are using libxml for HTML metadata then I suppose it's OK to use
it for the others.

I have been thinking of splitting tracker extract into several
independent executables to prevent slowdown from linking so many libs.


jamie

References:
- [Tracker] Extracting Embedded Licenses
  - From: Jason Kivlighn
