Re: [Tracker] Extracting Embedded Licenses



On Mon, 2007-06-18 at 13:20 -0700, Jason Kivlighn wrote:
Whoops, I forgot the intro on this.

This is my progress thus far with extracting licenses from various
formats.  Jamie, I'm curious about your thoughts on adding new extractors.
Besides the ones mentioned below, GIF is another I have in mind, though
I'm not sure whether it's worthwhile.  I don't want to be
adding bloat...

Cheers,
Jason

Jamie, please drop us a line to discuss this project. Did you get the
chat time invite?

jon

Jason Kivlighn wrote:
Hi,

imagemagick: Uses 'convert filename xmp:-' to output an image's embedded
XMP.  This works for at least JPEG and TIFF files.  For JPEGs, however,
ImageMagick outputs the namespace and the XMP, separated by \0.  I'm not
sure how I can handle this without simply assuming that 'convert'
returned two null-terminated strings.  Nevertheless, this extracts the
XMP from TIFF files.
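
One possible way around the two-string assumption, sketched in Python
rather than the extractor's own code: split the output on \0 and keep
whichever piece looks like XMP.  The function names and the marker list
are illustrative assumptions, not part of the patch.

```python
import subprocess

# Byte sequences that suggest a piece of output is an XMP packet.
# This list is an assumption chosen for illustration.
XMP_MARKERS = (b"<?xpacket", b"<x:xmpmeta", b"<rdf:RDF")


def split_xmp_output(raw: bytes) -> bytes:
    """Pick the XMP packet out of NUL-separated 'convert' output.

    Works whether the output is bare XMP (the TIFF case) or a
    namespace string followed by the XMP (the JPEG case).
    """
    for part in raw.split(b"\x00"):
        part = part.strip()
        if any(marker in part for marker in XMP_MARKERS):
            return part
    return b""


def read_xmp(path: str) -> bytes:
    """Run 'convert PATH xmp:-' and return the embedded XMP, if any."""
    result = subprocess.run(
        ["convert", path, "xmp:-"], capture_output=True, check=True
    )
    return split_xmp_output(result.stdout)
```

This does not care how many NUL-separated pieces come back, so it should
keep working if the JPEG output ever changes shape.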

msoffice: Extends the msoffice extractor to also parse the
DocumentSummaryInformation infile, which contains user-defined metadata,
along with license metadata embedded by the MSOffice Creative Commons Add-in.

pdf: Extends the pdf extractor to read a PDF's metadata stream and parse
it as XMP.  I'm still waiting for poppler to extend its glib bindings to
allow reading the metadata stream.  Until then, the extractor will simply
never find the metadata stream and will carry on without error.
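
As a stopgap illustration only (this is not what the patch does), XMP
packets can sometimes be found by scanning the raw file for the
<?xpacket ...?> wrapper.  The sketch below assumes the metadata stream is
stored unfiltered and in a single-byte-compatible encoding such as UTF-8;
a compressed or encrypted stream would not be found.  The function name
is hypothetical.

```python
import re

# Matches the body between an xpacket header and its trailer.
XPACKET_RE = re.compile(
    rb"<\?xpacket begin=.*?\?>(.*?)<\?xpacket end=.*?\?>", re.DOTALL
)


def scan_xmp_packets(data: bytes) -> list[bytes]:
    """Return the body of every XMP packet found in raw file bytes."""
    return [match.group(1).strip() for match in XPACKET_RE.finditer(data)]
```

Reading the stream through the PDF library remains the more reliable
route once the bindings expose it.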

png: Adds a check for the XML:com.adobe.xmp iTXt field, and parses it as
XMP.
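
For reference, a minimal Python sketch of that check, walking the PNG
chunks by hand.  It assumes the standard iTXt layout (keyword, compression
flag, compression method, language tag, translated keyword, text) and does
not verify chunk CRCs; the function name is made up for this example.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
XMP_KEYWORD = b"XML:com.adobe.xmp"


def extract_png_xmp(data: bytes):
    """Return the XMP text from a PNG's iTXt chunk, or None if absent."""
    if not data.startswith(PNG_SIGNATURE):
        return None
    pos = len(PNG_SIGNATURE)
    while pos + 8 <= len(data):
        # Each chunk: 4-byte length, 4-byte type, body, 4-byte CRC.
        length, ctype = struct.unpack(">I4s", data[pos:pos + 8])
        body = data[pos + 8:pos + 8 + length]
        pos += 12 + length
        if ctype == b"IEND":
            break
        if ctype != b"iTXt":
            continue
        keyword, _, rest = body.partition(b"\x00")
        if keyword != XMP_KEYWORD or len(rest) < 2:
            continue
        compressed = rest[0]
        # Skip the language tag and translated keyword (both NUL-terminated).
        fields = rest[2:].split(b"\x00", 2)
        if len(fields) != 3:
            continue
        text = fields[2]
        if compressed:
            text = zlib.decompress(text)
        return text.decode("utf-8")
    return None
```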

html: Adds a new html parser using libxml2.  Parses the document,
checking for RDFa licenses.  It also checks for other basic HTML
properties like title and author.
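
To illustrate the kind of markup being looked for, here is a rough Python
sketch using the standard library's html.parser instead of libxml2.  It
assumes the license is marked up as rel="license" on an <a> or <link>
element, and that the author is given in <meta name="author">; the class
and function names are hypothetical.

```python
from html.parser import HTMLParser


class LicenseFinder(HTMLParser):
    """Collects rel="license" links plus the title and author."""

    def __init__(self):
        super().__init__()
        self.licenses = []
        self.title = ""
        self.author = None
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = {name: (value or "") for name, value in attrs}
        rels = attrs.get("rel", "").lower().split()
        if tag in ("a", "link") and "license" in rels and attrs.get("href"):
            self.licenses.append(attrs["href"])
        elif tag == "meta" and attrs.get("name", "").lower() == "author":
            self.author = attrs.get("content")
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_html_metadata(html: str) -> dict:
    """Parse an HTML string and return licenses, title and author."""
    finder = LicenseFinder()
    finder.feed(html)
    return {
        "licenses": finder.licenses,
        "title": finder.title.strip(),
        "author": finder.author,
    }
```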

There are also several XML formats I'd like to parse for license data,
particularly SVG and SMIL.  Would this be doable, and if so, how should
I go about it?  Should I write a new extractor for each format, or is that
too much overhead?  These could use GMarkupParser, rather than bringing in
libxml2 like the HTML parser.
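
As a rough idea of how little format-specific code this might need, the
Python sketch below looks for an RDF license element anywhere in an XML
document, which would cover SVG files that embed a Creative Commons
<cc:license rdf:resource="..."/> block.  The two cc namespace URIs and
the function name are assumptions made for the example; SMIL may well
store its license differently.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
# Namespace URIs assumed for the Creative Commons vocabulary.
CC_NAMESPACES = (
    "http://creativecommons.org/ns#",
    "http://web.resource.org/cc/",
)


def find_xml_license(xml_text: str):
    """Return the first cc:license rdf:resource URI found, or None."""
    root = ET.fromstring(xml_text)
    for ns in CC_NAMESPACES:
        for elem in root.iter("{%s}license" % ns):
            resource = elem.get("{%s}resource" % RDF_NS)
            if resource:
                return resource
    return None
```

Since the lookup does not depend on the root element, one shared helper
along these lines could serve several XML formats instead of a separate
extractor for each.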

Cheers,
Jason

  


-- 
Jon Phillips

San Francisco, CA
USA PH 510.499.0894
jon rejon org
http://www.rejon.org

MSN, AIM, Yahoo Chat: kidproto
Jabber Chat: rejon gristle org
IRC: rejon irc freenode net



