Re: [Tracker] Extracting Embedded Licenses



Whoops, I forgot the intro on this.

This is my progress so far on extracting licenses from various
formats.  Jamie, I'm curious about your thoughts on adding new extractors.
Besides the ones mentioned below, GIF is another I have in mind, though
I'm not sure whether it's worthwhile.  I don't want to be adding bloat...

Cheers,
Jason

Jason Kivlighn wrote:
Hi,

imagemagick: Uses 'convert filename xmp:-' to output an image's embedded
XMP.  This works for at least JPEG and TIFF files.  For JPEGs, however,
ImageMagick outputs the namespace and the XMP, separated by \0.  I'm not
sure how to handle this other than by simply assuming that 'convert'
returned two null-terminated strings.  Nevertheless, this extracts the
XMP from TIFF files.
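
One possible way around the \0 separator, sketched here in Python rather
than in the extractor's own code: split the output on null bytes and keep
whichever piece looks like an XMP packet, instead of assuming exactly two
strings.  The function name and the markers it checks for are illustrative
assumptions, and it assumes the XMP is UTF-8 (UTF-16 XMP would itself
contain null bytes).

```python
from typing import Optional

# Markers that suggest a chunk is an XMP packet (an assumption for this sketch).
XMP_MARKERS = (b"<?xpacket", b"<x:xmpmeta", b"<rdf:RDF")


def extract_xmp_packet(raw: bytes) -> Optional[bytes]:
    """Return the first null-separated chunk of `raw` that looks like XMP.

    Works whether the output is bare XMP (as described for TIFF) or a
    namespace string followed by \\0 and the XMP (as described for JPEG).
    """
    for chunk in raw.split(b"\x00"):
        chunk = chunk.strip()
        if any(marker in chunk for marker in XMP_MARKERS):
            return chunk
    return None
```

With this approach the TIFF and JPEG cases go through the same code path,
and output without any recognisable XMP simply yields None.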

msoffice: Extends the msoffice extractor to also parse the
DocumentSummaryInformation infile, which contains user-defined metadata,
along with license metadata embedded by the MSOffice Creative Commons Add-in.

pdf: Extends the pdf extractor to read a PDF's metadata stream and parse
it as XMP.  I'm still waiting for poppler to extend its glib bindings to
allow reading the metadata stream.  Until then, the extractor will simply
never find the metadata stream and will carry on without error.

png: Adds a check for the XML:com.adobe.xmp iTXt chunk, and parses it as
XMP.
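
As a rough illustration of the PNG check (a Python sketch, not the
extractor's actual code), the chunk list can be walked by hand, looking for
an iTXt chunk whose keyword is the one the XMP specification defines for
PNG, XML:com.adobe.xmp.  The function name is hypothetical, and CRCs are
not verified here.

```python
import struct
import zlib
from typing import Optional

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
XMP_KEYWORD = b"XML:com.adobe.xmp"


def find_png_xmp(data: bytes) -> Optional[str]:
    """Return the XMP text stored in a PNG's iTXt chunk, or None."""
    if not data.startswith(PNG_SIGNATURE):
        return None
    pos = len(PNG_SIGNATURE)
    while pos + 8 <= len(data):
        # Each chunk: 4-byte big-endian length, 4-byte type, body, 4-byte CRC.
        length, ctype = struct.unpack(">I4s", data[pos:pos + 8])
        body = data[pos + 8:pos + 8 + length]
        pos += 12 + length
        if ctype == b"IEND":
            break
        if ctype != b"iTXt":
            continue
        keyword, sep, rest = body.partition(b"\x00")
        if not sep or keyword != XMP_KEYWORD or len(rest) < 2:
            continue
        compressed = rest[0]  # rest[1] is the compression method (0 = zlib)
        # Skip the language tag and translated keyword (both null-terminated).
        _lang, _, rest = rest[2:].partition(b"\x00")
        _translated, _, text = rest.partition(b"\x00")
        if compressed:
            text = zlib.decompress(text)
        return text.decode("utf-8")
    return None
```

The returned string can then be handed to the same XMP parsing used for the
other formats.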

html: Adds a new HTML parser using libxml2.  It parses the document,
checking for RDFa licenses, and also checks for other basic HTML
properties like title and author.
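
To give an idea of what the HTML check involves, here is a minimal sketch
using Python's standard html.parser in place of libxml2.  It assumes the
license is marked up as rel="license" on an <a> or <link> element, and that
the author comes from a <meta name="author"> tag; the class name is
hypothetical.

```python
from html.parser import HTMLParser


class LicenseHTMLParser(HTMLParser):
    """Collect rel="license" links plus the document title and author."""

    def __init__(self):
        super().__init__()
        self.licenses = []
        self.title = None
        self.author = None
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # rel may hold several space-separated values, e.g. "license nofollow".
        rel = (attrs.get("rel") or "").lower().split()
        if tag in ("a", "link") and "license" in rel and attrs.get("href"):
            self.licenses.append(attrs["href"])
        elif tag == "title":
            self._in_title = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "author":
            self.author = attrs.get("content")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title = (self.title or "") + data
```

Usage is just a matter of creating a LicenseHTMLParser, calling feed() with
the document text, and reading the licenses, title and author attributes.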

There are also several XML formats I'd like to parse for license data,
particularly SVG and SMIL.  Would this be do-able, and if so, how should
I go about it?  Should I write a new extractor for each format, or is that
too much overhead?  These could use GMarkupParse, rather than bringing in
libxml2 like the HTML parser.
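
For the XML formats, a single generic pass might be enough rather than one
extractor per format.  The sketch below uses Python's ElementTree instead of
GMarkupParse and assumes the license is embedded as an RDF
cc:license element with an rdf:resource attribute, under either the current
or the older Creative Commons namespace; the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
# Current CC namespace, plus the older one still found in some files.
CC_NAMESPACES = (
    "http://creativecommons.org/ns#",
    "http://web.resource.org/cc/",
)


def find_xml_licenses(xml_text):
    """Return license URIs from cc:license elements anywhere in the document."""
    root = ET.fromstring(xml_text)
    found = []
    for ns in CC_NAMESPACES:
        for elem in root.iter("{%s}license" % ns):
            uri = elem.get("{%s}resource" % RDF_NS)
            if uri and uri not in found:
                found.append(uri)
    return found
```

Because it only looks at namespaced element names, the same function would
apply to SVG, SMIL or any other XML container that carries the RDF block.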

Cheers,
Jason

  



