adding metadata to documents via web scraping



Hi,
I'm trying to add metadata to local files (films, in this case) by indexing
matching web content (here, text from the IMDb site).

As a start, I set up an external filter (an internal one would of course be
nicer, for adding specific properties such as director, actor, title, year,
rating, etc.):

<filter>
<mimetype>video/x-msvideo</mimetype>
<extension>.avi</extension>
<command>beagleFilterMovies.pl</command>
<arguments>%s</arguments>
</filter>

This external filter calls a Perl script that retrieves the matching web page
from a film site and returns its content as plain text. The filename, e.g.
"Indiana_Jones_4.avi", is used in a Google "I'm Feeling Lucky" query
(see script below).

Somehow I do not get any results back when I afterwards search in beagle for,
say, "harrison ford".
Any idea why that doesn't work? The script does get called, as I can see in my
test.log.

Is content indexing perhaps disabled for videos?

Cheers, d. baser


Perl script beagleFilterMovies.pl:

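#!/usr/bin/perl
use strict;
use warnings;

my $s = shift @ARGV or die "usage: beagleFilterMovies.pl FILE\n";

# log each call, to check that beagle really invokes the filter
if (open(my $log, '>>', 'beagle-test.log')) {
    print {$log} "beagle found file $s\n";
    close($log);
}

# clean filename to use in query
$s =~ s{.*/}{};               # drop the directory, in case %s is a full path
$s =~ s/\.avi$//i;
$s =~ s/[^a-zA-Z0-9-]/+/g;

# get html of film page (URL on a single line, quoted for the shell)
my $c = `lynx -source 'http://www.google.com/search?q=$s+site%3Awww.imdb.com%2Ftitle&btnI'`;

# strip html tags (/s lets "." match newlines inside script and style blocks)
$c =~ s/<script.*?>.*?<\/script>/ /gis;
$c =~ s/<style.*?>.*?<\/style>/ /gis;
$c =~ s/<[^>]+>/ /g;
$c =~ s/&[a-z#0-9]+;/ /gi;

print $c;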
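
To rule out the script itself, the filename cleaning and the tag stripping can
be checked offline, without beagle or network access. This is only a minimal
sketch along the lines of the script above: the helper names build_query and
strip_html, the example path and the sample HTML are all made up for
illustration and are not part of the filter.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: turn a file path into the query term,
# along the lines of the cleaning step in the filter script.
sub build_query {
    my ($path) = @_;
    (my $name = $path) =~ s{.*/}{};   # drop any directory part
    $name =~ s/\.avi$//i;             # drop the extension
    $name =~ s/[^a-zA-Z0-9-]/+/g;     # everything else becomes '+'
    return $name;
}

# Hypothetical helper: strip scripts, styles, tags and entities,
# along the lines of the stripping step in the filter script.
sub strip_html {
    my ($html) = @_;
    $html =~ s/<script.*?>.*?<\/script>/ /gis;
    $html =~ s/<style.*?>.*?<\/style>/ /gis;
    $html =~ s/<[^>]+>/ /g;
    $html =~ s/&[a-z#0-9]+;/ /gi;
    return $html;
}

# Made-up sample page, only for illustration.
my $sample = "<html><script>var x = 1;\n</script>"
           . "<b>Harrison&nbsp;Ford</b></html>";

print build_query("/home/user/films/Indiana_Jones_4.avi"), "\n";
print strip_html($sample), "\n";
```

If both steps print what I expect (a '+'-separated query term and plain text
containing the actor's name), the problem is more likely in how beagle handles
the filter output than in the script.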

