adding metadata to documents via web scraping
- From: "D Baser" <dblips gmail com>
- To: dashboard-hackers gnome org
- Subject: adding metadata to documents via web scraping
- Date: Thu, 29 May 2008 14:43:28 +0200
Hi,
I'm trying to add metadata to local files (in this case films) by indexing
related web content (in this case text from the IMDb site).
As a first step I set up an external filter (an internal one would of course
be nicer, since it could add specific properties such as director, actor,
title, year, rating, etc.):
<filter>
<mimetype>video/x-msvideo</mimetype>
<extension>.avi</extension>
<command>beagleFilterMovies.pl</command>
<arguments>%s</arguments>
</filter>
This external filter calls a Perl script that retrieves the matching web page
from a film site and returns its content as plain text. The filename, e.g.
"Indiana_Jones_4.avi", is used in a Google "I'm Feeling Lucky" query
(see script below).
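To make the filename-to-query step concrete, here is a minimal sketch of just that part. The helper names filename_to_query and lucky_url are made up for illustration and do not appear in the script itself, and the URL layout is assumed to be the one the script uses:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (illustration only): turn a film filename
# into the query part of the search URL.
sub filename_to_query {
    my ($name) = @_;
    $name =~ s/\.avi$//i;            # drop the extension
    $name =~ s/[^a-zA-Z0-9-]/+/g;    # any other character becomes '+'
    return $name;
}

# Hypothetical helper (illustration only): build the "I'm Feeling Lucky"
# URL, restricted to title pages on www.imdb.com.
sub lucky_url {
    my ($name) = @_;
    return 'http://www.google.com/search?q=' . filename_to_query($name)
         . '+site%3Awww.imdb.com%2Ftitle&btnI';
}

print lucky_url('Indiana_Jones_4.avi'), "\n";
# prints http://www.google.com/search?q=Indiana+Jones+4+site%3Awww.imdb.com%2Ftitle&btnI
```

Since only letters, digits, '-' and '+' survive the cleaning, the resulting query can be placed inside a single-quoted shell argument without further escaping.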
However, I do not get any results back when I afterwards search in beagle
for, say, "harrison ford".
Any idea why that doesn't work? The script does get called, as I can see in
my beagle-test.log.
Is content indexing perhaps disabled for videos?
Cheers, d. baser
Perl script beagleFilterMovies.pl:
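#!/usr/bin/perl
use strict;
use warnings;

my $s = $ARGV[0];

# log each call so we can check that beagle actually runs the filter
if (open(my $log, '>>', 'beagle-test.log')) {
    print $log "beagle found file $s\n";
    close($log);
}

# clean filename to use in query
$s =~ s/\.avi$//i;
$s =~ s/[^a-zA-Z0-9-]/+/g;

# get html of film page (the URL is quoted so the shell does not interpret '&')
my $c = `lynx -source 'http://www.google.com/search?q=$s+site%3Awww.imdb.com%2Ftitle&btnI'`;

# strip scripts, styles, remaining html tags and entities
$c =~ s/<script.*?>.*?<\/script>/ /gis;
$c =~ s/<style.*?>.*?<\/style>/ /gis;
$c =~ s/<[^>]+>/ /g;
$c =~ s/&[a-z#0-9]+;/ /gi;
print $c;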