Re: [Tracker] libstreamanalyzer
- From: Martyn Russell <martyn lanedo com>
- To: Evgeny Egorochkin <phreedom stdin gmail com>, Tracker mailing list <tracker-list gnome org>
- Subject: Re: [Tracker] libstreamanalyzer
- Date: Tue, 24 Nov 2009 12:25:09 +0000
On 24/11/09 10:30, Evgeny Egorochkin wrote:
On Tuesday 24 November 2009 10:02:04, Martyn Russell wrote:
On 24/11/09 01:33, Evgeny Egorochkin wrote:
Hi,
I have forwarded this to the mailing list so Jos can see it there. This
may also be interesting to others.
Hi Martyn,
I've been informed that you decided to not use libstreamanalyzer at this
moment.
While this is sad, even sadder is the fact that we haven't received any
feedback. If we don't know what's broken, it's very unlikely to be fixed.
If you have a list of issues, I'm sure it will help to improve
libstreamanalyzer ;)
Hello,
I didn't make a definitive decision. I did an initial analysis and my
testing is not yet finished. For now, we decided to make Tracker work
*with* LSA and as a fallback.
Oh ok. I'm glad I was misled :)
I think it's natural to use LSA initially in cases where tracker-extract
can't work at all, because something is better than nothing, and then, as
issues get sorted out, to try using LSA in other cases.
I need to do more testing with other data sets to be sure, but
initially, we found that LSA didn't provide as much ontology as our
in-house extractors
This may be skewed by some trivial or "technical" properties (like one library
providing the file extension and the other not providing it) or by some file
type. But of course some LSA analyzers aren't very verbose.
That's to be expected.
and speed wise wasn't much different.
Analyzers are mostly IO-bound. The likely reasons for the difference are some
analyzer doing unnecessary seeking (or necessary seeking and providing more
information) and the use of an external lib which may introduce unknown
overheads. So any major speed difference in equal circumstances = bug :)
:)
I don't think I ever profiled lsa, so it might be a very trivial thing as well.
w/ LSA: (many ontology errors in tracker-store which may contribute)
Tracker-Message: Finished mining in seconds:150.287733, total
directories:2460, total files:23738
--
w/o LSA:
Tracker-Message: Finished mining in seconds:98.096369, total
directories:2460, total files:23738
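The two runs above can be compared with a few lines of arithmetic. This is only a sketch: the function name is made up for illustration, and it assumes the two "Finished mining" figures are the only inputs and that both runs covered the same 23738 files.

```python
def files_per_second(seconds, files):
    """Throughput of one mining run, in files per second."""
    return files / seconds


# Figures copied from the two Tracker-Message lines above.
FILES = 23738
WITH_LSA_SECONDS = 150.287733
WITHOUT_LSA_SECONDS = 98.096369

with_lsa_rate = files_per_second(WITH_LSA_SECONDS, FILES)
without_lsa_rate = files_per_second(WITHOUT_LSA_SECONDS, FILES)

# How much longer the run with LSA took, as a ratio of wall-clock times.
slowdown = WITH_LSA_SECONDS / WITHOUT_LSA_SECONDS

print(f"with LSA:    {with_lsa_rate:.1f} files/s")
print(f"without LSA: {without_lsa_rate:.1f} files/s")
print(f"slowdown:    {slowdown:.2f}x")
```

The ratio says nothing about where the extra time goes; the ontology errors in tracker-store mentioned above may account for part of it.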
The MP3 extractor was limited too. In some brief testing, some metadata
which we absolutely need was missing.
This is really strange. I made sure that LSA's analyzer supports all
properties supported by trunk tracker-extract. The only real difference atm is
that LSA supports ID3v1, v2.3 and v2.4, and tracker-extract also supports v2
(which is rather uncommon).
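To see which tag versions a given test file actually carries, something like the sketch below may help when comparing the two extractors. It is not code from LSA or tracker-extract. It relies only on the ID3 layout (an ID3v2 tag starts with "ID3" followed by a major-version byte, and an ID3v1 tag is the last 128 bytes of the file starting with "TAG"), and the function name is hypothetical. Reading "v2" above as ID3v2.2 is an assumption.

```python
import io


def id3_versions(stream):
    """Return the ID3 tag versions found in a seekable binary stream."""
    found = []

    # ID3v2: 10-byte header at the start, "ID3" then the major version byte.
    stream.seek(0)
    header = stream.read(10)
    if len(header) == 10 and header[:3] == b"ID3":
        found.append(f"ID3v2.{header[3]}")

    # ID3v1: fixed 128-byte block at the very end, starting with "TAG".
    stream.seek(0, io.SEEK_END)
    size = stream.tell()
    if size >= 128:
        stream.seek(size - 128)
        if stream.read(3) == b"TAG":
            found.append("ID3v1")

    return found


# Synthetic example: an ID3v2.3 header, padding, then an ID3v1 block.
sample = io.BytesIO(
    b"ID3\x03\x00\x00\x00\x00\x00\x00" + b"\x00" * 200 + b"TAG" + b"\x00" * 125
)
print(id3_versions(sample))
```

For a real file, pass `open(path, "rb")` instead of the in-memory sample. The sketch only detects tags; it does not parse frames or validate the tag size.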
We have a LOT of broken MP3s too, so testing those often finds
interesting results. We have been fixing these cases for > 6 months now.
GStreamer is not supported as standard which means adding support for
each video/audio format is going to be painful. GStreamer has years of
experience in this field. To do it all again for LSA doesn't feel like
the right way to go here.
I don't think there's a conflict between using gstreamer and in-house
analyzers. There's nothing wrong with optimizing some in-house analyzers for
the most popular formats and use GStreamer for everything else. The wrong
thing is that there's no GStreamer analyzer atm.
I agree.
I did look into a GStreamer analyzer as you suggested. To be honest I expected
it to be much easier. Tracker's analyzer looks really scary, so instead I
focused on other issues which I felt were important too and (looked) easier.
Well, GStreamer is interesting :) we find it isn't especially fast
either and have considered keeping things resident (almost like a
daemon) to make processing files faster. The slowness IIRC is from
creating all the pipelines initially and doing that for each file isn't
great.
This however doesn't mean that making a GStreamer analyzer is going to be
harder than in Tracker or that such an effort is not welcome.
I did dump the sparql that was generated into a log file (just the
sparql) for each case:
w/ LSA:
-rw-r--r-- 1 martyn martyn 2004430 2009-10-28 12:29 tracker-extract.log.withLSA
w/o LSA:
-rw-r--r-- 1 martyn martyn 9800568 2009-10-28 12:36 tracker-extract.log.withoutLSA
So initially, EVEN without fixing tracker-store's missing ontologies
that LSA exports, we still produce 5 times the data for ~24k files. This
may be affected by the FTS content we include, however.
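The "5 times" figure can be checked from the two log sizes listed above. This is a rough sketch: the variable names are illustrative, and the per-file averages assume both logs cover the same 23738 files reported by the mining runs.

```python
# Sizes in bytes, copied from the two `ls -l` lines above.
WITH_LSA_BYTES = 2004430      # tracker-extract.log.withLSA
WITHOUT_LSA_BYTES = 9800568   # tracker-extract.log.withoutLSA
FILES = 23738                 # total files from the mining output

# How much more SPARQL the in-house extractors produced.
size_ratio = WITHOUT_LSA_BYTES / WITH_LSA_BYTES

# Average bytes of SPARQL per indexed file in each case.
with_lsa_per_file = WITH_LSA_BYTES / FILES
without_lsa_per_file = WITHOUT_LSA_BYTES / FILES

print(f"size ratio: {size_ratio:.2f}x")
print(f"with LSA:    {with_lsa_per_file:.0f} bytes/file")
print(f"without LSA: {without_lsa_per_file:.0f} bytes/file")
```

Since the logs contain the full SPARQL, any FTS text included by the in-house extractors counts towards these sizes, so the ratio measures bytes rather than the number of properties extracted.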
I need to do some more testing to actually get a better all round
perspective. I didn't index much music or many images. So I want to do
that at some point. The stats produced from this data set were:
Statistics:
mfo:Action = 1
mto:State = 9
mto:TransferMethod = 2
mtp:ScanType = 6
nco:Contact = 17
nco:Role = 18
nfo:Audio = 100
nfo:BookmarkFolder = 2
nfo:DataContainer = 1746
nfo:Document = 4014
nfo:Executable = 263
nfo:FileDataObject = 17374
nfo:Folder = 1631
nfo:Image = 2511
nfo:Media = 2611
nfo:MediaFileListEntry = 6
nfo:MediaList = 4
nfo:Orientation = 8
nfo:PaginatedTextDocument = 210
nfo:PlainTextDocument = 3804
nfo:Software = 263
nfo:SoftwareApplication = 263
nfo:SoftwareCategory = 115
nfo:TextDocument = 4014
nfo:Video = 82
nfo:Visual = 2593
nie:DataObject = 17374
nie:DataSource = 2
nie:InformationElement = 17615
nmm:Artist = 1
nmm:Flash = 2
nmm:MeteringMode = 7
nmm:MusicPiece = 18
nmm:Photo = 1932
nmm:RadioModulation = 2
nmm:Video = 82
nmm:WhiteBalance = 2
rdf:Property = 520
rdfs:Class = 208
rdfs:Resource = 18454
scal:AccessLevel = 3
scal:AttendanceStatus = 7
scal:AttendeeRole = 4
scal:CalendarUserType = 5
scal:EventStatus = 3
scal:JournalStatus = 4
scal:RSVPValues = 2
scal:TodoStatus = 4
scal:TransparencyValues = 2
tracker:Namespace = 22
tracker:Volume = 1
Hope this helps. I would like to investigate some more though.
Thanks. This did clarify the matter quite a bit. I hope your testing branch
will be easily accessible sometime soon so that I could use it to find and fix
at least the most obvious issues.
Can I forward this email to Jos van den Oever? I guess he'll be interested
too.
Sent to the mailing list instead. The link should be available from here
when it arrives:
http://mail.gnome.org/archives/tracker-list/2009-November/thread.html
--
Regards,
Martyn