Re: [Tracker] libstreamanalyzer



On 24/11/09 10:30, Evgeny Egorochkin wrote:
On Tuesday 24 November 2009 10:02:04, Martyn Russell wrote:
On 24/11/09 01:33, Evgeny Egorochkin wrote:

Hi,

I have forwarded this to the mailing list so Jos can see it there. This may also be interesting to others.

Hi Martyn,

I've been informed that you decided to not use libstreamanalyzer at this
moment.

While this is sad, even sadder is the fact that we haven't received any
feedback. If we don't know what's broken, it's very unlikely to be fixed.

If you have a list of issues, I'm sure it will help to improve
libstreamanalyzer ;)

Hello,

I didn't make a definitive decision. I did an initial analysis and my
testing is not yet finished. For now, we decided to make Tracker work
*with* LSA and as a fallback.

Oh ok. I'm glad I was misled :)
I think it's natural to use LSA initially in cases where tracker-extract
can't work at all, because something is better than nothing, and then, as
issues get sorted out, to try LSA in other cases.
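
The fallback arrangement described above could be sketched roughly as follows. This is only an illustration in Python: `extract_with_fallback`, `in_house` and `lsa` are hypothetical names, not Tracker or libstreamanalyzer APIs, and the assumption that an extractor returns nothing (or raises) when it cannot handle a file is mine.

```python
from typing import Callable, Dict, Optional

Metadata = Dict[str, str]
Extractor = Callable[[str], Optional[Metadata]]


def extract_with_fallback(path: str, primary: Extractor,
                          fallback: Extractor) -> Optional[Metadata]:
    """Try the primary extractor; use the fallback only if it yields nothing."""
    try:
        result = primary(path)
    except Exception:
        # Assumption: a failing primary extractor is treated like "no data".
        result = None
    if result:
        return result
    return fallback(path)


# Hypothetical stand-ins for the two extractors discussed in this thread.
def in_house(path: str) -> Optional[Metadata]:
    return {"source": "in-house"} if path.endswith(".mp3") else None


def lsa(path: str) -> Optional[Metadata]:
    return {"source": "lsa"}


print(extract_with_fallback("song.mp3", in_house, lsa))
print(extract_with_fallback("data.xyz", in_house, lsa))
```

Swapping the `primary` and `fallback` arguments is all it would take to try LSA first for a given file type once its output is trusted.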

I need to do more testing with other data sets to be sure, but
initially, we found that LSA didn't provide as much ontology as our
in-house extractors

This may be skewed by some trivial or "technical" properties (like one library
providing the file extension and the other not providing it) or by some file
type. But of course some LSA analyzers aren't very verbose.

That's to be expected.

and speed wise wasn't much different.

Analyzers are mostly IO-bound. The likely reasons for the difference are some
analyzer doing unnecessary seeking (or necessary seeking and providing more
information) and the use of an external lib, which may introduce unknown overheads.

So any major speed difference in equal circumstances = bug :)

:)

I don't think I ever profiled LSA, so it might be something very trivial as well.

    w/  LSA: (many ontology errors in tracker-store which may contribute)

    Tracker-Message: Finished mining in seconds:150.287733, total
    directories:2460, total files:23738

    --

    w/o LSA:

    Tracker-Message: Finished mining in seconds:98.096369, total
    directories:2460, total files:23738
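
To put the two timings above side by side, here is a small Python calculation using only the figures from the log lines. The variable names are mine, and the result inherits the caveat already stated: the ontology errors in tracker-store may account for part of the difference.

```python
# Figures copied from the two "Finished mining" log lines above.
FILES = 23738
SECONDS_WITH_LSA = 150.287733
SECONDS_WITHOUT_LSA = 98.096369


def files_per_second(files: int, seconds: float) -> float:
    """Average throughput over the whole mining run."""
    return files / seconds


rate_with = files_per_second(FILES, SECONDS_WITH_LSA)
rate_without = files_per_second(FILES, SECONDS_WITHOUT_LSA)
slowdown = SECONDS_WITH_LSA / SECONDS_WITHOUT_LSA

print(f"w/  LSA: {rate_with:.1f} files/s")
print(f"w/o LSA: {rate_without:.1f} files/s")
print(f"slowdown with LSA: {slowdown:.2f}x")
```

These are whole-run averages, so they say nothing about which file types or analyzers account for the extra time.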

The MP3 extractor was limited too. In some brief testing, some metadata
that we would absolutely need was missing.

This is really strange. I made sure that LSA's analyzer supports all
properties supported by trunk tracker-extract. The only real difference atm is
that LSA supports ID3v1, v2.3, v2.4 and tracker-extract also supports v2 (which
is rather uncommon).
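
For checking which tag versions a test file actually carries, a minimal Python sketch is given below. It relies only on the ID3 layout: an ID3v2 tag starts with the bytes "ID3" followed by a major version byte, and an ID3v1 tag is a 128-byte trailer starting with "TAG". The function name `id3_versions` is hypothetical, and the sketch assumes the ID3v2 tag sits at the start of the file, so it does not look for appended tags or validate the rest of the header.

```python
from typing import List


def id3_versions(data: bytes) -> List[str]:
    """Report which ID3 tag versions appear to be present in raw file bytes."""
    found = []
    # ID3v2 header is 10 bytes: "ID3", major version, revision, flags, size.
    if len(data) >= 10 and data[:3] == b"ID3":
        found.append(f"ID3v2.{data[3]}")
    # ID3v1 is a fixed 128-byte block at the very end, starting with "TAG".
    if len(data) >= 128 and data[-128:-125] == b"TAG":
        found.append("ID3v1")
    return found


# Synthetic examples (not real MP3 data): a v2.3 header plus a v1 trailer.
sample = b"ID3\x03\x00\x00\x00\x00\x00\x00" + b"\x00" * 64 + b"TAG" + b"\x00" * 125
print(id3_versions(sample))
```

Running this over a directory of problem files would show quickly whether the missing metadata correlates with a particular tag version.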

We have a LOT of broken MP3s too, so testing those often finds interesting results. We have been fixing these cases for > 6 months now.

GStreamer is not supported as standard which means adding support for
each video/audio format is going to be painful. GStreamer has years of
experience in this field. To do it all again for LSA doesn't feel like
the right way to go here.

I don't think there's a conflict between using gstreamer and in-house
analyzers. There's nothing wrong with optimizing some in-house analyzers for
the most popular formats and using GStreamer for everything else. What is
wrong is that there's no GStreamer analyzer atm.

I agree.

I did look into a GStreamer analyzer as you suggested. To be honest I expected
it to be much easier. Tracker's analyzer looks really scary, so instead I
focused on other issues which I felt were important too and (looked) easier.

Well, GStreamer is interesting :) We find it isn't especially fast either, and we have considered keeping things resident (almost like a daemon) to make processing files faster. The slowness, IIRC, comes from creating all the pipelines initially, and doing that for each file isn't great.

This however doesn't mean that writing a GStreamer analyzer is going to be
harder than in Tracker or that such an effort is not welcome.

I did dump the sparql that was generated into a log file (just the
sparql) for each case:

    w/  LSA:
    -rw-r--r-- 1 martyn martyn 2004430 2009-10-28 12:29 tracker-extract.log.withLSA

    w/o LSA:
    -rw-r--r-- 1 martyn martyn 9800568 2009-10-28 12:36 tracker-extract.log.withoutLSA

So initially, EVEN without fixing tracker-store's missing ontologies
that LSA exports, we still produce roughly 5 times the data for ~24k files.
This may be affected by the FTS content we include, however.
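
The ratio can be checked directly from the two file sizes listed above. A short Python calculation, with variable names of my own choosing and the file count taken from the mining log earlier in this mail:

```python
# Sizes in bytes, copied from the two directory listings above.
SIZE_WITH_LSA = 2004430
SIZE_WITHOUT_LSA = 9800568
FILES = 23738  # total files reported by the mining runs

ratio = SIZE_WITHOUT_LSA / SIZE_WITH_LSA
bytes_per_file_with = SIZE_WITH_LSA / FILES
bytes_per_file_without = SIZE_WITHOUT_LSA / FILES

print(f"w/o LSA log is {ratio:.2f}x the size of the w/ LSA log")
print(f"w/  LSA: {bytes_per_file_with:.0f} bytes of SPARQL per file")
print(f"w/o LSA: {bytes_per_file_without:.0f} bytes of SPARQL per file")
```

Log size is only a rough proxy for the amount of metadata, since it also counts the FTS content mentioned above.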

I need to do some more testing to actually get a better all round
perspective. I didn't index much music or many images. So I want to do
that at some point. The stats produced from this data set were:

Statistics:
    mfo:Action = 1
    mto:State = 9
    mto:TransferMethod = 2
    mtp:ScanType = 6
    nco:Contact = 17
    nco:Role = 18
    nfo:Audio = 100
    nfo:BookmarkFolder = 2
    nfo:DataContainer = 1746
    nfo:Document = 4014
    nfo:Executable = 263
    nfo:FileDataObject = 17374
    nfo:Folder = 1631
    nfo:Image = 2511
    nfo:Media = 2611
    nfo:MediaFileListEntry = 6
    nfo:MediaList = 4
    nfo:Orientation = 8
    nfo:PaginatedTextDocument = 210
    nfo:PlainTextDocument = 3804
    nfo:Software = 263
    nfo:SoftwareApplication = 263
    nfo:SoftwareCategory = 115
    nfo:TextDocument = 4014
    nfo:Video = 82
    nfo:Visual = 2593
    nie:DataObject = 17374
    nie:DataSource = 2
    nie:InformationElement = 17615
    nmm:Artist = 1
    nmm:Flash = 2
    nmm:MeteringMode = 7
    nmm:MusicPiece = 18
    nmm:Photo = 1932
    nmm:RadioModulation = 2
    nmm:Video = 82
    nmm:WhiteBalance = 2
    rdf:Property = 520
    rdfs:Class = 208
    rdfs:Resource = 18454
    scal:AccessLevel = 3
    scal:AttendanceStatus = 7
    scal:AttendeeRole = 4
    scal:CalendarUserType = 5
    scal:EventStatus = 3
    scal:JournalStatus = 4
    scal:RSVPValues = 2
    scal:TodoStatus = 4
    scal:TransparencyValues = 2
    tracker:Namespace = 22
    tracker:Volume = 1

Hope this helps. I would like to investigate some more though.

Thanks. This did clarify the matter quite a bit. I hope your testing branch
will be easily accessible sometime soon so that I could use it to find and fix
at least the most obvious issues.

Can I forward this email to Jos van den Oever? I guess he'll be interested
too.

Sent to the mailing list instead. The link should be available from here when it arrives:

  http://mail.gnome.org/archives/tracker-list/2009-November/thread.html

--
Regards,
Martyn


