Re: [Tracker] The Utopian idea, Tracker as it should be

From: Philip Van Hoof <philip codeminded be>
To: Martyn Russell <martyn lanedo com>, Tracker mailing list <tracker-list gnome org>
Subject: Re: [Tracker] The Utopian idea, Tracker as it should be
Date: Wed, 17 Sep 2014 20:07:11 +0200

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 17/09/2014 17:05, Martyn Russell wrote:

o. libtracker-sparql and tracker-store get merged together.
Perhaps we rename libtracker-sparql, perhaps not, perhaps it
doesn't matter.

o. Instances of tracker-store become managed by
libtracker-sparql (through D-Bus service activation or not, it's
an implementation detail of libtracker-sparql either way)

o. Nepomuk becomes an upstream project, managed separately

o. Applications that need to deal with metadata will depend on
Nepomuk (managed separately) and libtracker-sparql. Just like how
they could if they'd use SQLite depend on libsqlite and on their
own DB schema


So far so good. However, I would like opinion from Jürg before we 
dismantle + lift + shift code into libtracker-sparql, because
actually what this means is, libtracker-sparql becomes:

libtracker-common libtracker-data libtracker-bus libtracker-direct 
libtracker-sparql-backend libtracker-sparql tracker-store

ALL in one git repository?


yes, or in one git repository but separately packagable (already the
case for the most part, except that tracker-store can not be bundled
with libtracker-sparql).

Where the code is, is not that important. My opinion on splitting to
multiple repos has more to do with maintainership. If a big group of
maintainers maintains one repo with separate-ish projects, and they
get along just fine that way: that works, and whatever works works.

The exception I'd make would be data/ontologies for the Nepomuk
ontology, as I would really want the ontology to be co-maintained by
KDE and also the other non-desktop industries that want to use Tracker.

We almost got to that point during the Desktop summit. We were even
already discussing setting up repositories and commit rules at that
time. And then it didn't happen, unfortunately.

That's probably not such a bad thing and really just a packaging 
difference in the end (for the most part).


Correct.

The mining of metadata
----------------------

o. The 'Tracker' project will contain only miners


Actually, I think the 'Tracker' project should be all of the above
for libtracker-sparql with a binary command line 'tracker' used to 
communicate with the DB on the most basic (command line) level.


Ok yes. What 'Tracker' will bundle is less of a concern to me. I think
it should probably end up as a metadata package in Debian (if it is
not already like that) that ensures that the different components get
installed. Maybe a tracker-world package? :)

The main reason for this is that people are used to running
tracker-* commands already and it will be an easier cross over to
keep 'tracker' as the official name.


Yep. The name is fine.

It's what we do: we track your metadata :)

Renaming the SPARQL endpoint and/or libtracker-sparql and Nepomuk:
maybe. For Nepomuk it's probably inevitable. About libtracker-sparql
I don't care.

It would imply a total API change, so why would we do that?

I looked at the git source and they have this kind of structure,
mind you they don't really have separate projects either.


Yep. The way git is organized sounds good.

o. The miners will depend on Nepomuk (managed separately) and 
libtracker-sparql (like any other application) and on
tracker-extract


Makes sense to depend on a Nepomuk project for the ontology
indeed.

BUT tracker-extract technically is a miner, so having other miners 
depend on it doesn't really make sense here. It's also optional, if
you only want file data and basic RDF type data (e.g. nfo:Audio),
you shouldn't need tracker-extract.

nod

o. tracker-extract gets a public API (DBus FD passing based) that
isn't deeply coupled with tracker-miner-fs


It isn't right now.

You can index your content without tracker-extract running and then
a week later decide to extract more information and tracker-extract
will populate the rest using extractors. It's coupled to GRAPH
UPDATED and the ontology mainly. Carlos correct me if I am wrong.



Yes. And it's ideal that way: the MTP daemon could in other words play
the role of tracker-miner-fs, and instruct tracker-extract to do the
remainder.

So it's more or less already the way it should be.

That's why I liked the passive-extraction branch :). Love it when we
go in the right direction.

o. tracker-extract therefore becomes a separate project
(applications that want to use it can depend on it, without
having to depend on Tracker's other miners). It deals with
metadata so it too depends on libtracker-sparql and on Nepomuk
(managed separately) (like any other application)


I think this makes sense too. We provide this functionality on the 
command line (i.e. displaying what we know about a file).

I wonder how applications would "use it" - I guess GNOME documents
could decide to get SPARQL from foo.pdf to insert it themselves if
they wanted, but that's really what tracker-extract does already -
I don't see the added value here. We've not done this before and
we've not had requests from people to do this either. I would
rather add this sort of thing later if someone wants it bad
enough.


I guess it would be fine that way if we can just tell tracker-extract
that when it extracts a file /tmp/mtp-martyn-001-foo.pdf that it will
be adding metadata for a file that will end up as
/home/martyn/Documents/foo.pdf after rename() of the MTP daemon.

I think right now that's not possible, but I think the adaptations
needed for tracker-extract to support this would now, after the
passive-extract work, be minimal. Right?

Note that it should really be <subject> based and not nie:url based.

Because /tmp/something-to-extract might not be a file that will end up
on the filesystem, but instead a chunk of data about a website.

So it's not file:///home/martyn/Documents/foo.pdf that needs to be
passed but the <subject> for the nie:InformationElement created for
foo.pdf that will end up, after rename(), in your $HOME/Documents.
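
A rough sketch of what "<subject> based" could look like from the MTP
daemon's side. This only builds SPARQL strings; the URN, the title and
the helper names are hypothetical, and it assumes nie:url and
tracker:available can be set on the resource the way this thread
describes. It is not tracker-extract's actual API.

```python
# Sketch only: the URN, the title and all helper names are hypothetical.
def sparql_literal(text):
    """Escape a Python string as a double-quoted SPARQL literal."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def insert_metadata(subject, title):
    """Metadata keyed on <subject>; no nie:url yet, the data is still in /tmp."""
    return (
        "INSERT OR REPLACE { <%s> a nie:InformationElement ; nie:title %s ; "
        "tracker:available false }" % (subject, sparql_literal(title))
    )


def publish_after_rename(subject, final_path):
    """After rename(), attach the final location and mark it available."""
    return (
        "INSERT OR REPLACE { <%s> nie:url %s ; tracker:available true }"
        % (subject, sparql_literal("file://" + final_path))
    )


subject = "urn:uuid:hypothetical-foo-pdf"
before = insert_metadata(subject, "Foo")
after = publish_after_rename(subject, "/home/martyn/Documents/foo.pdf")
```

The point is that both updates name the same <subject>, so nothing
depends on the temporary path in /tmp.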

Having said all that, the external-crawler work I recently did
(where external data sources push information through the
libtracker-miner stack to be indexed) could benefit from reusing
the extractor work.


Yep.

I hasten to add, the ontology is usually most closely related to
this area of the code and where we see the most inconsistencies or
bugs due to broken ontology use.


Yes, the ontology must be in good hands for sure.

o. tracker-miner-fs accepts (in implementation) that others can
provide metadata (by integrating with tracker-extract or not) and
that it should not interfere (this is already somewhat in place
by using the graph support - our insert-or-replace statement
already replaces only in our own miner-fs graph, giving
precedence to other graphs). It deals with metadata so it too
depends on libtracker-sparql and on Nepomuk (managed separately)
(like any other application)


Yea I agree.
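
The graph idea above can be sketched with a toy in-memory model. This
is not how tracker-store is implemented; the graph names, the Store
class and the lookup rule are assumptions made only to illustrate
"replace in our own graph, give precedence to other graphs".

```python
MINER_FS_GRAPH = "urn:graph:miner-fs"  # hypothetical graph name


class Store:
    """Toy model: each graph maps (subject, predicate) to one value."""

    def __init__(self):
        self.graphs = {}

    def insert_or_replace(self, graph, subject, predicate, value):
        """Replace the value inside the given graph only."""
        self.graphs.setdefault(graph, {})[(subject, predicate)] = value

    def lookup(self, subject, predicate):
        """Prefer a value from any other graph over the miner-fs one."""
        key = (subject, predicate)
        for graph, triples in self.graphs.items():
            if graph != MINER_FS_GRAPH and key in triples:
                return triples[key]
        return self.graphs.get(MINER_FS_GRAPH, {}).get(key)


store = Store()
store.insert_or_replace("urn:graph:mtp-daemon", "urn:x", "nie:title", "From MTP")
store.insert_or_replace(MINER_FS_GRAPH, "urn:x", "nie:title", "From miner-fs")
title = store.lookup("urn:x", "nie:title")
```

Here the miner-fs write lands in its own graph and leaves the value the
MTP daemon provided untouched.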

libtracker-sparql + tracker-store: Allowing multiple ontologies
to be used. Applications don't care about tracker-store. They
just want an API to launch their SPARQL and SPARQL INSERT queries
on (and that's really it). They also want GraphUpdated, which is
problematic as this would need to be separate per ontology too
(fair enough).


I agree, and with any luck this should make sandboxing or
isolating ontology testing or data sets much easier.


Exactly.

tracker-extract separate: Allowing MTP daemons to enrich
metadata themselves on a file in /tmp before doing the rename()
to the final destination in $HOME. Allowing them to control the
metadata insertion instead of letting inotify in tracker-miner-fs
pick up the file after rename (metadata up front, before the file
is ready). To indicate that the file isn't ready we have the
tracker:available property.


You should know, this is already possible and I know of real use
cases doing this too.


Yep. Except that its API isn't really declared "public" yet. And it
needs to allow passing the <subject> of the nie:InformationElement,
and work independent of GraphUpdated if so instructed.

All small adaptations afaik.

Nepomuk separate: Sharing the ontology with KDE desktop, without
GNOME's politics interfering or trying to dominate needlessly
the processes (which, whether GNOME people like this or not,
would imply that KDE simply wouldn't use it). Where this gets
hosted? FDO? nepomuk-desktop.org? Jesus, I don't care.


I'm all for sharing, but our situation has always been slightly 
different, we have a lot of extensions and things which the
original ontology doesn't have, so we can't strictly follow it
anyway. I don't know how this will sit with the KDE folk if they
want to use Tracker's ontology.



You can add the extensions in a separate .ontology file.

For example we can add tracker:indexed and tracker:notify to an
.ontology that only we ship. Our current code even already supports that.

Imagine you have this in the upstream Nepomuk's nie.ontology:

nie:title a rdf:Property ;
        rdfs:label "Title" ;
        rdfs:comment "The title of the document" ;
        rdfs:subPropertyOf dc:title ;
        rdfs:domain nie:InformationElement ;
        rdfs:range xsd:string .

Then we can add a tracker-nie-extensions.ontology with:

nie:title nrl:maxCardinality 1 ;
        tracker:fulltextIndexed true ;
        tracker:weight 10 ;
        tracker:writeback true .

That works.
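
As a toy illustration of why that works: both files make statements
about the same subject, so loading them one after the other simply
accumulates properties. The dict layout and the merge() helper below
are hypothetical and say nothing about how our ontology loader is
actually written.

```python
def merge(base, extension):
    """Combine per-subject property maps without modifying the inputs."""
    merged = {subject: dict(props) for subject, props in base.items()}
    for subject, props in extension.items():
        merged.setdefault(subject, {}).update(props)
    return merged


# Subset of the upstream nie.ontology statements quoted above.
upstream = {
    "nie:title": {
        "rdfs:label": "Title",
        "rdfs:domain": "nie:InformationElement",
        "rdfs:range": "xsd:string",
    }
}

# What tracker-nie-extensions.ontology adds for the same subject.
extension = {
    "nie:title": {
        "nrl:maxCardinality": 1,
        "tracker:fulltextIndexed": True,
        "tracker:weight": 10,
        "tracker:writeback": True,
    }
}

merged = merge(upstream, extension)
```

The upstream description stays free of tracker: properties, and only
the merged view that we ship carries them.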

[cut agreement]

So we outsource this to a more competent team who, unlike us, care
about more than just our implementation of it.


Usually changes to the ontology are closely related to
tracker-extract and extended metadata OR updates in the spec. I
don't think we need to outsource this, we just make it a separate
project and let people get involved, like we did with libmediaart.



Yes initially for sure. The ontology must certainly be in competent
hands that care about us. And at all times, afterwards, must we be
prepared to fork it back into our own project (if incompetent people
take it over). But we can also trust people to do the right thing.

FAQ
---

Q: Would this be a split of the project?
A: Yes, I guess. In four parts (Nepomuk, libtracker-sparql,
tracker-extract, tracker(-miners))


Maybe it's boring, but I would just have a 'tracker-data-miners'
project and have in it:

libtracker-miner libtracker-extract tracker-miner-fs 
tracker-miner-apps tracker-miner-user-guides tracker-miner-rss 
tracker-extract


Sounds good. Maybe *tracker-extract separate for MTP daemons that
don't want a dependency on all the other stuff? But ok like this too.

The hard part is, a lot of those components depend on
libtracker-common, a private library and what would likely be part
of the 'libtracker-sparql' project.


That should probably become a static compiled .a library ...

Not sure if there's that much code that is worth putting in a .so? If
so, so be it, we just have a libtracker-common then.

Q: Does it matter?
A: No, I guess (we'll still love each other on #tracker and
tracker-list. Same maintainers, same overall project, same goals)


Actually I prefer this approach, the pace of development is so
different in different areas of the code base that this would make
things simpler for me as a maintainer.


Yep.

Q: Are you dangerous? Splitting is bad! bad bad bad! You witch! 
A: Very


No it isn't.

Actually, smaller modular based approaches work much better in
general because they have a specific purpose.


Exactly. I'm glad we're agreeing on this!

The smaller the component the smaller the risk of massive
catastrophe in any circumstance.


Yep.

There might be some version turbulence but other than that :)


Nope. We all do semver. Be nice to each other and do semver.
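
For what it's worth, the rule each split-off component would rely on is
small. A minimal sketch, assuming plain MAJOR.MINOR.PATCH strings with
MAJOR >= 1 and no pre-release tags; the version numbers are made up:

```python
def parse(version):
    """Split 'MAJOR.MINOR.PATCH' into a tuple of three integers."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def satisfies(installed, required):
    """True if `installed` can stand in for `required` under semver:
    same major version, and at least the required minor and patch."""
    have, want = parse(installed), parse(required)
    return have[0] == want[0] and have[1:] >= want[1:]


ok_newer_minor = satisfies("1.4.0", "1.2.3")
ok_new_major = satisfies("2.0.0", "1.2.3")
```

A miner built against 1.2.3 of a library keeps working with 1.4.0, but
has to be checked again when 2.0.0 shows up.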

Q: But I use tracker-store's Resources DBus API to do Query and
Insert. And you want me to use libtracker-sparql then, right?
Will that make my beloved DBus API on tracker-store disappear?
A: Yes. You should never have used that one in the first place. The
only public API that we should support is libtracker-sparql (and
GraphUpdated on Resources, but I guess we'll bring that as a
signal on the TrackerSparqlConnection to libtracker-sparql too)


Not sure I agree with this point. But it's not insurmountable.


nod.

Really glad I see so many areas of agreement on this!

Kind regards,

Philip



-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.20 (MingW32)

iQEcBAEBAgAGBQJUGc3PAAoJEEP2NSGEz4aDoTwIAMPd3mzpFEAlKfnux0st3opd
71Xkb9LoJgpi0bxiUQMQuEUla/MlFEHyZtrwlq2PmUDDTay+VwNG66AciO3Bec3l
YEQXbod8PdR8+vuZNl0VOFQwEM07SZgSWjvtFOn02XQZk7e4Qy65El9+xrkSpaU6
8ZuHK+7Lb5/ABe+1PiufkXfPx3T3aICCu5wBrldG6OsNK704ZiVzm4cQ+rLFkgE5
WlJr8ch6VvpP7s2UZAhLNqt+H0/GyRbNotxuFEsyig3hM71Re16LAGMdIFqssemu
Vpas4WAlJS0DrTUSPz3N6DjN4uCVIB4RaZmcG811gLV4/nCwDkn2F/dAmt3wkW8=
=4PAJ
-----END PGP SIGNATURE-----

