Re: [Tracker] Request for review, API allowing external software like MTP daemons to get metadata as correct sparql insert



On 08/01/13 11:13, Philip Van Hoof wrote:
On Fri, 2013-01-04 at 15:20 +0000, Martyn Russell wrote:
On 03/01/13 21:30, Philip Van Hoof wrote:

I'll reply to the most important parts of your review:

Thanks,

8. This looks a bit wrong to me. First, we always set the file from the
temp_file variable given, but then we might set the url to be something
different based on dest_url. Not only is this confusing but I expect it
to cause problems inside tracker if the file used for the URL != file
used for the PATH. I can see why you want to do this, but we didn't do
it this way round with Nokia, we issued a separate update for the file
details when it actually was saved as something else. The problem I see
is that you could have a query which returns urn->path as /foo and
urn->uri as file:///bar and that's clearly wrong:

Yes the separate update was/is the solution for Nokia's N9. The idea
here or now is that no separate update would be needed at all with this
solution: the MTP daemon does both the extraction and the initial sparql
insert in one pass.

And then the MTP daemon does a rename() call from tempfile to dest-url.
Completing the file transfer.

As long as what's on the disk is represented in the DB, that's fine. At the moment we don't have any cases (I know of) where the file is available in the DB but not on the disk (or visa versa).

This can lead to incorrect results for applications using Tracker which we want to avoid.

That sparql insert can happen while the download takes place (allowing
the MTP daemon to insert protocol based metadata before file data is
transferred and available - for extraction).

There are a number of issues with this approach to be considered. For example:

1. How does anyone querying the information know when the file is ACTUALLY available? tracker:available?

2. What if you don't have all the information until it's downloaded and it's needed for querying ahead of time? This is usually the biggest problem here. Simple things like file size or even specific metadata can be important enough here.

15. I've not checked the whole on_fileinfo_received() part because it's
really complex in the miner-fs already and we were fixing problems there
for years related to parent existence and creation prior to insertion of
a child path. This worries me about this branch because I get the
impression from quick review that it's not really addressed fully. You
also seem to allow parent URNs to not exist before adding a child to the
database. That's not how we've done things in the miner-fs and this
really breaks what people using the DB can expect. Correct me if I am wrong.

Right, parent URNs aren't yet created with this solution and should
probably be.

This is a requirement I would say. It's a hard problem to solve well too, just speak to Sam about his recent work to improve this and recursive issues :)

I guess putting INSERT queries upfront the metadata one needs to be
possible as well as getting a separate query for in case the MTP daemon
wants to do multiple passes. ie. First protocol based metadata insert,
which needs the parent URNs at that point and which doesn't yet need the
file content to be available, then extraction based metadata insert; for
which the file content must be available.

Not really. If that insert fails, then you're not guaranteeing the parent is inserted, which we do at the moment.

Separate queries for the parent are really the way to go here. Once that's done you have to handle failures and retries before the child can be added, etc. This is why the miner-fs has a lot of code in this area to handle this stuff. Ideally, it would be good to make use of that code here so we don't have to fix issues we've already fixed in the miner-fs.

And what this creates must also be compatible with what miner-fs will
otherwise create.

Yea I agree.

The idea here is that while a (large) file is being transferred,
protocol based metadata is already in the Nepomuk store available for
applications that are interested in tracker:available-false 'to be
deployed' data. ie. a DivX or DVD being transferred and a movie
application saying: "DVD with title XYZ with actors ABC and written by
DEF is being transferred to your phone (20% completed)". And After that
an extraction to get extra metadata that isn't included in the protocol
based metadata (and all that without needing miner-fs).

We did actually have support for this in the past with inotify and half downloaded files, but Nokia decided not to "notify" about half downloaded files after a while. I think this was an area which changed a few times due to requirements. In the end performance was one of the reasons (due to spamming of updates with everything else going on at the same time).

16. I presume the tracker-storage.[ch] addition to libtracker-extract is
a straight copy of the files from libtracker-miner? I am not sure if
this is such a good idea - but I've not been sure what to do here for a
while now. From a quick git grep on the code, it seems we only use it in
tracker-writeback, tracker-extract and tracker-miner-fs, and since
miner-fs already links with libtracker-extract, it looks like the
logical choice to move it there.

Right

22. The reason I asked for a sync API is so we could use it in
tracker-sparql ;)

Not sure if a sync api is really needed since it would just wrap a
GMainLoop anyway. But could be done for convenience.

I would do it for convenience.

Thanks Philip ;)

--
Regards,
Martyn

Founder and CEO of Lanedo GmbH.



[Date Prev][Date Next]   [Thread Prev][Thread Next]   [Thread Index] [Date Index] [Author Index]