Re: [Tracker] REVIEW: 'follow-symlinks' branch



On 17/03/14 15:40, Aleksander Morgado wrote:
On Mon, Mar 17, 2014 at 10:40 AM, Martyn Russell <martyn lanedo com> wrote:

Hi Aleksander,

> The key point here is to decide when to follow the symlink, because by
> default we shouldn't be following all.

I agree.

> What if there is a symlink to a file which is configured not to be
> indexed at all? The git-annex setup case would be this case actually,
> as the symlink points to a path within the .git directory (hidden, not
> indexed by default).

I was discussing this with Carlos today, saying that we should also have the ability to reference one file in the DB and link to that, to avoid bloating the DB with a separate entry for each symlink. That's more work than my current branch offers right now.

As such, it's likely to be shelved until after the release.

> What if the symlink points to a directory which is already in the list
> of paths to index? Wouldn't that re-index all the files within that
Well, currently, we resolve duplicates BEFORE crawling. This is one case that would creep through the existing safeguards, though I suspect it would just mean an extra SPARQL check before we process the file. Still, it's not ideal.

We do evaluate environment variables before we resolve duplicates too, so we could do something similar for symlinks in the configured dirs. But that doesn't help us for dirs we find under those in the config.
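
To make the idea concrete, here is a rough sketch (in Python rather than Tracker's C, purely for illustration) of resolving symlinks in the configured dirs at the same point where environment variables are expanded, then dropping duplicates. The helper name `normalize_roots` is hypothetical, and the sketch assumes every configured dir is indexed recursively, so a dir nested under another configured dir is redundant. As noted above, it does nothing for symlinks discovered later during the crawl.

```python
import os


def normalize_roots(configured, follow_symlinks=True):
    """Expand variables, optionally resolve symlinks, drop duplicate roots.

    Hypothetical helper for illustration only; not Tracker code.
    Assumes every root is crawled recursively.
    """
    resolved = []
    for path in configured:
        path = os.path.expanduser(os.path.expandvars(path))
        if follow_symlinks:
            path = os.path.realpath(path)   # collapses symlinks in the path
        else:
            path = os.path.abspath(path)
        resolved.append(path)

    roots = []
    # Shortest first, so a parent is always considered before its children.
    for path in sorted(set(resolved), key=len):
        covered = any(os.path.commonpath([path, r]) == r for r in roots)
        if not covered:
            roots.append(path)
    return roots
```

With `$HOME/src` and a symlink `$HOME/link -> $HOME/src` both configured, only the real `src` path survives.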

> path with another path based on the symlink? In a worst case, what if
> the symlink points to the root path? Or what if you end up getting a
> closed loop in the symlinks?

Yea, these are the cases I am more concerned about.
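
For reference, the usual way to guard against loops (and against reaching the same directory through two different paths) is to remember the device and inode of every directory already entered. The following is a minimal Python sketch of that technique under my own assumptions; `walk_following_symlinks` is a made-up name, and this is not what the branch currently does.

```python
import os


def walk_following_symlinks(root):
    """Yield regular files under root, following directory symlinks,
    while entering each real directory at most once."""
    seen = set()      # (st_dev, st_ino) of directories already entered
    stack = [root]
    while stack:
        directory = stack.pop()
        try:
            st = os.stat(directory)          # follows symlinks
        except OSError:
            continue                         # dangling link, loop, or no access
        key = (st.st_dev, st.st_ino)
        if key in seen:
            continue                         # already visited via another path
        seen.add(key)
        try:
            entries = list(os.scandir(directory))
        except OSError:
            continue
        for entry in entries:
            try:
                if entry.is_dir():           # follows symlinks by default
                    stack.append(entry.path)
                elif entry.is_file():
                    yield entry.path
            except OSError:
                continue
```

A symlink pointing back at the root is then harmless: its target's inode is already in `seen`, so it is skipped. Note that each file is reported under whichever path happened to reach its directory first.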

> A maybe not very bad default could be to only follow symlinks if the
> target is a file (i.e. skip symlinks to directories). That would
That's certainly a better approach, I agree.

> handle the git-annex case at least...  I believe your patch does also
> follow symlinks to directories, doesn't it?

Yes, I knocked it up in 5 minutes; it's not a final implementation, but I wondered if there was interest in this beyond the one bug report/RFE.
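
The "files only" default suggested above would amount to a check along these lines. This is an illustrative Python sketch with a hypothetical function name, not the code in the branch, and it does not address the separate question of whether the target lives in a location that is configured not to be indexed.

```python
import os
import stat


def should_follow(path):
    """Return True only for a symlink whose final target is a regular file."""
    if not os.path.islink(path):
        return False                 # not a symlink, nothing to follow
    try:
        target = os.stat(path)       # follows the whole link chain
    except OSError:
        return False                 # dangling link or symlink loop
    return stat.S_ISREG(target.st_mode)
```

Symlinks to directories, dangling links and looping links are all rejected, which sidesteps the root-path and closed-loop cases entirely.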

***

These are all the reasons why I think it's disabled right now.

If I run:

  $ find ./ -type l -print0 | xargs -0 ls -plah

on my $HOME directory, I see a lot of links to duplicate files too, and I think these would also bloat the DB. This is mainly because I have a lot of source repositories.
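
The find output has to be read by eye. A small script can list just the links whose target already lives inside the tree being scanned, which are the ones that would add a second entry for the same content. This is a rough Python sketch; the function name is made up for illustration.

```python
import os


def symlinks_to_indexed_targets(root):
    """Return (link, target) pairs for symlinks under root whose resolved
    target also lives under root."""
    root = os.path.realpath(root)
    duplicates = []
    # os.walk does not descend into directory symlinks by default,
    # but it still lists them in dirnames.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                continue
            target = os.path.realpath(path)
            if not os.path.exists(target):
                continue                     # dangling link
            if os.path.commonpath([root, target]) == root:
                duplicates.append((path, target))
    return duplicates
```

Running it on a home directory and counting the pairs gives a rough idea of how many extra entries following symlinks would create.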

I think it's hard to gauge how disruptive this can be for individuals, and it's really an option for an experienced user or developer. That's why I would make it optional in tracker-preferences, or be very clever about when to follow symlinks (e.g. files, not dirs, as you suggested).

--
Regards,
Martyn

Founder & Director @ Lanedo GmbH.
http://www.linkedin.com/in/martynrussell

