Re: [Tracker] not indexing text from PDF files



On 11/01/2013 05:37 AM, Martyn Russell wrote:

So there are a few things... first, I would check that the file is
indexed before searching ... if it isn't then you won't find those
words.

I thought I confirmed it was indexed with my previous example
tracker-search that returned a result for it -- albeit by its name, not
its contents.

$ tracker-search --disable-color -l 1000 pdf
Results:
...
  file:///home/brian/tmp/2013-10-26-3.pdf
  2013-10-26-3.pdf


Note that tracker-extract does not index the file, it just
extracts the information,

Yes, understood.  I just wanted to confirm that the extractor was
finding something for the indexer to index.

usually tracker-miner-fs calls APIs to talk to
tracker-extract. The example above is really just a way to see what we
find in a file you specify on the command line.

Indeed.  That's exactly what I wanted.  The first step in debugging the
processing chain.

Is the file above file:///home/brian/tmp/2013-10-26-3.pdf ?

Yes.  With the real (and confidential) content replaced in the example
with the "list\nof\nwords\nseparated\nby\ncarriage-returns\n".  I guess
you will just have to trust that I used one of the words from the real
nie:plainTextContent in my search query.  :-/

To make sure the file is indexed, you can use tracker-control -f
$FILENAME and it should take care of that for you.

OK.  Let me give that a whirl.

$ tracker-control -f tmp/2013-10-26-3.pdf
(Re)indexing file was successful

And a search for strings in that file was successful.  So was it always
there or did the "tracker-control -f ..." cause it to be there?  Let's
find another example file to work on...

I had another copy of the same file in the same directory named
2013-10-26-2.pdf (-2 instead of -3) and the search for the string
"RT0001" only returned the -3.pdf result.  Then I ran:

$ tracker-control -f tmp/2013-10-26-2.pdf

and then magically (well, not really so much magic) the -2.pdf file was
being included in the results.

So, there are definitely PDF files with text on my filesystem that are
not indexed but will get indexed when "tracker-control -f" is pointed at
them.
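
A minimal sketch of applying the same per-file fix to a whole directory,
assuming only the "tracker-control -f FILE" invocation shown above.  The
function name reindex_pdfs and the TRACKER_CONTROL override are
hypothetical, added here so the loop can be dry-run before touching the
index:

```shell
# Force a (re)index of every PDF found under the given directory.
# TRACKER_CONTROL is a hypothetical override for dry runs; when it is
# unset, the real tracker-control binary is used.
reindex_pdfs() {
    dir="$1"
    # One path per line; file names containing newlines are not handled.
    find "$dir" -type f -name '*.pdf' | while IFS= read -r file; do
        "${TRACKER_CONTROL:-tracker-control}" -f "$file" ||
            echo "reindex failed: $file" >&2
    done
}
```

Calling it as reindex_pdfs "$HOME/tmp" would hit each PDF in turn.  With
TRACKER_CONTROL=echo set beforehand, the arguments are printed instead of
anything being run.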

I guess there is not much to do here but wipe the whole database and
start a fresh scan, right?  I mean, otherwise who knows what's been
properly indexed and what has not.

Just to make sure I am doing it correctly, what method would you like me
to use to do that complete wipe/rescan?

We are on IRC if you need more help... let us know :)

Will do, thanks.

b.





