Re: Two questions about medusa



On Tue, 2003-05-06 at 07:00, David C Sterratt wrote:
>  > I've toyed with the plugin idea as it might get some things done
>  > quickly.  I'd like to bring some intelligence to what is indexed,
>  > and the plain text indexer cannot handle that.  XML content like
>  > OpenOffice is very rich and it would lose some of its meaning
>  > and relevance if it were crudely converted to plain text.  PDFs
>  > don't have any meaning. They would be fine in your solution.  We
>  > need to weigh the capability of adding ad hoc indexers versus their
>  > potential dependencies.
> 
> How would one use the rich semantics in openoffice files as search
> terms?  At the moment, the searching semantics only allow for
> "contains any or all of".  I suppose you could extend them to things
> like "author matches", but that might be confusing if some of the
> other documents you're searching don't have the rich semantic
> information, since you wouldn't be able to retrieve (say) PDF files
> written by a particular author, but you would get OO files written by
> them.  

I wasn't thinking about search terms, but about ranking by relevance.
Medusa returns the matches, but there is no attempt to locate the best
match, or to rank them from best to worst.  One method to accomplish
this is to record not just the words in a document, but their incidence
too.

The problem is that the words in titles, headers, keywords, abstracts,
etc., should have more weight than a word in a paragraph.  Of course,
different documents have different rules, making storage and retrieval
awkward.

One method that I've employed elsewhere is to use the content indexer as
an equalizer.  Words in a title are duplicated ten times, those in a
heading 5 times, and those in a summary 2 times, then the text is passed
to the plain text parser.  Also, rich documents must be stripped of
meaningless markup words.  You'll see my point if you index your Home
directory.  Search for table (gnome-search:[file:///]content
includes_all_of table) and msearch returns most of epiphany's page cache
and thousands of pages from my javadoc (11,403 files).
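The equalizing trick can be sketched in a few lines of Python.  This is
only an illustration of the idea, not Medusa code; the field names and
the 10/5/2 weights are the ratios mentioned above:

```python
from collections import Counter

# Illustrative weights: how many times to duplicate words from each
# field before handing the text to a plain word-frequency indexer.
FIELD_WEIGHTS = {"title": 10, "heading": 5, "summary": 2, "body": 1}

def equalize(fields):
    """Duplicate words from weighted fields so a plain indexer that
    only counts occurrences still ranks title words highest."""
    words = []
    for field, text in fields.items():
        weight = FIELD_WEIGHTS.get(field, 1)
        words.extend(text.lower().split() * weight)
    return words

def plain_text_index(words):
    # Stand-in for the plain text indexer: word -> occurrence count.
    return Counter(words)

doc = {"title": "medusa indexing", "body": "medusa indexes files quickly"}
index = plain_text_index(equalize(doc))
```

The payoff is that the plain text indexer needs no knowledge of document
structure at all; the structural weighting is folded into its input.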

The content of plain text, or of bad HTML (just table and font tags), is
ambient: nothing stands out, so only the occurrences of words can
count.  That's not to say the context of words cannot be used to
influence ranking.  A smarter plain text indexer might have some rules
to count the first lines more than the last, or any line preceded by a
blank line.  As for your PDF example, there is some metadata in them,
and the font size of a style could be used to influence the ranking, but
that is hard.
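Those context rules for plain text could look something like this (the
particular weights and cutoffs are made up for the sketch):

```python
def line_weight(lines, i):
    """Heuristic weight for line i of a plain text file.  Early lines
    and lines that follow a blank line (likely titles or section
    starts) count more.  All numbers are illustrative."""
    weight = 1
    if i < 3:                           # first few lines: probable title area
        weight += 2
    if i > 0 and lines[i - 1].strip() == "":
        weight += 1                     # line preceded by a blank line
    return weight

text = "Report Title\n\nIntro paragraph here.\nMore body text."
lines = text.split("\n")
weights = [line_weight(lines, i) for i in range(len(lines))]
```

A word's count would then be multiplied by the weight of the line it
appears on before being stored in the index.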

>  >> Secondly, it looks as though medusa can't search for phrases or
>  >> words including globbing characters.  Is that right?
> 
>  > Yeah.  That is a weakness, and a difficult one to overcome. I can
>  > imagine how to add the phrase capabilities by adding some additional
>  > index information.  The globbing (* and ?) could be done with some
>  > ungraceful hacks--but I think we would need to get the OR
>  > functionality working.
> 
> Would it be possible to combine another, more sophisticated full text
> indexer library with the medusa code for the filesystem properties of
> files?  Or if not use a library use some code?  I've found at least
> one indexer that does wildcard and phrase searching (Swish-e), but it
> can't do incremental reindexing.

I'm very willing to link to, borrow, or copy from other libraries.  I do
wish to keep Medusa's dependencies low to make it easier for developers
to incorporate it into their applications.  Evolution's developers once
considered using Medusa to do its indexing, but it wasn't ready and it
was very neglected.  I'd like GNOME to have one good indexing solution,
with apps like Evolution and Rhythmbox using it.

As for globbing and phrase searching, I think it is a storage versus
performance problem.  One method to address the phrase issue is to store
the position of each word along with the file pointer, and to compare
the sequences after doing the intersection of the results.  Globbing
could be handled by looking in the index for candidate words that match
the pattern, then making a union of a query for each of them.
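Both ideas fit in a small sketch with a toy positional index (nothing
here is Medusa's actual on-disk format; doc ids stand in for the file
pointers):

```python
import fnmatch
from collections import defaultdict

# Toy positional index: word -> {doc_id: [positions in that doc]}.
index = defaultdict(dict)

def add_doc(doc_id, text):
    for pos, word in enumerate(text.lower().split()):
        index[word].setdefault(doc_id, []).append(pos)

def phrase_search(phrase):
    """Intersect the document sets for each word, then check that
    the stored positions line up consecutively."""
    words = phrase.lower().split()
    docs = set(index[words[0]])
    for w in words[1:]:
        docs &= set(index.get(w, {}))
    hits = set()
    for doc in docs:
        for start in index[words[0]][doc]:
            if all(start + i in index[words[i]][doc]
                   for i in range(1, len(words))):
                hits.add(doc)
                break
    return hits

def glob_search(pattern):
    """Expand the pattern against the index vocabulary, then take
    the union of the results for each candidate word."""
    docs = set()
    for word in fnmatch.filter(index, pattern):
        docs |= set(index[word])
    return docs

add_doc("a.txt", "the quick brown fox")
add_doc("b.txt", "quick tests run the fox quickly")
```

The storage cost is the position lists; the performance cost of globbing
is that a pattern like `quick*` becomes a union over every matching word
in the vocabulary.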

Also, Medusa uses DB1.  It might be time to consider a newer version
that offers better storage and retrieval methods.  I had to make a
symbolic link to get Medusa to compile after I upgraded to Red Hat 9.
 
-- 
__C U R T I S  C.  H O V E Y____________________
sinzui cox net
Guilty of stealing everything I am.



