Re: Building off Medusa

From: "Manuel Amador (Rudd-O)" <amadorm usm edu ec>
To: Seth Nickell <snickell stanford edu>
Cc: sinzui cox net, desktop-devel-list gnome org, gnome-devel-list gnome org
Subject: Re: Building off Medusa
Date: Thu, 03 Apr 2003 18:31:22 -0500

Seth Nickell wrote:

 - There are important reasons why Medusa runs in user-space.

Security being one of them. But Medusa in its current incarnation simply isn't scalable without a lot of sysadmin effort (which rules out broad implementations and could hamper GNOME as a desktop platform). Imagine a 50-client NFS server being simultaneously indexed by the Medusas of each logged-in user on each of the clients. Now imagine the sysadmins chalking it up to Medusa/GNOME.

If you
are really determined to do a system daemon, Medusa was already built to
do this and that aspect of Medusa could be revived. If you have further
questions why Medusa was steered down the path it's currently on, I'd
love to discuss this with you further.

Thanks =) The reasons for the scale-down were discussed a couple of months ago.

 - Do not underestimate the number of issues and the amount of work to
recreate something like Medusa. It seems simple on the surface, doing it
well is very complex. I do not see even recreating Medusa as feasible in
the scope of a college project (even a big one).

We wouldn't be recreating Medusa. We see ourselves as integrators. We would integrate existing software and write/adapt current user interfaces to take advantage of it. I'm investigating Medusa as a possibility for an index/search service. I'm investigating Xapian too. The thing is, I would agree completely with you if we were to "redo Medusa all over in C", but we won't. We don't have the manpower and we would fail the course.

Incidentally, the course is focused on providing business management solutions. Doing a Medusa isn't one, so we need to provide a complete end-user corporate solution *and* sell it to at least one customer.

 - Medusa is a pretty clean codebase and would be relatively easy to
extend and change.
 - If you choose not to use Medusa and can "deliver", I'd certainly be
in favour of modifying gnome-search-tool to use your system.

This comment makes me extremely happy. I don't want to displace Medusa at all, though. At this point, I'd like to know what Medusa can do, specifically:

* relevance scores for returned documents: crucial for sorting documents
* full-text search using word stems: people don't really remember the exact spelling of a word
* full-text search phonetically: ditto
* incremental live indexing: indexes updated to within the last half minute
* multiuser indexing: so that query results filter out what users can't see
* metadata indexing: to search through files asking for "artist", "author", or "album" (I'm listening to music =)
* offline searches: to provide "volume indexes" (search your 50 CD-ROMs without having them mounted)

The incremental live indexing isn't difficult. Medusa could use libfam to gather modifications to files and reindex them. I've got it pretty much sorted out, although perhaps a "medusa-modifyd" is needed, which puts filesystem change manifests in a FIFO queue; Medusa then reads the queue and reindexes the affected files. This also helps for offline searches. The multiuser thing is crucial in a business setting. The metadata indexing is crucial for me (damnit, I want to find my MP3s).
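
To make the change-queue idea concrete, here is a minimal Python sketch. It assumes nothing about how Medusa or libfam actually report changes: the names ChangeQueue and drain, the event strings, and the extract callback are all made up for illustration. It only shows a FIFO of changed paths where a path that is already waiting is not queued a second time.

```python
from collections import OrderedDict


class ChangeQueue:
    """FIFO of changed paths (hypothetical stand-in for a "medusa-modifyd")."""

    def __init__(self):
        # path -> latest event seen for that path, in first-seen order
        self._pending = OrderedDict()

    def push(self, path, event):
        # Re-assigning an existing key keeps its queue position, so a file
        # that changes ten times before the indexer runs is indexed once.
        self._pending[path] = event

    def pop(self):
        # Oldest path first.
        return self._pending.popitem(last=False)

    def __len__(self):
        return len(self._pending)


def drain(queue, index, extract):
    """Apply queued changes to `index` (a dict of path -> indexed text).

    `extract` is an assumed callback that turns a path into indexable text.
    """
    while len(queue):
        path, event = queue.pop()
        if event == "deleted":
            index.pop(path, None)
        else:
            index[path] = extract(path)
```

A file monitor would call push() for every notification it receives, and the indexer would call drain() whenever it gets scheduled.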

But perhaps most interesting are the key enterprise features: relayed queries and rewritten responses. Instead of having the local indexer index the NFS server, do NOT index it. Let the NFS server's own indexer do it. Then, when a query is received by the search service, the search service relays the query to the NFS server. Finding out which volumes are NFS-mounted is dead easy. A remote client would need the search service to rewrite the path names in its responses (think of a Windows GUI tool searching a Samba server) so the client can open the files.
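
A rough Python sketch of the two pieces mentioned above: spotting NFS mounts and rewriting path names between the server's view and the client's view. It assumes a Linux-style mount table in the format of /proc/mounts (device, mountpoint, type, options, ...), and it ignores escaped characters in mountpoints. All function names are invented for the example; the relay to the remote search service itself is not shown.

```python
import posixpath


def parse_nfs_mounts(mounts_text):
    """Return {mountpoint: (server, export)} for NFS entries in a mount table."""
    mounts = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3 or fields[2] not in ("nfs", "nfs4"):
            continue
        # An NFS device looks like "server:/exported/path".
        server, _, export = fields[0].partition(":")
        mounts[fields[1]] = (server, export)
    return mounts


def find_mount(path, mounts):
    """Return the longest NFS mountpoint containing `path`, or None."""
    best = None
    for mountpoint in mounts:
        prefix = mountpoint.rstrip("/") + "/"
        if path == mountpoint or path.startswith(prefix):
            if best is None or len(mountpoint) > len(best):
                best = mountpoint
    return best


def to_remote(local_path, mountpoint, export):
    """Rewrite a client-side path into the path the server knows."""
    rel = posixpath.relpath(local_path, mountpoint)
    return posixpath.normpath(posixpath.join(export, rel))


def to_local(remote_path, mountpoint, export):
    """Rewrite a path from the server's response so the client can open it."""
    rel = posixpath.relpath(remote_path, export)
    return posixpath.normpath(posixpath.join(mountpoint, rel))
```

With a mount line such as "fileserver:/export/docs /mnt/docs nfs rw 0 0", a hit reported by the server as /export/docs/contracts/a.txt would be handed to the client as /mnt/docs/contracts/a.txt.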

This requires per-volume tracking. You need to keep track of volumes, volume labels and files. This would also help the indexer avoid reindexing a volume when it's remounted somewhere else.

I can appreciate not wanting to use C. My concern if you didn't use C
would be ensuring that a C API was provided so we could integrate it
with GNOME applications. C is, for the most part, the "common
denominator" language on *nix.

Well, you're right. But I assure you that we wouldn't have time to write a C library to connect to the search service. We intend to write an XML vocabulary, and let the clients build their queries in that vocabulary and send them to the search service. We expect people to link up with the search service in that fashion, and perhaps we would make freely licensed example code available to ease that integration work. I think GNOMErs and KDErs won't have a problem, since both platforms have XML libraries. XML also grants us platform independence, zero need to code, extensibility and backwards compatibility.
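
To give an idea of what a client-side query could look like, here is a small Python sketch using the standard xml.etree.ElementTree module. The vocabulary itself has not been designed yet, so the element and attribute names (query, term, field, limit) are pure placeholders, not a proposal for the real schema.

```python
import xml.etree.ElementTree as ET


def build_query(terms, max_results=20):
    """Serialize (field, text) pairs into a hypothetical <query> document."""
    root = ET.Element("query", {"version": "1"})
    for field, text in terms:
        term = ET.SubElement(root, "term", {"field": field})
        term.text = text
    ET.SubElement(root, "limit").text = str(max_results)
    return ET.tostring(root, encoding="unicode")


def parse_query(xml_text):
    """Parse a query on the service side.

    Malformed XML raises ET.ParseError before any query logic runs.
    """
    root = ET.fromstring(xml_text)
    if root.tag != "query":
        raise ValueError("not a query document")
    terms = [(t.get("field", "fulltext"), t.text or "")
             for t in root.findall("term")]
    limit = int(root.findtext("limit", default="20"))
    return terms, limit
```

A client would send the string returned by build_query([("artist", "Medusa")], 5) over whatever transport is chosen, and the service would recover the same terms and limit with parse_query().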

1) Medusa has not always been per-user (in fact, no released version of
Medusa has been per-user). My point is: a lot of work was done on Medusa
to verify that it was secure, make it work well as a system daemon
communicating with user processes, etc, and we still backed down from a
system daemon in the end after all that investment. Don't underestimate
the work involved on this point.

Definitely not. But as explained above, an enterprise knowledge mining solution can't work per-user. Think of an attorney looking for a particular contract on the company's file server. Now think of 50 attorneys looking for different documents.

2) System indexes have a lot of scary security problems. You (or,
perhaps more pointedly, the Linux distributions you want to run your
indexer as "root") have to be confident that there is no way to crash or
confuse your indexer from user created files, file structures, etc. This
becomes a particularly serious issue if you want to have lots of
indexing "plugins" (for example, index the "metadata" from MP3s, AbiWord
documents, etc). Each of these plugins will need to meet that level of
security!

This can be alleviated:

* indexing plugins should be written in high-level, managed languages (Python?). Exceptions should be caught and the program aborted.
* communication among components should use XML. That way the parsers can throw exceptions and the communication can be aborted before any damage is done.

I know what your fears are, and I fully share them. Malicious users could inject malicious files. And if the indexing job were done in C, I'd be scared shitless. But not so with managed languages. About 80% of security bugs can be slashed like that. After that, there's the issue of plugins relaying malicious data to the indexer, but if the communication is done in XML, malicious data might trigger an exception in the indexer, and the indexer would mark the plugin as bad and keep on strolling.
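
A minimal Python sketch of the "mark the plugin as bad and keep going" behaviour. The plugin interface here (a callable that takes a path and returns a dict) is an assumption made for the example, not anything that exists in Medusa. Note that catching exceptions in-process only covers plugins that fail by raising; it does nothing against a plugin that loops forever or eats all the memory.

```python
def run_plugins(plugins, path, disabled=None):
    """Run every enabled plugin on `path`.

    `plugins` maps a plugin name to a callable returning a metadata dict.
    Any plugin that raises, or returns something other than a dict, is
    added to `disabled` and skipped on later calls.
    """
    disabled = set() if disabled is None else disabled
    metadata = {}
    for name, extract in plugins.items():
        if name in disabled:
            continue
        try:
            result = extract(path)
            if not isinstance(result, dict):
                raise TypeError("plugin must return a dict")
            metadata[name] = result
        except Exception:
            # The file (or the plugin) is suspect: drop the plugin, keep indexing.
            disabled.add(name)
    return metadata, disabled
```

The indexer would keep the disabled set around between files, so a plugin that chokes once is not given a second chance until an administrator looks at it.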

3) While a user is logged in, it is highly desirable to index their data
much more frequently. This is easily accommodated with a user space
daemon but requires tricky (though not impossible) games with a system
daemon.

I fully agree that data should be indexed quickly. But why only for logged-in users? Why not for all of them? It's not that hard. Files get modified, and a couple of seconds later the indexer reindexes them, all with the help of FAM and perhaps a separate application (a file monitor queueing service, which could also be a systemwide service; no security risk in that, because it couldn't be polluted by malicious data). Key here is that the indexer runs at the lowest scheduling priority (nice -n 19), so there is no noticeable system performance impact.

4) You can't index as anything other than root because many interesting
user documents will not be world readable.

Exactly.

5) If you have a system index made as root, you need to implement a
search daemon that controls access to that information based on the
interested process's UID and the permissions relevant to each indexed
file. Also note that there can be discrepancies in security created in
between permission changes and re-indexes, which could possibly be a
concern on some systems.

Yes. We are counting on having to implement access control capabilities in the search daemon. Medusa already had that. As for the permission changes, that can be solved with FAM too. chmod on a file? Reindex the file's metadata and presto.
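
As a rough illustration of the access control step, here is a Python sketch that filters search hits by the requesting user's UID and groups, using the owner, group and mode stored in the index for each file. The record layout (a dict with uid, gid and mode keys) is invented for the example, and the check is deliberately simplified: it looks only at the file's own read bits and ignores parent directory permissions and ACLs.

```python
import stat


def can_read(mode, owner_uid, owner_gid, uid, gids):
    """Classic owner/group/other read check on stored permission bits."""
    if uid == 0:
        return True
    if uid == owner_uid:
        return bool(mode & stat.S_IRUSR)
    if owner_gid in gids:
        return bool(mode & stat.S_IRGRP)
    return bool(mode & stat.S_IROTH)


def filter_hits(hits, uid, gids):
    """Keep only the hits that the requesting user is allowed to read."""
    return [h for h in hits
            if can_read(h["mode"], h["uid"], h["gid"], uid, gids)]
```

Because the decision is made from indexed metadata, a chmod has to trigger a reindex of that metadata before the filter reflects the new permissions.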


The current planned Medusa approach, under consideration, is as follows:

 - Data in /home/username is frequently indexed by a user space daemon.
This is done while the user is logged in.
 - A system index is performed as "nobody", allowing searches for files
and information that everyone has read access to (such as man pages,
documentation, etc).

Except for corporate information that is visible only to members of group "management" (fictional setting). Then management can't mine that data.

 - GnomeVFS integration and incremental indexing mean that as soon as a
file is changed the user-space indexing daemon is notified and
re-indexes just that file.

 - User space indexing means it is easy to get information on whether
the mouse and keyboard are in use (something that *was* done with the
system medusa indexer too, but was more tricky) and "back off" to
provide a responsive system.

You don't need to monitor for user activity. Merely setting a very low priority makes for a responsive system. The Microsoft Indexing Service follows this approach.

 - Recently used documents (perhaps an extended version) allows the
medusa user-space indexing daemon to find new areas of the disk where
people keep files that the system indexer wasn't able to access. That
means that even if the files in /Music aren't readable by nobody, if you
access a file in /Music the user space medusa will find that directory
and start indexing it. (This is a touchy point, may not be good, may be,
hard to say)

* we don't want a hundred PCs each indexing the NFS server. We want the search service to delegate queries to NFS servers, so as to avoid network load and wasted disk space

Yes, very important. Medusa currently avoids indexing NFS mountpoints,
but doesn't do anything to solve the "searching nfs mounts" problem.
However, there's no reason medusa can't be extended to do this. It will
certainly be easier than starting from scratch.

* as there is no documentation, we don't know if Medusa can index gigabytes of files, extract their data and metadata, and provide less-than-10-second query response. Our initial prospect for the database, PostgreSQL, can indeed provide less-than-10-second response for queries, provided the proper indexes are applied to the proper tables.

It would be quite possible to port Medusa to using a database as a
backend, or using database backends as an alternate source of
information. (BTW, you might consider looking at SQLite for the local
index case).

PostgreSQL was our choice. Full-text indexing there. But Xapian is shaping up as an amazing contender.

I'm assuming "enterprise-class" here is a euphemism for "networked".

plus sellable for lots of bucks.

=) luck.

