Re: Building off Medusa



Seth Nickell wrote:

 - There are important reasons why Medusa runs in user-space.

Security being one of them. But Medusa in its current incarnation simply isn't scalable without a lot of sysadmin effort (which rules out broad implementations and could hamper GNOME as a desktop platform). Imagine a 50-client NFS server being simultaneously indexed by the Medusas of each logged-in user on each of the clients. Now imagine the sysadmins chalking the load up to Medusa/GNOME.

If you
are really determined to do a system daemon, Medusa was already built to
do this and that aspect of Medusa could be revived. If you have further
questions about why Medusa was steered down the path it's currently on, I'd
love to discuss this with you further.

Thanks =) The reasons for the scale-down were discussed a couple of months ago.

 - Do not underestimate the number of issues and the amount of work to
recreate something like Medusa. It seems simple on the surface, doing it
well is very complex. I do not see even recreating Medusa as feasible in
the scope of a college project (even a big one).

We wouldn't be recreating Medusa. We see ourselves as integrators: we would integrate existing software and write or adapt current user interfaces to take advantage of it. I'm investigating Medusa as a possibility for an index/search service. I'm investigating Xapian too. The thing is, I'd agree completely with you if we were to "redo Medusa all over in C", but we won't. We don't have the manpower, and we would fail the course.

Incidentally, the course is focused on providing business management solutions. Redoing Medusa isn't one, so we need to provide a complete end-user corporate solution *and* sell it to at least one customer.

 - Medusa is a pretty clean codebase and would be relatively easy to
extend and change.
 - If you choose not to use Medusa and can "deliver", I'd certainly be
in favour of modifying gnome-search-tool to use your system.

This comment makes me extremely happy. I don't want to displace Medusa at all, though. At this point, I'd like to know what Medusa can do, specifically:
*relevance scores for returned documents: crucial for sorting documents
*full-text search using word stems: people don't really remember the exact spelling of a word
*full-text search phonetically: ditto
*incremental live indexing: updated indexes to the last half minute
*multiuser indexing: to provide for query returns which filter out what users can't see
*metadata indexing: to search through files asking for "artist", "author", or "album" (I'm listening to music =)
*offline searches: to provide "volume indexes" (search your 50 CD-ROMs without having them mounted)

The incremental live indexing isn't difficult. Medusa could use libfam to gather notifications of modified files and reindex them. I have it pretty well sorted out, although perhaps a "medusa-modifyd" is needed: it would put filesystem change manifests in a FIFO queue, and Medusa would read the queue and reindex the affected files. This also helps with offline searches. The multiuser thing is crucial in a business setting. The metadata indexing is crucial for me (damn it, I want to find my MP3s).
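
Something along these lines, in Python (a sketch only; the daemon name is made up, and the polling loop stands in for real libfam notification, which I haven't wired up yet):

    #!/usr/bin/env python
    # medusa-modifyd (sketch): queue filesystem change manifests in a
    # FIFO for the indexer to drain. Polling stands in for FAM here.
    import os, time, queue, threading

    changes = queue.Queue()                     # the FIFO of paths to reindex

    def watch(root, interval=2):
        """Stand-in for FAM: detect modified files by comparing mtimes.
        A real medusa-modifyd would get these events from libfam."""
        seen = {}
        while True:
            for dirpath, _, names in os.walk(root):
                for name in names:
                    path = os.path.join(dirpath, name)
                    try:
                        mtime = os.stat(path).st_mtime
                    except OSError:
                        continue                # file vanished mid-walk
                    if seen.get(path) != mtime:
                        seen[path] = mtime
                        changes.put(path)       # enqueue a change manifest
            time.sleep(interval)

    def reindexer():
        """Consumer side: the indexer drains the queue as changes arrive."""
        while True:
            path = changes.get()                # blocks until work appears
            print("would reindex:", path)       # real code calls the indexer

    if __name__ == "__main__":
        threading.Thread(target=reindexer, daemon=True).start()
        watch(os.path.expanduser("~"))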

But perhaps most interesting are the key enterprise features: relayed queries and rewritten responses. Instead of having the indexer index the NFS server, do NOT index it; let the NFS server's own index do it. Then, when a query is received by the search service, the search service relays it to the NFS server. Finding out which volumes are NFS-mounted is dead easy. A remote client would also need the search service to rewrite the path names in its responses (think of a Windows GUI tool searching a Samba server) so the client can open the files.
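
The mount detection and path rewriting could look something like this (a sketch; the relay transport isn't shown, and I'm reading /etc/mtab, though /proc/mounts would do too):

    # Sketch: find NFS mounts and rewrite relayed search results.
    # /etc/mtab lines look like: device mountpoint fstype options ...

    def nfs_mounts():
        """Return {local mountpoint: (server, exported path)}."""
        mounts = {}
        for line in open("/etc/mtab"):
            fields = line.split()
            if len(fields) < 3:
                continue
            device, mountpoint, fstype = fields[:3]
            if fstype.startswith("nfs") and ":" in device:
                server, export = device.split(":", 1)
                mounts[mountpoint] = (server, export)
        return mounts

    def rewrite(result_path, export, mountpoint):
        """Map a path returned by the server's index back into the
        client's namespace, so the client can actually open the file."""
        if result_path.startswith(export):
            return mountpoint + result_path[len(export):]
        return result_path

    # For each mountpoint: relay the query to `server`'s search service
    # (transport not shown), then rewrite() every path it returns.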

This requires per-volume tracking. You need to keep track of volumes, volume labels, and files. This would also help the indexer avoid reindexing a volume when it's remounted somewhere else.
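
A toy layout for that tracking (illustrative only; I use SQLite here purely because it's handy, and the schema is invented):

    # Sketch: per-volume tracking so an index survives remounts.
    import sqlite3

    db = sqlite3.connect("volumes.db")
    db.executescript("""
    CREATE TABLE IF NOT EXISTS volumes (
        volume_id  INTEGER PRIMARY KEY,
        label      TEXT UNIQUE,   -- filesystem label (or UUID)
        mountpoint TEXT           -- where it sat at last index time
    );
    CREATE TABLE IF NOT EXISTS files (
        volume_id  INTEGER REFERENCES volumes(volume_id),
        path       TEXT,          -- relative to the volume root
        mtime      REAL
    );
    """)

    # On mount: look the volume up by label. If it's already indexed,
    # only the mountpoint column changes; nothing gets reindexed.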

I can appreciate not wanting to use C. My concern if you didn't use C
would be ensuring that a C API was provided so we could integrate it
with GNOME applications. C is, for the most part, the "common
denominator" language on *nix.

Well, you're right. But I assure you that we wouldn't have time to write a C library to connect to the search service. We intend to define an XML vocabulary and let clients build their queries in that vocabulary and send them to the search service. We expect people to link up with the search service in that fashion, and perhaps we would make freely licensed example code available to ease that integration work. I think GNOMErs and KDErs won't have a problem, since both platforms have XML libraries. XML also grants us platform independence, no bindings to write, extensibility, and backwards compatibility.
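
To give a feel for it, here's how a client might build a query (every element name below is invented; the real vocabulary is yet to be designed):

    # Sketch: building a query in a hypothetical XML vocabulary with
    # nothing but the Python standard library.
    import xml.etree.ElementTree as ET

    query = ET.Element("query", version="1.0")
    ET.SubElement(query, "text").text = "contract smith"
    scope = ET.SubElement(query, "scope")
    ET.SubElement(scope, "volume", label="fileserver")
    meta = ET.SubElement(query, "metadata")
    ET.SubElement(meta, "field", name="author").text = "Smith"

    print(ET.tostring(query, encoding="unicode"))
    # -> <query version="1.0"><text>contract smith</text>...</query>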

1) Medusa has not always been per-user (in fact, no released version of
Medusa has been per-user). My point is: a lot of work was done on Medusa
to verify that it was secure, make it work well as a system daemon
communicating with user processes, etc, and we still backed down from a
system daemon in the end after all that investment. Don't underestimate
the work involved on this point.

Definitely not. But as explained above, an enterprise knowledge-mining solution can't work per-user. Think of an attorney looking for a particular contract on the company's file server. Now think of 50 attorneys looking for different documents.

2) System indexes have a lot of scary security problems. You (or,
perhaps more pointedly, the Linux distributions you want to run your
indexer as "root") have to be confident that there is no way to crash or
confuse your indexer from user created files, file structures, etc. This
becomes a particularly serious issue if you want to have lots of
indexing "plugins" (for example, index the "metadata" from MP3s, AbiWord
documents, etc). Each of these plugins will need to meet that level of
security!

This can be alleviated:
* indexing plugins should be written in high-level, managed languages (Python?). Exceptions should be caught and the plugin aborted.
* communication among components should use XML. That way the parsers can throw exceptions and the communication can be aborted before any damage is done.

I know what your fears are, and I fully share them. Malicious users could inject malicious files. And if the indexing job were done in C, I'd be scared shitless. But not so with managed languages; something like 80% of security bugs (the buffer overflows and friends) can be slashed like that. After that, there's the issue of plugins relaying malicious data to the indexer, but if the communication is done in XML, malicious data would at worst trigger a parse exception in the indexer, and the indexer would mark the plugin as bad and keep on strolling.
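
In Python, that containment is a few lines (a sketch; the plugin name and payload are made up):

    # Sketch: the indexer treats plugin output as untrusted XML. A parse
    # error gets the plugin marked bad; indexing carries on regardless.
    import xml.etree.ElementTree as ET

    bad_plugins = set()

    def handle_plugin_output(plugin_name, payload):
        """Parse metadata sent by a plugin; quarantine it on bad input."""
        if plugin_name in bad_plugins:
            return None
        try:
            return ET.fromstring(payload)   # throws on malformed XML
        except ET.ParseError:
            bad_plugins.add(plugin_name)    # mark the plugin as bad...
            return None                     # ...and keep on strolling

    # A plugin relaying garbage simply gets disabled:
    handle_plugin_output("mp3-metadata", "<meta><artist>truncated...")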

3) While a user is logged in, it is highly desirable to index their data
much more frequently. This is easily accommodated with a user space
daemon but requires tricky (though not impossible) games with a system
daemon.

I fully agree that data should be indexed quickly. But why only for logged-in users? Why not for all of them? It's not that hard: files get modified, and a couple of seconds later the indexer reindexes them, all with the help of FAM and perhaps a separate application (a file-monitor queueing service, which could also be a systemwide service; no security risk in that, because it couldn't be polluted by malicious data). Key here is that the indexer runs at the lowest scheduling priority (nice +19), so there's no system performance impact.
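
That priority drop is a one-liner at indexer startup (sketch):

    # Sketch: push the indexer to the lowest scheduling priority so
    # it never competes with interactive work.
    import os

    os.nice(19)   # maximum niceness = minimum priority
    # ... enter the indexing loop ...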

4) You can't index as anything other than root because many interesting
user documents will not be world readable.

Exactly.

5) If you have a system index made as root, you need to implement a
search daemon that controls access to that information based on the
interested processes UID and the permissions relevant to each indexed
file. Also note that there can be discrepancies in security created in
between permission changes and re-indexes, which could possibly be a
concern on some systems.

Yes. We are counting on the need to implement access-control capabilities in the search daemon. Medusa already had that. As for the permission changes, that can be solved with FAM too: chmod on a file? Reindex the file's metadata, and presto.
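
For instance (a sketch; FAM reports a chmod as a change event on the path, and the in-memory dict stands in for the real index):

    # Sketch: keep the permission bits the search daemon filters on
    # fresh by re-stat()ing a path whenever FAM reports a change.
    import os, stat

    index_acl = {}   # path -> (uid, gid, mode) used for access control

    def on_change(path):
        st = os.stat(path)
        index_acl[path] = (st.st_uid, st.st_gid,
                           stat.S_IMODE(st.st_mode))  # fresh rwx bits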


The current planned Medusa approach, under consideration, is as follows:

 - Data in /home/username is frequently indexed by a user space daemon.
This is done while the user is logged in.
 - A system index is performed as "nobody", allowing searches for files
and information that everyone has read access to (such as man pages,
documentation, etc).

Except for corporate information that is visible only to members of group "management" (fictional setting). Then management can't mine that data.

 - GnomeVFS integration and incremental indexing mean that as soon as a
file is changed the user-space indexing daemon is notified and
re-indexes just that file.

 - User space indexing means it is easy to get information on whether
the mouse and keyboard are in use (something that *was* done with the
system medusa indexer too, but was more tricky) and "back off" to
provide a responsive system.

You don't need to monitor for user activity. Merely setting a very low priority makes for a responsive system; the Microsoft Indexing Service follows this approach.

 - Recently used documents (perhaps an extended version) allows the
medusa user-space indexing daemon to find new areas of the disk where
people keep files that the system indexer wasn't able to access. That
means that even if the files in /Music aren't readable by nobody, if you
access a file in /Music the user space medusa will find that directory
and start indexing it. (This is a touchy point, may not be good, may be,
hard to say)

* we don't want a hundred PCs each indexing the NFS server; we want the search service to delegate queries to NFS servers, so as to avoid network load and wasted disk space

Yes, very important. Medusa currently avoids indexing NFS mountpoints,
but doesn't do anything to solve the "searching nfs mounts" problem.
However, there's no reason medusa can't be extended to do this. It will
certainly be easier than starting from scratch.

* as there is no documentation, we don't know whether Medusa can index gigabytes of files, extract their data and metadata, and provide sub-10-second query responses. Our initial prospect for the database, PostgreSQL, can indeed provide sub-10-second responses, provided the proper indexes are applied to the proper tables.

It would be quite possible to port Medusa to using a database as a
backend, or using database backends as an alternate source of
information. (BTW, you might consider looking at SQLite for the local
index case).

PostgreSQL was our choice; it has full-text indexing. But Xapian is shaping up as an amazing contender. Sketches of both are below.
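
To make the "proper indexes" point concrete, here is what the PostgreSQL side could look like (a sketch: it assumes the tsvector/GIN full-text machinery of recent PostgreSQL releases, where older versions need the tsearch2 contrib module, and the documents table is invented):

    # Sketch: PostgreSQL full-text search via psycopg2. Table, column
    # and database names are invented for illustration.
    import psycopg2

    conn = psycopg2.connect("dbname=searchsvc")
    cur = conn.cursor()
    cur.execute("""
        CREATE TABLE IF NOT EXISTS documents (
            path     TEXT PRIMARY KEY,
            body_tsv tsvector
        );
        CREATE INDEX IF NOT EXISTS documents_fts
            ON documents USING gin (body_tsv);
    """)
    # The GIN index is the "proper index" that keeps queries over
    # gigabytes of text under the 10-second mark.
    cur.execute("""
        SELECT path, ts_rank(body_tsv, query) AS rank
          FROM documents, to_tsquery('english', 'contract & smith') query
         WHERE body_tsv @@ query
         ORDER BY rank DESC LIMIT 20;
    """)

And the Xapian equivalent, using its stock Python bindings; note that it covers the first two wish-list items above (relevance scores and stemming) out of the box:

    # Sketch: indexing and searching with Xapian. The database path
    # and sample text are arbitrary.
    import xapian

    db = xapian.WritableDatabase("/tmp/demo-index",
                                 xapian.DB_CREATE_OR_OPEN)
    doc = xapian.Document()
    doc.set_data("/home/user/contracts/smith.txt")
    tg = xapian.TermGenerator()
    tg.set_stemmer(xapian.Stem("english"))
    tg.set_document(doc)
    tg.index_text("This contract is between Smith and ...")
    db.add_document(doc)

    qp = xapian.QueryParser()
    qp.set_stemmer(xapian.Stem("english"))
    qp.set_stemming_strategy(xapian.QueryParser.STEM_SOME)
    enquire = xapian.Enquire(db)
    enquire.set_query(qp.parse_query("contracts smith"))  # stems match
    for match in enquire.get_mset(0, 10):
        print(match.percent, match.document.get_data())   # relevance %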

I'm assuming "enterprise-class" here is a euphemism for "networked".

Plus sellable for lots of bucks.

=) luck.



