Re: Desktop Crawler's feature comparison
- From: "D Bera" <dbera web gmail com>
- To: "John Smith" <the real monkey d luffy gmail com>
- Cc: dashboard-hackers gnome org
- Subject: Re: Desktop Crawler's feature comparison
- Date: Thu, 6 Mar 2008 08:26:00 -0500
Hi,
> Jindex. Which would then be added to wikipedia.
Nice effort. This will be very useful.
> I would ask your help to tell me what features are implemented in your
> tool (1) or are foreseen in the future... these are just a couple of
> Yes or No questions, so it's brief.
Answers inline. I have added comments wherever applicable for the
readers of this mailing list. For others, some of the features that
are not implemented are left out due to lack of demand and could
easily be added if required.
For the general search syntax question, I am assuming you are
referring to fulltext searches. Certain metadata (e.g. the camera
model) is stored as a full string and can only be searched by using
the full name of the model. These are called "keyword searches".
> PS: I'm aware that the data crawler uses different backends for the
> different file types, in that case, please refer the backend when
> appropriate. For example, "PDF indexing capabilities is limited by
> xpdf. It does not recognize words with hyphens."
Beagle uses the word "backend" to mean a different thing. So I will
instead use "external app" for this. If I use "backend" below, it will
denote the data sources (e.g. user files, manpages, evolution emails,
kaddressbook addresses, opera browsing history etc.)
> 01) Regular expressions (e.g.: com*on st?ff [A-F] (this | that))
Partially.
Only wildcard (*) query terms are supported for full text searches. OR
('|') is supported partially.
Some of the other simple ones could be added if needed, but not
planned for. Full regex search might be very hard or impossible.
> 02) Boolean operators (+and -not)
Yes.
AND is the default so "+" is not needed and ignored.
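
To illustrate the behaviour described above, here is a minimal sketch
(not Beagle's actual query parser). The function names and the
whitespace-based splitting are assumptions made for the example; the
"-" prefix for NOT is taken from the question.

```python
def parse_query(query):
    """Split a query into required and excluded terms.

    AND is the default, so a leading "+" is simply dropped.
    A leading "-" marks a term that must NOT appear.
    """
    required, excluded = [], []
    for term in query.split():
        if term.startswith("-") and len(term) > 1:
            excluded.append(term[1:].lower())
        else:
            cleaned = term.lstrip("+").lower()
            if cleaned:
                required.append(cleaned)
    return required, excluded


def matches(query, words):
    """True if all required terms are present and no excluded term is."""
    required, excluded = parse_query(query)
    present = {w.lower() for w in words}
    return (all(t in present for t in required)
            and not any(t in present for t in excluded))
```

With this sketch, "+desktop search" and "desktop search" behave the
same way, while "desktop -windows" rejects any document containing
"windows".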
> 03) Searching non-alphanumeric characters, maybe through the use of
> backslash (e.g. := + ? { ] &)
No.
If you mean just searching for "&" or "{" then no. However "a+b" can
be searched and will return matching files with "a+b" in them (but
will also return files with "a-b", i.e. the non-alphanumeric character
is not matched).
> 04) Exact sentences using double quotes (support for line breaks?
> hyphenization? text in columns?)
Exact sentences can be matched using double quotes. Hyphenated words
are split at the hyphen. I don't understand the "text in columns" part.
Just to make it clear, non-alphanumeric characters are generally
dropped but their position is remembered. So "abcd\n1234-5678" will
match "abcd?1234?5678" where "?" denotes any single character.
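
A rough sketch of this kind of matching, for illustration only (this is
not Beagle's tokenizer; the function names are made up for the example).
It drops non-alphanumeric characters and compares the remaining tokens
in order. Note one simplification: it accepts any run of separator
characters between tokens, not exactly one.

```python
import re


def tokenize(text):
    """Return lowercase alphanumeric tokens; separators are dropped."""
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]


def phrase_matches(phrase, document):
    """True if the phrase's tokens occur consecutively in the document."""
    needle = tokenize(phrase)
    haystack = tokenize(document)
    if not needle:
        return False
    n = len(needle)
    return any(haystack[i:i + n] == needle
               for i in range(len(haystack) - n + 1))
```

Under these assumptions, the phrase "abcd\n1234-5678" matches a
document containing "abcd 1234 5678", and "a+b" matches "a-b", because
only the tokens and their order are compared.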
> 05) tex, pdf and ps (index sentences correctly even when text is
> organized in columns or uses hyphens; this is common in scientific
> articles using the pdf format)
tex - yes. pdf - yes (using xpdf). ps - not out of the box, but any
ps-to-text external app can be used.
I don't understand the "columns" part again. I am not sure about the
"hyphens": as you say above, xpdf does not play well with hyphens and
probably hyphens in tex files are replaced by "?".
About searching ps files, please note that ps is a hardware printing
format, i.e. it does not know what a word, a sentence or a paragraph
is. The only real way to get text out of a ps file is to print it
(that is exactly how the various ps-to-text apps work).
> 06) Different encoding and languages (ascii, utf8, japanese, etc)
Planned.
Currently, it's utf8 by default if the encoding is not specified for
the file (some files e.g. html files can specify the encoding in their
metadata).
> 07) Index archive files (tar, bz2, rar, 7zp, etc) recursively
Yes for tar, bz2, zip and gzip. No for rar and 7zp, although we index
7zip compressed manpages.
> 08) Index simultaneously with and without stemming (for example,
> flooring, floors, floored would all be transformed to floor)
Full text searches are always stemmed. Keyword searches are never stemmed.
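
To make the distinction concrete, here is a toy sketch. The suffix
stripper below is a deliberately naive stand-in and not the stemmer
Beagle uses; the function names, the case-sensitive keyword comparison
and the camera model string are all assumptions for illustration.

```python
def toy_stem(word):
    """Naive stemmer: strip one common English suffix (illustration only)."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def fulltext_match(query_word, text):
    """Full text search: compare stems, so word variants match each other."""
    target = toy_stem(query_word)
    return any(toy_stem(w) == target for w in text.split())


def keyword_match(query, stored_value):
    """Keyword search: the whole stored string must match, no stemming."""
    return query == stored_value
```

With this sketch, a full text query for "floors" finds a document
containing "floored", while a keyword query for "ExampleCam" does not
match a stored value of "ExampleCam 100"; only the full string does.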
> 09) Use of tags to better organize data (allows the user to have
> collections)
No native tag support.
Actually, tags can be set natively but not exposed (not meant for end-users).
> 10) Restrict search to specific directories or tags
No native tagging, so no native tag searches. Can read and search
external tags (f-spot, digikam tags).
Directory-specific search is partial. Searching by specifying a
directory will only search in that directory (or directories, if the
name matches multiple actual locations) but not recursively in its
subdirectories.
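
The effect can be sketched as a filter on the immediate parent
directory name. This is only an illustration of the described
behaviour, not how Beagle implements it; the function name and the
example paths are hypothetical.

```python
from pathlib import PurePosixPath


def files_in_directory(indexed_paths, dirname):
    """Keep files whose immediate parent directory is named `dirname`.

    Every directory with that name qualifies, but files in
    subdirectories of it are not included (no recursion).
    """
    return [p for p in indexed_paths
            if PurePosixPath(p).parent.name == dirname]


paths = [
    "/home/a/docs/x.txt",      # directly inside a "docs" directory
    "/home/a/docs/sub/y.txt",  # in a subdirectory: not returned
    "/srv/docs/z.txt",         # another "docs" directory: returned
]
print(files_in_directory(paths, "docs"))
```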
> 11) Provide thumbnails for images and video (allow specifying number
> of thumbnails for video and time interval between thumbs)
No.
The search service does not generate thumbnails. The search GUIs use
the thumbnailers of the respective DEs (e.g. beagle-search uses the
GNOME thumbnailer, kerry uses KDE thumbnail API).
> 12) Image and video content search (something like imgseek... maybe
> better or maybe it could use it as backend)
Sorry, what is imgseek? Something like OCR but for more things than
mere text in an image? In that case, no.
> 13) Index removable media (making possible to index and organize data
> in dvds or external hard drives)
Partial.
Can be done with some work, better support planned.
> 14) Databases supported
What do you mean - indexing text in databases or using different
databases for storing the indexed data ?
> 15) Allow having different databases catalogs (useful for searching
> collection of external devices)
Related to (13). Partial, as I said. Different catalogs are and will
be supported.
> 16) Checksum (allows finding duplicate files)
No.
We don't want to read a large 4GB video file when we can find its
metadata in just a small part at the beginning or the end.
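
To show why checksumming is costly, here is a generic sketch of
checksum-based duplicate detection. It is not part of Beagle; the
function names and the choice of SHA-256 are assumptions for the
example. Note that every byte of every file has to be read.

```python
import hashlib
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Hash a whole file in chunks; the entire content must be read."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(paths):
    """Group paths by content digest; return groups with more than one file."""
    groups = defaultdict(list)
    for path in paths:
        groups[file_digest(path)].append(path)
    return [group for group in groups.values() if len(group) > 1]
```

A metadata filter, by contrast, can stop after reading the header or
trailer of the file.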
> 17) Other aspects worthy of mention
Beagle (and possibly other crawlers too) is more than a file system
crawler and cataloging system. Besides indexing user files, beagle
likes to extract and index data from other possible data sources like
emails, browsing history, notes, contacts, scheduler (and more). See
http://beagle-project.org/Supported_Filetypes for the different data
sources. Beagle's architecture is extensible and comes with searching
and indexing APIs and several tools, which means some of the features
mentioned above can be obtained by using them (e.g.
http://beagle-project.org/ExternalFiltersRepository can be used to
index any file if there is any external app that can dump its text
content). Besides text content, beagle (and other crawlers too) indexes
and searches the various metadata, e.g. author, keywords, comments,
image resolution, EXIF metadata, ID3 tags etc.
I can't think of anything else relevant to your questions right now.
- dBera
--
-----------------------------------------------------
Debajyoti Bera @ http://dtecht.blogspot.com
beagle / KDE fan
Mandriva / Inspiron-1100 user