Changes in snippets API
- From: Debajyoti Bera <dbera web gmail com>
- To: Beagle <dashboard-hackers gnome org>
- Subject: Changes in snippets API
- Date: Mon, 20 Aug 2007 23:04:00 -0400
Joe already leaked the news of the changes in the handling of snippets, so
a brief mail about the changes is in order. For the impatient: there were
changes in how snippets are generated, how a snippet is requested and what
the response snippet looks like. Backward compatibility has been ensured by
providing wrapper methods/properties, so all the existing C#/C/python
clients should work as they are.
The Beagle API allows clients to request a snippet for a particular result of
some query. In response, the clients used to get an HTML string representing
some occurrences of the query terms in the data; the occurrences were marked
as bold and in different colours (using HTML tags). Since the snippet was in
HTML, it could be displayed directly if the underlying widget understood HTML,
or it had to be modified to display correctly. It was easy to get back the
snippet but hard to manipulate it since it was all pre-formatted with HTML.
I made two major changes:
1) Added an option "FullText" to the snippet-request to get back the _entire_
(cached) text for that result, just like Google's page cache.
The entire text is currently _not_ marked with the occurrences of the query
terms, for performance reasons. Any client can mark the occurrences roughly by
scanning the text and marking the words which match any of the
query.stemmed_text words.
2) The snippet is no longer returned as a single HTML string. Instead it is a
list of SnippetLine. Each snippetline denotes a snippet containing one or more
matches. The snippetline stores the line number (so that supported
applications can scroll the file to the right places) and a list of fragments
giving a continuous sequence of text containing one or more matches. The
fragments can be concatenated together to get the matching text. Each
fragment stores a text and a QueryTermIndex index. If the fragment is part of
the sentence before, after or between the matches, it has QueryTermIndex=-1;
if the fragment is a match, it has QueryTermIndex = the index of the matched
stemmed_text in the query.stemmed_text array. If there are multiple but far
apart matches in a single line, there could be more than one snippetline.
It is confusing but contains all the information about snippets that any
client can possibly need. And more importantly, no formatting. It is also easy
to join the fragments and snippetlines with any kind of formatting to create
the complete snippet.
For example, if the query was "good abcd 1234" and line 17 looked like:
"abcd is a good desktop search application but it cannot return results for
queries like 1234 which is pretty lame", then the result would be:
snippets = [snippetline1, snippetline2]
snippetline1 = (line=17, [fragment1, fragment2, fragment3, fragment4])
fragment1=(index=1 , text="abcd")
fragment2=(index=-1, text=" is a ")
fragment3=(index=0, text="good")
fragment4=(index=-1, text=" desktop search application")
snippetline2 = (line=17, [fragment5, fragment6, fragment7])
fragment5=(index=-1, text="for queries like ")
fragment6=(index=2, text="1234")
fragment7=(index=-1, text=" which is pretty")
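
To make the structure concrete, here is a rough Python sketch that rebuilds the
example above and joins it with a formatting of the caller's choice. The
Fragment/SnippetLine classes, the render() helper and the " ... " separator
are made up for illustration; they are not the BeagleClient or pybeagle API.

```python
from dataclasses import dataclass
from typing import Callable, List

NO_MATCH = -1  # index carried by text before, after or between matches


@dataclass
class Fragment:
    index: int  # position in query.stemmed_text, or NO_MATCH
    text: str


@dataclass
class SnippetLine:
    line: int
    fragments: List[Fragment]


def render(snippets: List[SnippetLine],
           highlight: Callable[[Fragment], str],
           separator: str = " ... ") -> str:
    """Join fragments and snippetlines, formatting only the matched fragments."""
    parts = []
    for snippetline in snippets:
        parts.append("".join(
            f.text if f.index == NO_MATCH else highlight(f)
            for f in snippetline.fragments))
    return separator.join(parts)


# The example from this mail: query "good abcd 1234", matches on line 17.
snippets = [
    SnippetLine(17, [Fragment(1, "abcd"),
                     Fragment(-1, " is a "),
                     Fragment(0, "good"),
                     Fragment(-1, " desktop search application")]),
    SnippetLine(17, [Fragment(-1, "for queries like "),
                     Fragment(2, "1234"),
                     Fragment(-1, " which is pretty")]),
]

# Plain-text rendering: matched terms in square brackets.
plain = render(snippets, lambda f: "[" + f.text + "]")
print(plain)
```

Passing a different highlight function (bold tags, terminal colours, one
colour per index) changes the formatting without touching the snippet data.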
If FullText is set, a single snippetline is returned with line=1 and a single
fragment containing the entire text.
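
Since the full text comes back unmarked, a client has to find the query terms
itself. A minimal sketch of such a rough scan follows, assuming that a word
counts as a match when it starts with one of the stemmed terms. That prefix
test is only a stand-in for real stemming, and mark_full_text() and its
parameters are hypothetical names, not part of any Beagle binding.

```python
import re
from typing import List, Tuple


def mark_full_text(text: str, stemmed_terms: List[str]) -> List[Tuple[int, str]]:
    """Split text into (index, text) pairs; index is -1 for unmatched text."""
    terms = [t.lower() for t in stemmed_terms]
    pieces = []
    pos = 0  # end of the last piece emitted so far
    for m in re.finditer(r"\w+", text):
        word = m.group().lower()
        # Rough match (assumption): the word starts with a stemmed term.
        hit = next((i for i, t in enumerate(terms)
                    if t and word.startswith(t)), -1)
        if hit == -1:
            continue
        if m.start() > pos:
            pieces.append((-1, text[pos:m.start()]))
        pieces.append((hit, m.group()))
        pos = m.end()
    if pos < len(text):
        pieces.append((-1, text[pos:]))
    return pieces


full_text = "Searching is good. Search again."
pieces = mark_full_text(full_text, ["good", "search"])
for index, piece in pieces:
    print(index, repr(piece))
```

The pairs use the same convention as the snippet fragments (-1 for plain text,
otherwise the index of the matched term), so the same display code could be
reused for both.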
The earlier SnippetResponse.Snippet/
beagle_snippet_response_get_snippet() methods still return an HTML coloured
string just like before: "<font XXX><b>abcd</b></font> is a <font
XXX><b>good</b></font> desktop search application ... for queries like <font
XXX><b>1234</b></font> which is pretty..."
Not done yet:
- The C# BeagleClient API allows clients to access the detailed structure of
the snippet i.e. access to the snippetlines and fragments. But the
libbeagle/python bindings weren't updated to open up the detailed structure
in the API. So, as of now libbeagle/pybeagle clients can get only the
coloured HTML string as snippet.
- beagle-search still contains the hack where it changes the HTML snippet
directly to change its formatting.
- Somehow integrate "Show cached text" into the UI.
Any patch is welcome.
- dBera
PS1: There was another change where the snippets are now generated on the fly
from the textcache file right before being shipped to the client.
PS2: TextCache, the major source of the snippets, is also undergoing a lot of
changes. More on that later.
--
-----------------------------------------------------
Debajyoti Bera @ http://dtecht.blogspot.com
beagle / KDE fan
Mandriva / Inspiron-1100 user