Changes in snippets API



Joe already leaked the news of the changes in the handling of snippets, so 
a brief mail about them is in order. For the impatient: there were 
changes in how snippets are generated, how a snippet is requested and what 
the snippet response looks like. Backward compatibility is preserved by 
wrapper methods/properties, so all the existing C#/C/python 
clients should work as they are.

The Beagle API allows clients to request a snippet for a particular result of 
a query. In response, clients used to get an HTML string representing 
some occurrences of the query terms in the data; the occurrences were marked 
in bold and in different colours (using HTML tags). Since the snippet was 
HTML, it could be displayed directly if the underlying widget understood HTML; 
otherwise it had to be modified before display. It was easy to get the snippet 
back but hard to manipulate it, since it was all pre-formatted with HTML.

I made two major changes:
1) Added an option "FullText" to the snippet-request to get back the _entire_ 
(cached) text for that result, just like Google's page cache.
  The entire text is currently _not_ marked with the occurrences of the query 
terms for performance reasons. Any client can mark the occurrences roughly by 
scanning the text and marking the words which match any of the 
query.stemmed_text words.
2) The snippet is no longer returned as a single HTML string. Instead it is a 
list of SnippetLine objects. Each snippetline denotes a snippet containing one 
or more matches. The snippetline stores the line number (so that supporting 
applications can scroll the file to the right place) and a list of fragments 
forming a continuous sequence of text containing one or more matches. The 
fragments can be concatenated to get the matching text. Each fragment stores 
a text and a QueryTermIndex. If the fragment is part of the sentence before, 
after or between the matches, it has QueryTermIndex = -1; if the fragment is 
a match, its QueryTermIndex is the index of the matched stemmed_text in the 
query.stemmed_text array. If there are multiple but far-apart matches in a 
single line, there can be more than one snippetline for that line.
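
The rough client-side marking suggested in 1) could look something like the 
Python sketch below. This is only an illustration, not code from Beagle: the 
function name mark_full_text is made up, and treating a word as a match when 
it starts with a stemmed term is an assumption about what "roughly" means.

```python
import re

def mark_full_text(text, stemmed_terms):
    """Split text into (index, piece) pairs.

    index is the position of the matching term in stemmed_terms,
    or -1 for text that matches no term.
    """
    terms = [t.lower() for t in stemmed_terms]
    pieces = []
    pos = 0  # end of the last piece emitted so far
    for m in re.finditer(r"\w+", text):
        word = m.group().lower()
        # Assumption: a word matches when it begins with the stemmed form.
        idx = next((i for i, t in enumerate(terms)
                    if t and word.startswith(t)), -1)
        if idx == -1:
            continue
        if m.start() > pos:
            # Unmatched text between the previous match and this one.
            pieces.append((-1, text[pos:m.start()]))
        pieces.append((idx, m.group()))
        pos = m.end()
    if pos < len(text):
        pieces.append((-1, text[pos:]))
    return pieces
```

The result uses the same convention as the fragments in 2), so a client could 
feed both through the same formatting code.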

It looks confusing, but it contains all the information about snippets that 
any client can possibly need and, more importantly, no formatting. It is also 
easy to join the fragments and snippetlines with any kind of formatting to 
create the complete snippet.

For example, if the query was "good abcd 1234" and line 17 looked like: 
"abcd is a good desktop search application but it cannot return results for 
queries like 1234 which is pretty lame", then this would produce:

snippets = [snippetline1, snippetline2]
snippetline1 = (line=17, [fragment1, fragment2, fragment3, fragment4])
fragment1=(index=1 , text="abcd")
fragment2=(index=-1, text=" is a ")
fragment3=(index=0,  text="good")
fragment4=(index=-1, text=" desktop search application")
snippetline2 = (line=17, [fragment5, fragment6, fragment7])
fragment5=(index=-1, text="for queries like ")
fragment6=(index=2,  text="1234")
fragment7=(index=-1, text=" which is pretty")
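
To show how a client might consume this, here is a small Python sketch that 
models the example above and joins it with its own formatting. The Python 
names (Fragment, query_term_index, render) are hypothetical and are not the 
names used by BeagleClient or libbeagle; only the shape of the data follows 
the description in this mail.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Fragment:
    query_term_index: int  # -1 for context text, else index into query.stemmed_text
    text: str

@dataclass
class SnippetLine:
    line: int
    fragments: List[Fragment]

def render(snippet_lines, mark=lambda f: "[" + f.text + "]", separator=" ... "):
    """Join fragments and snippetlines, applying `mark` to matches only."""
    parts = []
    for sl in snippet_lines:
        parts.append("".join(
            mark(f) if f.query_term_index >= 0 else f.text
            for f in sl.fragments))
    return separator.join(parts)

# The example from this mail: query "good abcd 1234", matches on line 17.
snippets = [
    SnippetLine(17, [Fragment(1, "abcd"), Fragment(-1, " is a "),
                     Fragment(0, "good"),
                     Fragment(-1, " desktop search application")]),
    SnippetLine(17, [Fragment(-1, "for queries like "), Fragment(2, "1234"),
                     Fragment(-1, " which is pretty")]),
]

print(render(snippets))
```

Passing a different `mark` function (Pango markup, terminal colours, 
whatever the widget understands) changes the formatting without touching 
the snippet data.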

If FullText is set, a single snippetline is returned with line=1 and a single 
fragment containing the entire text.

The earlier SnippetResponse.Snippet / 
beagle_snippet_response_get_snippet() accessors still return an HTML coloured 
string just like before: "<font XXX><b>abcd</b></font> is a <font 
XXX><b>good</b></font> desktop search application ... for queries like <font 
XXX><b>1234</b></font> which is pretty..."
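
A client that wants HTML in its own style can build a similar string from the 
structured snippet. The Python sketch below is not Beagle's wrapper code: the 
function name to_html is made up, the colours in PALETTE are placeholders (the 
real font attributes are not given here), and snippetlines are passed as plain 
(line, fragments) tuples to keep it short.

```python
from html import escape

# Hypothetical colours, one per query term; not the ones Beagle uses.
PALETTE = ["red", "blue", "green"]

def to_html(snippet_lines, separator=" ... "):
    """Render [(line, [(index, text), ...]), ...] as a coloured HTML string."""
    rendered = []
    for _line, fragments in snippet_lines:
        out = []
        for index, text in fragments:
            safe = escape(text)  # the raw text may contain <, > or &
            if index >= 0:
                colour = PALETTE[index % len(PALETTE)]
                out.append(f'<font color="{colour}"><b>{safe}</b></font>')
            else:
                out.append(safe)
        rendered.append("".join(out))
    return separator.join(rendered)
```

Escaping matters here because the fragments carry raw text from the 
textcache rather than pre-formatted HTML.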

Not done yet:
- The C# BeagleClient API allows clients to access the detailed structure of 
the snippet i.e. access to the snippetlines and fragments. But the 
libbeagle/python bindings weren't updated to open up the detailed structure 
in the API. So, as of now libbeagle/pybeagle clients can get only the 
coloured HTML string as snippet.
- beagle-search still contains the hack where it changes the HTML snippet 
directly to change its formatting.
- Somehow integrate "Show cached text" into the UI.
 Any patch is welcome.

- dBera

PS1: There was another change where the snippets are now generated on the fly 
from the textcache file right before being shipped to the client.
PS2: TextCache, the major source of the snippets, is also undergoing a lot of 
changes. More on that later.

-- 
-----------------------------------------------------
Debajyoti Bera @ http://dtecht.blogspot.com
beagle / KDE fan
Mandriva / Inspiron-1100 user


