Re: GSoC Weekly Report



On 8/19/07, Arun Raghavan <arunissatan gmail com> wrote:
> Hello All,
> This week I've been working on the new TextCache implementation that
> I'd mentioned the last time (replacing the bunch of files with an
> Sqlite db).
>
> Making an Sqlite db with just the uri and raw text caused an almost 3x
> increase in the text cache size (3.6 MB (on-disk) vs. almost 15MB in
> my test case). This despite the fact that the size of the raw text was
> only 7.9 MB. I need to figure out why this happens. In the mean time,
> I also implemented another version of this which stores (uri, gzipped
> text) pairs in the Sqlite db instead of (uri, raw text). Surprisingly,
> this actually seems to work very well (the db for the test case
> mentioned shrunk down to 2.6 MB, which is just a little more than the
> actual size of the compressed data itself).
My first impression is that Sqlite is probably building an
index for the raw text data, whereas the compressed data is simply
treated as a binary 'blob'. I'm not 100% sure of the table definitions
that you're using, or exactly how much (in terms of indexes) Sqlite does
automatically, but that seems like the most likely culprit. As we
already have our own system for searching text ;) if you could find a
way to force Sqlite not to index the table's raw text column, you
could probably get saner numbers for the database size.
However, it's possible that this is just how Sqlite handles text
content, and that gzipped text is the best way to go. The other thing
to test is how this behaves with far larger data sets. Is it possible
that the first 1000 rows are very expensive, but when we scale to
50000 rows, we see only a minute increase in size?
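
A rough way to check both the index question and the scaling question
outside Beagle is sketched below. It is Python rather than the actual
TextCache code, and the table name, column names and sample data are
all made up for illustration. It builds one db with raw text and one
with gzipped text, reports the on-disk size of each, and lists the
indexes Sqlite has recorded for the table in sqlite_master.

```python
import gzip
import os
import sqlite3
import tempfile


def build_cache(path, rows, compress):
    """Create a hypothetical (uri, data) cache; return (file size, index names)."""
    conn = sqlite3.connect(path)
    # Assumed schema: no PRIMARY KEY or UNIQUE constraint is declared here.
    conn.execute("CREATE TABLE textcache (uri TEXT, data BLOB)")
    for uri, text in rows:
        # Raw case stores a text value; gzipped case stores a binary blob.
        payload = gzip.compress(text.encode("utf-8")) if compress else text
        conn.execute("INSERT INTO textcache VALUES (?, ?)", (uri, payload))
    conn.commit()
    # Ask Sqlite which indexes exist for the table.
    indexes = [
        name
        for (name,) in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'index' AND tbl_name = 'textcache'"
        )
    ]
    conn.close()
    return os.path.getsize(path), indexes


def compare(rows):
    """Build raw and gzipped caches from the same rows and report both."""
    with tempfile.TemporaryDirectory() as tmp:
        raw_size, raw_idx = build_cache(os.path.join(tmp, "raw.db"), rows, False)
        gz_size, gz_idx = build_cache(os.path.join(tmp, "gz.db"), rows, True)
    return {"raw": (raw_size, raw_idx), "gzip": (gz_size, gz_idx)}


if __name__ == "__main__":
    # Repeat with growing row counts to see how the size scales.
    for count in (1000, 50000):
        rows = [("file:///doc%d" % i, "sample text " * 50) for i in range(count)]
        print(count, compare(rows))
```

If the index list comes back empty for the raw db as well, the extra
size has to come from somewhere other than an index on the text column.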

>
> Performance numbers on a search which returns 1205 results are below.
> I basically ran the measurements twice -- once after flushing the
> inode, dentry and page cache, and another time taking advantage of the
> disk caches.
>
> Current TextCache:
> no-disk-cache: ~1m
> with-disk-cache: ~9s
>
> New TextCache (raw and gzipped versions had similar numbers):
> no-disk-cache: ~42s
> with-disk-cache: ~10s
>

Very cool/interesting. One of the important cases to test here is
multiple successive queries. Think of something like deskbar issuing
completions as a user types: how does such a system fare when it gets
15 or 20 queries back to back? Does the compression difference factor
in then?
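
The back-to-back case could be timed with something like the sketch
below. Again this is Python with an invented schema and invented helper
names (make_cache, run_burst), not the Beagle code, and an in-memory db
will not show any disk-cache effects. It only isolates the cost of
decompressing on every lookup.

```python
import gzip
import sqlite3
import time


def make_cache(rows, compressed):
    """Build a throwaway in-memory cache; the schema is an assumption."""
    conn = sqlite3.connect(":memory:")
    # PRIMARY KEY on uri gives Sqlite an index to use for the lookups below.
    conn.execute("CREATE TABLE textcache (uri TEXT PRIMARY KEY, data BLOB)")
    for uri, text in rows:
        payload = gzip.compress(text.encode("utf-8")) if compressed else text
        conn.execute("INSERT INTO textcache VALUES (?, ?)", (uri, payload))
    conn.commit()
    return conn


def run_burst(conn, uris, compressed):
    """Fetch the text for each uri back to back; return (seconds, chars read)."""
    start = time.perf_counter()
    chars = 0
    for uri in uris:
        row = conn.execute(
            "SELECT data FROM textcache WHERE uri = ?", (uri,)
        ).fetchone()
        if row is None:
            continue
        text = gzip.decompress(row[0]).decode("utf-8") if compressed else row[0]
        chars += len(text)
    return time.perf_counter() - start, chars


if __name__ == "__main__":
    rows = [("file:///doc%d" % i, "some cached document text " * 100)
            for i in range(1000)]
    burst = [uri for uri, _ in rows[:20]]  # 20 queries in a row
    for compressed in (False, True):
        conn = make_cache(rows, compressed)
        seconds, chars = run_burst(conn, burst, compressed)
        print("gzip" if compressed else "raw", seconds, chars)
        conn.close()
```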

> One very important factor remains to be seen -- memory usage. I am
> working on figuring out what the impact of the new code on memory
> usage is. Numbers should be available soon.
>
> On the Xesam front, I will be updating the code tomorrow,day-after to
> reflect the latest changes to the spec.

I know the Google SoC is over, and it's completely OK if you're too
busy to complete these tests, but it would be awesome if you could
provide a patch to the list, so that we can see exactly what you were
doing and so that someone else might finish up your work and/or get
it merged in and ready for 0.3.0.


> --
> Arun Raghavan


-- 
Cheers,
Kevin Kubasik
http://kubasik.net/blog


