Re: GSoC Weekly Report



A quick follow-up. Some reading here:

http://www.sqlite.org/datatype3.html

provides some insight into exactly how sqlite3 stores values. I'm not
completely sure that such a loose typing system will greatly benefit
us when working with TEXT/STRING types. However, the gzipped blobs
might still use less disk space thanks to being stored in a single
file rather than in a bunch of separate files. In addition, I know
that incremental I/O is a possibility with blobs as of sqlite 3.4,
which could potentially be used to optimize work like this.
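
To make the comparison concrete, here is a minimal sketch in Python
(chosen only for brevity; it is not the actual TextCache code). The
table name `textcache`, the column names, and the helper functions are
all hypothetical. It stores the same rows once as raw text and once as
gzipped blobs, then reports the on-disk size, the storage class sqlite
actually used, and which indexes exist on the table.

```python
import gzip
import os
import sqlite3
import tempfile


def build_cache(path, rows, compress):
    """Write (uri, text) rows to a new db at `path`; return its size in bytes."""
    conn = sqlite3.connect(path)
    # Hypothetical schema: `data` has no declared type, so sqlite keeps
    # whatever storage class each inserted value already has.
    conn.execute("CREATE TABLE textcache (uri TEXT PRIMARY KEY, data)")
    for uri, text in rows:
        # str is stored as TEXT, bytes (the gzipped form) as BLOB.
        payload = gzip.compress(text.encode("utf-8")) if compress else text
        conn.execute("INSERT INTO textcache VALUES (?, ?)", (uri, payload))
    conn.commit()
    conn.close()
    return os.path.getsize(path)


def storage_classes(path):
    """Return the set of storage classes used by the `data` column."""
    conn = sqlite3.connect(path)
    found = {row[0] for row in conn.execute("SELECT typeof(data) FROM textcache")}
    conn.close()
    return found


def index_names(path):
    """Return the names of all indexes sqlite has recorded for the table."""
    conn = sqlite3.connect(path)
    query = "SELECT name FROM sqlite_master WHERE type = 'index' AND tbl_name = 'textcache'"
    names = [row[0] for row in conn.execute(query)]
    conn.close()
    return names


def read_text(path, uri, compressed):
    """Fetch one row and return its text, gunzipping it if needed."""
    conn = sqlite3.connect(path)
    row = conn.execute("SELECT data FROM textcache WHERE uri = ?", (uri,)).fetchone()
    conn.close()
    if row is None:
        return None
    return gzip.decompress(row[0]).decode("utf-8") if compressed else row[0]


if __name__ == "__main__":
    # Made-up sample data; real documents will compress far less well.
    rows = [("file:///doc%d" % i, "some indexed text " * 500) for i in range(200)]
    with tempfile.TemporaryDirectory() as tmp:
        raw_db = os.path.join(tmp, "raw.db")
        gz_db = os.path.join(tmp, "gz.db")
        print("raw bytes:    ", build_cache(raw_db, rows, compress=False))
        print("gzipped bytes:", build_cache(gz_db, rows, compress=True))
        print("raw classes:  ", storage_classes(raw_db))
        print("gz classes:   ", storage_classes(gz_db))
        print("indexes:      ", index_names(raw_db))
```

Running the same script against a sample of real cached text, and at a
few different row counts, would show both how the size scales and
whether anything other than the uri key is being indexed.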

Anyway, please send a patch to the list if that's not too much to ask,
or just give us an update on how things are going.

Cheers,
Kevin Kubasik

On 10/1/07, Kevin Kubasik <kevin kubasik net> wrote:
> On 8/19/07, Arun Raghavan <arunissatan gmail com> wrote:
> > Hello All,
> > This week I've been working on the new TextCache implementation that
> > I'd mentioned the last time (replacing the bunch of files with an
> > Sqlite db).
> >
> > Making an Sqlite db with just the uri and raw text caused an almost 3x
> > increase in the text cache size (3.6 MB (on-disk) vs. almost 15MB in
> > my test case). This despite the fact that the size of the raw text was
> > only 7.9 MB. I need to figure out why this happens. In the meantime,
> > I also implemented another version of this which stores (uri, gzipped
> > text) pairs in the Sqlite db instead of (uri, raw text). Surprisingly,
> > this actually seems to work very well (the db for the test case
> > mentioned shrunk down to 2.6 MB, which is just a little more than the
> > actual size of the compressed data itself).
> My first impression on this is that Sqlite is probably building an
> index for the raw text data, whereas the compressed data is simply
> treated as a binary 'blob'. I'm not 100% sure of the table definitions
> that you're using, or exactly how much (in terms of indexes) sqlite
> does automatically, but that seems like the most likely culprit. As we
> already have our own system for searching text ;) if you could find a
> way to force sqlite to not index the table's raw text column, you
> could probably get more sane numbers regarding the database size.
> However, it's possible that this is just how sqlite handles text
> content, and that the gzipped text is the best way to go. The other
> thing to test is how this is handled in far larger situations. Is it
> possible that the first 1000 rows are very expensive, but when we
> scale to 50000 rows, we see only a minute increase in size?
>
> >
> > Performance numbers on a search which returns 1205 results are below.
> > I basically ran the measurements twice -- once after flushing the
> > inode, dentry and page cache, and another time taking advantage of the
> > disk caches.
> >
> > Current TextCache:
> > no-disk-cache: ~1m
> > with-disk-cache: ~9s
> >
> > New TextCache (raw and gzipped versions had similar numbers):
> > no-disk-cache: ~42s
> > with-disk-cache: ~10s
> >
>
> Very cool and interesting. One of the important cases to test here is
> multiple successive queries. Think of deskbar offering completions as
> a user types: how does such a system fare when it gets 15 or 20
> queries back to back? Does the compression difference factor in then?
>
> > One very important factor remains to be seen -- memory usage. I am
> > working on figuring out what the impact of the new code on memory
> > usage is. Numbers should be available soon.
> >
> > On the Xesam front, I will be updating the code tomorrow or the day
> > after to reflect the latest changes to the spec.
>
> I know the Google SoC is over, and it's completely OK if you're too
> busy to complete these tests, but it would be awesome if you could
> provide a patch to the list, not only so we can see exactly what you
> were doing, but also so that someone else might finish up your work
> and/or get it merged in and ready for 0.3.0.
>
>
> > --
> > Arun Raghavan
>
>
> --
> Cheers,
> Kevin Kubasik
> http://kubasik.net/blog
>


-- 
Cheers,
Kevin Kubasik
http://kubasik.net/blog


