Re: Proposal for bookmarks/history database



On Tue, 2005-11-15 at 09:50 +1100, Peter Harvey wrote:
> Hi all,
> 
> I think people are generally getting tired of EphyNode and some of it's
> issues. In many ways the basic *idea* is nice, but it's just become too
> complicated. A proposal to replace the bookmarks/history databases
> (which are EphyNode collections) with SQLite was floated. I'm going to
> take a shot at defining the functions of an SQLite-backed database for
> history/bookmarks.

Sounds like a lot of fun!

> Second, memory usage can be a pain. I believe we currently keep in a
> copy of all the information for each bookmark, and *then* make a copy
> for each widget which uses that information. This is a GTK+ thing, and
> unavoidable.

So using SQLite doesn't make memory usage any worse. In fact, since it's
disk-based, we're essentially determining an upper bound on memory usage
at compile-time (not counting GTK's memory usage). For large databases,
memory usage would be better than with EphyNode. (The break-even point
might be at several thousand nodes, depending on our choice of cache
size, though.)

> Database layout (warning: I am not a database person).
> ------------------------------------------------------

(I am)

> Each 'node' of information is identified by a unique integer. We store
> the statement "node X belongs to a set Y" as "node X is a child of node
> Y".
> 
> table 'set'
>   int parent      # index on this field
>   int child
> 
> table 'pages'
>   int id          # primary key
>   string url      # url
>   string title    # title from last visit
>   string icon     # icon from last visit
>   string btitle   # NULL if not bookmarked otherwise it's bookmark title
>   string bicon    # NULL if should default to icon
>   date visit      # date of visit
> 
> table 'sites'
>   int id          # primary key
>   string name     # things like 'slashdot.org' or 'abc.com'
> 
> table 'topics'
>   int id          # primary key
>   string title    # title of the topic

You want 'sites' and 'topics' to share a primary key: this means they
could share the same table, which would provide more extensibility.

table 'groupings'
  int id            # primary key
  short int group_type_id # type of grouping: enum for 'sites'/'topics'
  int name          # topic or site name

(To me, this type of group_type_id is reminiscent of EphyNode's database
IDs.)

If we could make group_type_id more abstract, we might even be able to
let extensions define arbitrary group types. Assigning IDs could be
tricky, though (just as it is now with EphyNode DB IDs). This may not be
necessary in a preliminary API, but it's a problem we've already had
with EphyNode, so it's something to keep in mind.

Somebody mentioned metadata. Maybe we should take a page out of
Storage's book? There is an effective design pattern in Storage's source
code on GNOME CVS, but I'd definitely leave that out of any initial
implementation. (It would be added in separate tables keyed by the same
ID as in 'pages'.) The ephy_db_query_* API would need to be modified
(maybe using varargs, like g_object_get()) if we want to go down that
route.

I'm not sure I like the "pages.b*" fields (they're mostly NULL, which is
icky), but I can't think of a better method right now.

> Basic query mechanisms
> ----------------------

> An example query for ephy_db_query_sites is:
>   SELECT <fields from data bitmask> FROM
>      (SELECT DISTINCT child FROM set WHERE parent=<parent>) set,
>      pages
>   WHERE pages.id = set.child
>   GROUP BY <fields from group bitmask>;

1. You're missing ORDER BY -- one of the key advantages of using SQLite
is that it can help us order stuff very quickly.
2. By my understanding, that DISTINCT is superfluous. Am I missing
something?
3. That's a rather wonky syntax (though it may be faster, I don't know).
A more standard one would avoid subselects:

SELECT <fields from data>
FROM pages
INNER JOIN set ON pages.id = set.child
WHERE set.parent = <parent>
GROUP BY <fields from group>
ORDER BY <fields from group>

Though your query may be faster: with my previous experience with
SQLite, I've noticed it handles subselects a lot more easily than joins.
Maybe my proposal is a step backwards. If I recall correctly, my
optimized queries looked like this on my last project:

SELECT <fields from data>
FROM pages
WHERE id IN (SELECT child FROM set WHERE parent = <parent>)
GROUP BY <fields from group>
ORDER BY <fields from group>

I recommend trying all three and seeing which works best :).

> Basic data mechanisms
> ---------------------
> 
> A simple caching scheme will allow multiple ephy_db_query_page calls
> (made in quick session to the same node) to be performed very fast. A
> simple getting, setting and signaling scheme will also make life easy.

SQLite already does caching. We could save a lot of work by taking
advantage of it. All we have to do is specify a cache size. (This would
work extremely well with the structure above, but maybe BLOB metadata
would be a problem....)

> Now, the callbacks are called whenever a node or the child of a parent
> node changes (respectively). The callback is passed a bitmask of the
> data fields that changed. 
> 
> To know what parents to signal we will need to execute a query on the
> database. This is slow, but very rare, and will be very fast if we just
> cache information for the most recently touched/created node.

"SELECT parent_id FROM groupings WHERE child_id = <child>" -- this will
be *blazingly* fast, considering every modification of a child requires
user interaction.

> We may need some specialist functions, but they would become trivial to
> implement. In particular a "this child was added to a parent" would be a
> nice signal and data combo.

Yeah, this would be a really awesome hook for extensions.

> OK, that's it. Let me know what you think.

I, for one, love it :).

Adam

Attachment: signature.asc
Description: This is a digitally signed message part



[Date Prev][Date Next]   [Thread Prev][Thread Next]   [Thread Index] [Date Index] [Author Index]