Re: New developments on Caribou



Hi,

marmuta wrote:
I think there is scope to join forces between presage and onboard.

presage is architected to merge predictions generated by a set of predictors. Each predictor uses a different language model/predictive algorithm to generate predictions.

Currently presage provides the following predictors:
- ARPA predictor: statistical language modelling data in the ARPA
  N-gram format
- generalized smoothed n-gram statistical predictor: can work with
  n-grams of arbitrary cardinality
- recency predictor: based on the recency promotion principle
- dictionary predictor: generates a prediction by returning tokens
  that are a completion of the current prefix, in alphabetical order
- abbreviation expansion predictor: maps the current prefix to a
  token and returns the token in a prediction with a 1.0 probability
- dejavu predictor: learns and then later reproduces previously seen
  text sequences.

A bit more information on how these predictors work is available
here: http://presage.sourceforge.net/?q=node/15
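To illustrate the merge-of-predictors architecture described above, here is a minimal Python sketch. The function names and the max-probability merge rule are my own assumptions for illustration, not presage's actual API:

```python
# Toy sketch of an engine merging predictions from several predictors
# (hypothetical names and merge rule, not presage's real implementation).

def dictionary_predictor(prefix, vocab=("prediction", "predictor", "presage")):
    # Return completions of the prefix, sharing probability mass equally.
    matches = [w for w in vocab if w.startswith(prefix)]
    return {w: 1.0 / len(matches) for w in matches} if matches else {}

def abbreviation_predictor(prefix, table={"btw": "by the way"}):
    # Map an abbreviation to its expansion with probability 1.0.
    return {table[prefix]: 1.0} if prefix in table else {}

def merge_predictions(prefix, predictors, limit=5):
    # Combine candidates from all predictors, keeping the best score
    # per token, and return a ranked list of suggestions.
    combined = {}
    for predict in predictors:
        for token, p in predict(prefix).items():
            combined[token] = max(combined.get(token, 0.0), p)
    return sorted(combined, key=combined.get, reverse=True)[:limit]
```

Each predictor stays independent, and the engine only sees (token, probability) pairs, which is what makes switching predictors on and off cheap.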


It sounds like the language model and predictive algorithm used in
the onboard word-prediction branch is an ideal candidate to be
integrated into presage and become a new presage predictor class.
Pretty interesting stuff, but from looking over its feature list I'm
wondering what presage would gain. There doesn't seem to be much that
onboard's prediction could add that isn't implemented already.
Roughly compared, gpredict (the name is subject to change) covers
these presage components:

- generalized smoothed n-gram statistical predictor
- recency predictor (with exponential falloff)
- dictionary predictor (word completion)
- dejavu predictor? (if it does continuous on-line learning)

The main difference, apart from the general architecture, may be that
gpredict uses dynamically updatable language models, handy for on-line
learning. I'm not completely sure, but it seems presage's three n-gram
predictors are based on immutable models and the dejavu predictor keeps
a separate adaptable model of unigrams.

The generalized smoothed n-gram predictor does continuous on-line learning (learning can be turned on or off at runtime or via configuration). When learning is turned on, the language model is updated on the fly with new n-gram counts.
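A rough Python sketch of the on-the-fly count updates described above (illustrative only; presage actually keeps these counts in an sqlite-backed model, and the class and method names here are invented):

```python
# Toy dynamic n-gram model with continuous on-line learning: when
# learning is on, each incoming token updates the counts of all
# n-grams ending in it, up to the model's order.
from collections import Counter

class NgramModel:
    def __init__(self, order=3, learning=True):
        self.order = order
        self.learning = learning   # can be toggled at runtime
        self.counts = Counter()    # n-gram tuple -> count
        self.history = []

    def learn(self, token):
        if not self.learning:
            return
        self.history.append(token)
        for n in range(1, self.order + 1):
            if len(self.history) >= n:
                self.counts[tuple(self.history[-n:])] += 1

model = NgramModel(order=2)
for tok in "the cat sat on the mat".split():
    model.learn(tok)
```

Turning learning off simply freezes the counts, which matches the runtime on/off switch described above.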

The dejavu predictor is just a toy predictor, really. I wrote it to try things out when I started implementing continuous online learning functionality, and it now serves as a simple example of how to implement a learning predictor class.

Similarly, the smoothed count predictor and the 3-gram smoothed predictor are remnants from a time when I was experimenting with language models; they really were stepping stones towards the generalized smoothed n-gram predictor, which is currently the main statistical predictor (along with the ARPA predictor).

presage could then be the engine used to power the D-Bus prediction
service, offering the predictive capabilities of the onboard language
model/predictor, plus all the predictors currently provided by
presage (all of which can be turned on/off and configured to suit
individual needs).
The modularity could be helpful, even though I'm not sure if I could
really make use of it.

We were very concerned about memory usage and had initially thought
about using static ARPA-compatible structures for large immutable
language models, and dynamically updatable models only for on-line
learning. However, the dynamic models later turned out to be almost as
efficient as the ARPA implementation, and so now there are (flavors of)
dynamic models for everything.

Similar consolidation happened with recency caching. It was originally
planned as a separate modular component. However that would have meant
redundant storage of n-grams and a forced limit to some arbitrarily
small number of recent n-grams. So I integrated it more closely with
the generic dynamic models, gaining recency tracking across all known
n-grams but sacrificing some modularity (there is still variability
through inheritance, though).
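As a rough illustration of recency tracking folded into a dynamic model with exponential falloff (my own toy formulation, not gpredict's actual code): each n-gram remembers when it was last seen, and its score decays with the age of that sighting.

```python
# Toy model combining frequency counts with exponentially decaying
# recency: an n-gram's score halves every `halflife` time steps since
# it was last seen. Hypothetical sketch, not gpredict's real scoring.
import math

class RecencyModel:
    def __init__(self, halflife=100.0):
        self.halflife = halflife
        self.count = {}      # n-gram -> frequency count
        self.last_seen = {}  # n-gram -> time step of last occurrence
        self.clock = 0

    def learn(self, ngram):
        self.clock += 1
        self.count[ngram] = self.count.get(ngram, 0) + 1
        self.last_seen[ngram] = self.clock

    def score(self, ngram):
        if ngram not in self.count:
            return 0.0
        age = self.clock - self.last_seen[ngram]
        decay = math.exp(-math.log(2) * age / self.halflife)
        return self.count[ngram] * decay
```

Because the recency data lives alongside the counts, every known n-gram gets recency tracking, with no separate cache and no arbitrary cutoff.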

If onboard's current predictive functionality was merged into presage and encapsulated into a (say, for lack of a better name) OnboardPredictor class, then presage's modularity would be useful because it would allow us to:
- replicate exactly the same predictive functionality of the current gpredict service, by switching on OnboardPredictor and turning off the other predictors
- augment OnboardPredictor's predictive functionality with the other predictors currently provided by presage, as desired by onboard or the user, simply by modifying a config variable.

Presage would definitely benefit from having a new and high-quality predictor in its core.

The presage core library itself has minimal dependencies: it pretty
much only needs a C++ runtime and sqlite, which is used as the
backing store for n-gram based language models (this ensures fast
access, a minimal memory footprint, and no delays while loading the
language model into memory).
That is definitely an advantage, as gpredict currently takes around 5s
(@3GHz) to load the English base model with ~1.4 million n-grams.
Memory usage may or may not be an issue; the D-Bus service with only
English as the resident language takes around 30MB.

I trained presage's smoothed n-gram predictor on the text corpora currently used by gpredict, yielding a language model with ~1.2 million n-grams, compared to presage's default language model, which is trained on a single text (namely The Picture of Dorian Gray) and totals about ~75,000 n-grams.

The increase in prediction time and resident memory required on a control text is very small compared to the increase in n-grams:
- ~75 thousand n-grams: prediction time ~7 seconds, resident memory size ~3MB
- ~1.2 million n-grams: prediction time ~17 seconds, resident memory size ~5MB

This preliminary testing shows that prediction time and memory consumption do not grow linearly with the number of n-grams.

That said, when I first saw presage, I wasn't too happy about its sqlite
dependency. Sqlite often means frequent hard drive accesses and a choice
between general slowness due to generous fsync'ing, or all bets being
off concerning data safety. That may be unfounded prejudice in this
case, and perhaps presage has overcome all of that; I didn't do any
real-world testing with it.

Yes, that's the trade-off of having the language model on disk rather than in memory. There are advantages and disadvantages to either choice.

The great thing about it is that, strictly speaking, it's not presage that has a dependency on sqlite, but rather the individual predictors that store their language model in an sqlite database. In other words, the dependency on sqlite could be removed from the presage library itself and moved to the smoothed n-gram predictor. This would be very little work (a 10-minute job, I believe).

In practice, I found sqlite very fast and reliable. Presage's database connector layer encloses all writes to the database (and reads too, for that matter) in transactions, which guarantees atomicity of updates to the language model.
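The transaction pattern can be sketched with Python's sqlite3 module and a hypothetical two-column n-gram schema (not presage's actual schema or connector code):

```python
# Atomic batch updates of n-gram counts in sqlite: all writes happen
# inside one transaction, so the batch is applied entirely or not at
# all. Illustrative schema; requires SQLite >= 3.24 for ON CONFLICT.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ngram (w1 TEXT, w2 TEXT, count INTEGER, "
    "PRIMARY KEY (w1, w2))"
)

def learn_bigrams(conn, bigrams):
    # The connection context manager opens a transaction, commits on
    # success, and rolls back if any statement raises.
    with conn:
        for w1, w2 in bigrams:
            conn.execute(
                "INSERT INTO ngram (w1, w2, count) VALUES (?, ?, 1) "
                "ON CONFLICT (w1, w2) DO UPDATE SET count = count + 1",
                (w1, w2),
            )

learn_bigrams(conn, [("the", "cat"), ("the", "cat"), ("the", "dog")])
```

A crash mid-batch leaves the model at the previous committed state rather than with half-applied counts, which is the atomicity guarantee mentioned above.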

For details about the word prediction service, please contact
marmuta, who did nearly all the work on the word prediction
service.
I'll follow up with marmuta to discuss the feasibility of making this happen and work out the technical details, in case there is consensus
to go ahead with this.
I'm happy to further discuss this, even though I'm a bit torn currently.

I can see the appeal of having presage (or other candidates like nltk)
be the central repository for all kinds of prediction needs. On the
other hand, the advantages of merging gpredict into presage don't seem
that obvious. Most of the functionality already exists in presage,
and from onboard's point of view using presage currently gains it
little except new dependencies.

I need to look at gpredict's language model and predictive algorithm in more detail, but I currently believe that presage would benefit from having a new predictor available, which can be turned on and combined with the existing predictors.

onboard would benefit from having access to presage's other predictors, which can be configured on or off and customized by the user (e.g. the abbreviation expansion predictor).

Also, onboard's prediction service was always meant to be a
full-featured standalone word predictor. It is largely working as
planned, and we were going to split it off from onboard as a
ready-to-use D-Bus service soon. Rebasing on presage at this point
would probably delay things considerably for onboard. I'm not sure yet
if this is the right thing to do, but I'm open to arguments in favour.

Well, I understand the concerns about delaying things for onboard, but I think there are significant benefits in integrating gpredict into presage and building a prediction D-Bus service on top of presage.

Perhaps we could start by trying onboard with the presage D-Bus service that David has created, while we integrate gpredict into presage (basically, this would mean moving the C++ code into a class implementing a Predictor interface). I'm willing to help with this.


Cheers,
- Matteo


