Re: firehose polling system



On Wed, 2008-01-23 at 18:27 -0500, Owen Taylor wrote:
> On Wed, 2008-01-23 at 16:39 -0500, Colin Walters wrote:
> > Hi,
> > 
> > Over the last few days I've been trying to design and code a new polling
> > system for Mugshot.  The main goal is to reduce the database traffic
> > we currently generate when checking for updates.  This in turn should
> > improve the Online Desktop server by helping it stay up longer.
> > 
> > It's called Firehose.  The current design page is here:
> > http://developer.mugshot.org/wiki/Firehose
> > 
> > I'll commit some code soon to the Mugshot SVN.
> 
> Some questions:
> 
>  - How are "tasksets" sent from the master to the slave?

Basically, the master has a list of active slaves and just sends the
tasksets to them over a plain HTTP POST.  My current thinking is to use
basic POST-like APIs for communication that doesn't need SQS-style
reliability.
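
A minimal sketch of that POST, assuming a JSON body and a made-up
slave URL; neither is the actual Firehose wire format, and both
function names are hypothetical:

```python
import json
import urllib.request


def build_taskset_request(slave_url, tasks):
    """Build (but do not send) an HTTP POST carrying a taskset.

    `tasks` is a list of (family, id) pairs.  The JSON encoding is an
    assumption for illustration only.
    """
    body = json.dumps({"tasks": [list(t) for t in tasks]}).encode("utf-8")
    return urllib.request.Request(
        slave_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_taskset(slave_url, tasks, timeout=10):
    """Best-effort delivery: True on a 2xx response, False on any failure."""
    request = build_taskset_request(slave_url, tasks)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return False
```

Since delivery is not meant to be reliable, the sender only reports
success or failure and leaves any retry decision to the caller.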

>  - How does a slave get details about information used to poll a task?
>    (like the URL to poll or whatever) The wiki page only describes a
>    task as a family/id pair.

It has a mapping from family->class implementation.
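
Roughly like the sketch below, where the class name, the family name
and the way a URL is derived from the task id are all hypothetical:

```python
class FeedPoller:
    """Hypothetical handler for a 'feed' task family."""

    def url_for(self, task_id):
        # Assumption: the task id alone is enough to derive the poll URL.
        return "http://feeds.example/" + task_id


# family name -> implementing class
TASK_FAMILIES = {
    "feed": FeedPoller,
}


def handler_for(family):
    """Instantiate the handler class registered for a task family."""
    try:
        return TASK_FAMILIES[family]()
    except KeyError:
        raise ValueError("unknown task family: %r" % (family,)) from None
```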

>  - How is data needed for polling like private keys distributed to the
>    slave?

Any private keys would have to be included in the configuration, as is
the case with the server now.

>  - Is there any affinity for tasks? Do we always execute the same task
>    on the same slave, or is each run of each task assigned to a slave
>    independently?

They're independent.

>  - Do we have any way of sending If-Modified headers when applicable?
>    If we are running this on EC2, we'll be paying per gigabyte of
>    downloaded data.

That is a good point; I need to update the spec to say that the result
of a poll is (SHA1, timestamp).  We can use the timestamp to send the
If-Modified-Since header.
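
As a sketch of how that pair could be produced and reused, assuming
the timestamp is kept as seconds since the epoch (the function names
are made up for illustration):

```python
import hashlib
from email.utils import formatdate


def poll_result(body, timestamp):
    """Return the (SHA1, timestamp) pair for one poll of `body` (bytes)."""
    return hashlib.sha1(body).hexdigest(), timestamp


def conditional_headers(prev_timestamp):
    """Headers for the next poll; empty if the task was never polled."""
    if prev_timestamp is None:
        return {}
    # HTTP dates must be in GMT, e.g. "Thu, 01 Jan 1970 00:00:00 GMT".
    return {"If-Modified-Since": formatdate(prev_timestamp, usegmt=True)}


def has_changed(prev_sha1, new_sha1):
    """A task counts as changed when the content hash differs."""
    return prev_sha1 != new_sha1
```

A server that honors the header answers 304 Not Modified with no body,
which is what saves the bandwidth; the SHA1 comparison still covers
servers that ignore it.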

>  - Is the list of tasks persistently saved on the master, or does the
>    server send tasks again on restart?

It's stored persistently.  Currently it's a sqlite database.
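
Something along these lines, although the table layout here is only
an assumption and not the real schema:

```python
import sqlite3


def open_task_store(path):
    """Open (and if needed create) the task database at `path`."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tasks ("
        " family TEXT NOT NULL,"
        " id TEXT NOT NULL,"
        " PRIMARY KEY (family, id))"
    )
    return conn


def add_task(conn, family, task_id):
    """Record a task; adding the same task twice is a no-op."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR IGNORE INTO tasks (family, id) VALUES (?, ?)",
            (family, task_id),
        )


def all_tasks(conn):
    """Return every stored task as a sorted list of (family, id) tuples."""
    return conn.execute(
        "SELECT family, id FROM tasks ORDER BY family, id"
    ).fetchall()
```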

>  - If we wanted to "sync up" the set of tasks that the master is
>    executing with the set that we should be running (I could imagine
>    them getting out of sync for various reasons if we keep the tasks
>    around persistently), how do we do it?

Another good point; what I was thinking of doing is writing a script to
create all the tasks now, but you could imagine having a specific
"dump/load" system where mugshot would store a snapshot of task IDs in
S3, and send a message to reload the firehose master from that list.
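
The reload step itself could be a plain set difference.  A sketch,
assuming both sides are lists of (family, id) pairs and leaving out
the S3 transfer entirely (`plan_sync` is a hypothetical name):

```python
def plan_sync(current, snapshot):
    """Work out how to bring the master's task set in line with a snapshot.

    Returns (to_add, to_remove), each a sorted list of (family, id) pairs.
    """
    current, snapshot = set(current), set(snapshot)
    to_add = sorted(snapshot - current)
    to_remove = sorted(current - snapshot)
    return to_add, to_remove
```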

>  - Does the master implement "poll tasks faster after changes"?

Not yet, though we should do that.
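
One possible shape for it, purely as an illustration: halve the poll
interval when a task changed and double it when it did not, within
fixed bounds.  The policy and the bounds are assumptions, nothing is
implemented:

```python
def next_interval(current, changed, minimum=60, maximum=3600):
    """Pick the next poll interval in seconds for one task."""
    if changed:
        # Recently changed tasks get polled sooner, down to `minimum`.
        return max(minimum, current // 2)
    # Quiet tasks back off, up to `maximum`.
    return min(maximum, current * 2)
```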



