[Straw] Fixing async IO reliability



Hello all,
I'm a long-time happy user of Straw and started looking at the code 
recently in the context of fixing a release-critical bug for Debian:
#397469: straw: Does not work with python-adns installed
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=397469

I'm a big fan of asynchronous IO over threaded concurrency, but in 
Straw it seems to be the cause of complexity and unreliability in the 
networking, as seen in the code and bug reports. Perhaps the async code 
can be fixed for good, or perhaps the feeds should be updated in separate 
worker threads. I have described a couple of issues below, all of which 
would disappear if threads were used instead. Consider this an offer to 
work on these issues, whether threading is to be used or not.


1. Loading feeds takes a very long time or times out

This is the problem fixed in the bug report above. Async code should only 
sleep in the GTK main loop. Instead, when the ADNS library is installed 
for async DNS lookups, Straw repeatedly sleeps for 0.1 seconds in this 
call chain:
URLFetch.py:167:lookup_manager.poll(timeout)
LookupManager.py:164:self.queryengine.run(timeout)
ADNS.py:46:self._s.completed(timeout)

This means asyncore.poll is hardly ever called in the following lines of 
URLFetch.py, which causes feed loading to stall. Changing the timeout 
parameter given to ADNS to 0 fixes this issue.

The if statement around asyncore.poll and the non-zero timeout disrupt 
fair scheduling and don't help much, so they should be removed as well.
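
To illustrate what a zero timeout buys, here is a minimal sketch using the 
standard select module as a stand-in for the ADNS and asyncore pollers. 
The name poll_once is hypothetical and this is not Straw's code; it only 
shows that a poll with timeout 0 returns at once whether or not data is 
waiting.

```python
import select
import socket


def poll_once(sockets, timeout=0.0):
    """Return the sockets readable right now; a timeout of 0 never blocks."""
    readable, _, _ = select.select(sockets, [], [], timeout)
    return readable


# Demo on a local socket pair, so no network access is needed.
a, b = socket.socketpair()
idle = poll_once([a])    # nothing to read yet, returns immediately
b.send(b"x")
ready = poll_once([a])   # one byte is pending now
a.close()
b.close()
```

With every poller called this way, the only place left to sleep is the 
main loop itself.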


2. Limited download speed

NetworkConstants.py currently defines
POLL_INTERVAL = 500

This means we read from the buffer at most twice per second, giving a 
maximum download speed of 10 kilobytes per second per connection here. Not 
only do large feeds load slowly, but this also ties up resources on the 
remote server.

The problem is alleviated by changing POLL_INTERVAL to something like 10 
which gives a maximum download speed of 100 kilobytes per second - still 
far from the over 700 kilobytes per second that wget gives here.

A fix would mean integrating the GTK, asyncore and ADNS main loops into 
one, for example by adding the asyncore and ADNS connections as GTK 
watches. I have been using PyGTK and Twisted Python together with good 
results, so perhaps one option would be to switch from asyncore to 
Twisted, although this wouldn't fix the rest of the issues.
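
As a rough sketch of the single-loop idea, the standard selectors module 
can stand in for GTK watches: each socket is registered with a callback, 
and the loop blocks in exactly one place until some socket is ready. The 
names on_readable and run_loop are hypothetical, and a real integration 
would register the file descriptors with the GTK main loop instead.

```python
import selectors
import socket

sel = selectors.DefaultSelector()
received = []


def on_readable(sock):
    """Callback run by the loop as soon as the socket has data."""
    data = sock.recv(4096)
    if data:
        received.append(data)
    else:
        # An empty read means the peer closed the connection.
        sel.unregister(sock)
        sock.close()


def run_loop():
    """Dispatch callbacks until no watched sockets remain."""
    while sel.get_map():
        for key, _events in sel.select():  # the only place we block
            key.data(key.fileobj)


# Demo on a local socket pair, so no network access is needed.
reader, writer = socket.socketpair()
sel.register(reader, selectors.EVENT_READ, on_readable)
writer.sendall(b"feed data")
writer.close()
run_loop()
```

Because data is read as soon as it arrives, throughput no longer depends 
on a polling interval.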


3. Feed URIs with IP addresses don't work, IPv6 and /etc/hosts don't work

ADNS is meant for DNS lookups in server software, not for name resolving 
in desktop applications. A user can expect IP addresses, IPv6, /etc/hosts 
etc. to work as in every other app, but they can't work unless we use the 
system resolver instead of ADNS. As a fix, IP addresses can be 
special-cased, ADNS may get IPv6 support, and the rest could be listed as 
a known "feature".

On the other hand, using threads we can call getaddrinfo in the Python 
libraries, which corresponds to the respective POSIX function and uses the 
system resolver.
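
A minimal sketch of such a lookup, assuming a worker thread that reports 
back through a queue (the function name resolve and the result tuple are 
hypothetical, not Straw's code):

```python
import queue
import socket
import threading


def resolve(host, port, results):
    """Worker: resolve via the system resolver and post the outcome."""
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
        # Each entry is (family, type, proto, canonname, sockaddr);
        # sockaddr[0] is the address string for both IPv4 and IPv6.
        addresses = [info[4][0] for info in infos]
        results.put((host, addresses, None))
    except socket.gaierror as err:
        results.put((host, [], err))


results = queue.Queue()
worker = threading.Thread(target=resolve, args=("127.0.0.1", 80, results))
worker.start()
worker.join()
host, addresses, error = results.get()
```

The same call handles literal IP addresses, IPv6 and /etc/hosts entries, 
since it goes through the same resolver as other applications.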


4. Feed parsing hangs everything

This isn't quite an IO problem, but it would get fixed with threads. 
Another way would be to patch feedparser.py to use the incremental 
interfaces of xml.sax and sgmllib and feed the content in reasonably 
sized pieces.
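
To show what the incremental interface looks like, here is a small sketch 
using xml.sax only (sgmllib is left out). TitleCollector, parse_in_chunks 
and the sample feed are made up for illustration and have nothing to do 
with feedparser's internals.

```python
import xml.sax


class TitleCollector(xml.sax.ContentHandler):
    """Collect the text of every <title> element."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._buf = None

    def startElement(self, name, attrs):
        if name == "title":
            self._buf = []

    def characters(self, content):
        if self._buf is not None:
            self._buf.append(content)

    def endElement(self, name):
        if name == "title" and self._buf is not None:
            self.titles.append("".join(self._buf))
            self._buf = None


def parse_in_chunks(data, handler, chunk_size=16):
    """Feed the document to the parser one small piece at a time."""
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    for start in range(0, len(data), chunk_size):
        parser.feed(data[start:start + chunk_size])
        # A GUI application could return to its main loop here.
    parser.close()


feed = (b"<rss><channel><title>Example</title>"
        b"<item><title>First post</title></item></channel></rss>")
handler = TitleCollector()
parse_in_chunks(feed, handler)
```

Each call to feed() does a bounded amount of work, so the user interface 
would get a chance to run between chunks.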


I suppose changing from asyncore and ADNS to threading would require small 
changes all over the code. However, the functionality is mainly in 
LookupManager.py, URLFetch.py, and SummaryParser.py. I'd hope the 
threading model wouldn't get too complex if a worker thread was spawned 
for each feed update. The thread could independently perform hostname 
lookup, content downloading and feed parsing, and the main thread would 
then insert the results into the user interface.
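
The model I have in mind could look roughly like the sketch below. All 
names (update_feed, start_updates, drain) are hypothetical, and the fetch 
and parse functions are trivial stand-ins for the real download and 
parsing code. Workers never touch the user interface; they only put 
results on a queue that the main thread empties, for instance from a GTK 
timeout or idle handler.

```python
import queue
import threading


def update_feed(url, fetch, parse, results):
    """Worker thread body: download and parse one feed, then report back."""
    try:
        results.put((url, parse(fetch(url)), None))
    except Exception as err:  # report failures instead of dying silently
        results.put((url, None, err))


def start_updates(urls, fetch, parse):
    """Spawn one worker thread per feed and return them with the queue."""
    results = queue.Queue()
    threads = [
        threading.Thread(target=update_feed, args=(url, fetch, parse, results))
        for url in urls
    ]
    for thread in threads:
        thread.start()
    return threads, results


def drain(results, on_result):
    """Main-thread side: hand over finished updates without blocking."""
    while True:
        try:
            url, parsed, error = results.get_nowait()
        except queue.Empty:
            return
        on_result(url, parsed, error)


# Demo with stand-in fetch and parse functions (no network access).
threads, results = start_updates(
    ["feed-a", "feed-b"],
    fetch=lambda url: "body of " + url,
    parse=str.upper,
)
for thread in threads:
    thread.join()
shown = {}
drain(results, lambda url, parsed, error: shown.update({url: (parsed, error)}))
```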


Do you think these ideas could be of use in the further development of 
Straw?

Regards,
Tuukka Hastrup

-- 
-- Trying to catch me? Just follow up my Electric Fingerprints
-- To help you: Tuukka Hastrup iki fi
                http://www.iki.fi/Tuukka.Hastrup/


