beagle r3304 - in trunk/beagle: Filters Filters/HtmlAgilityPack beagled



Author: dbera
Date: 2007-01-22 16:47:16 +0000 (Mon, 22 Jan 2007)
New Revision: 3304
ViewCVS link: http://svn.gnome.org/viewcvs/beagle?rev=3304&view=rev

Modified:
   trunk/beagle/Filters/FilterHtml.cs
   trunk/beagle/Filters/HtmlAgilityPack/HtmlDocument.cs
   trunk/beagle/beagled/ExtractContent.cs
   trunk/beagle/beagled/Filter.cs
Log:
* ExtractContent.cs: Handle --tokenize properly for DisplayContent, since it now takes a block of characters instead of a line.
* Filter.cs: Add whitespace after a word in hotpool.
* HtmlDocument.cs: Enable pausing and resuming of HTML parsing. The parser has come a long way: from a DOM-style parser, to an event-driven parser, to what is now effectively a stream parser.
* FilterHtml.cs: Instead of extracting all the text in DoOpen(), extract all information in <head> in DoPullProperties() and then extract all text in DoPull(). Use the pause and resume features of HtmlDocument to send only some text in each call of DoPull(). Remove the stack-based hotness and ignore-state detection - use just a counter now. Use AppendWord instead of AppendText when appending hrefs and img alt text, since they won't contain newlines.
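
The pull-style extraction and the counter-based hot/ignore tracking described above can be sketched as follows. This is only an illustration in Python using the standard html.parser module, not the Beagle or HtmlAgilityPack code. The class name PullTextExtractor, the pull() method, the block size, and the two tag sets are all assumptions made for the example. Feeding the parser one block of input per call stands in for pausing and resuming it.

```python
from html.parser import HTMLParser

# Assumed tag sets for this sketch; the real filter's lists may differ.
IGNORED_TAGS = {"script", "style"}
HOT_TAGS = {"b", "strong", "em", "h1", "h2", "h3"}


class PullTextExtractor(HTMLParser):
    """Hypothetical pull-style extractor: each pull() parses one more block."""

    def __init__(self, html, block_size=64):
        super().__init__(convert_charrefs=True)
        self._html = html
        self._pos = 0                # how much of the input has been fed so far
        self._block_size = block_size
        self._ignore_depth = 0       # counter instead of a stack of ignored tags
        self._hot_depth = 0          # counter instead of a stack of hot tags
        self._pending = []           # (text, is_hot) pieces not yet returned

    def handle_starttag(self, tag, attrs):
        if tag in IGNORED_TAGS:
            self._ignore_depth += 1
        elif tag in HOT_TAGS:
            self._hot_depth += 1

    def handle_endtag(self, tag):
        # Guard against stray closing tags so the counters never go negative.
        if tag in IGNORED_TAGS and self._ignore_depth > 0:
            self._ignore_depth -= 1
        elif tag in HOT_TAGS and self._hot_depth > 0:
            self._hot_depth -= 1

    def handle_data(self, data):
        if self._ignore_depth == 0:
            self._pending.append((data, self._hot_depth > 0))

    def pull(self):
        """Parse the next block; return its (text, is_hot) pieces, or None at the end."""
        if self._pos >= len(self._html):
            return None
        block = self._html[self._pos:self._pos + self._block_size]
        self._pos += len(block)
        self.feed(block)             # parser state survives between calls
        if self._pos >= len(self._html):
            self.close()             # flush anything still buffered
        out, self._pending = self._pending, []
        return out


def extract_all(html, block_size=16):
    """Drive the extractor to completion; return all pieces and the call count."""
    extractor = PullTextExtractor(html, block_size)
    pieces, calls = [], 0
    while True:
        chunk = extractor.pull()
        if chunk is None:
            break
        calls += 1
        pieces.extend(chunk)
    return pieces, calls
```

A caller would invoke pull() repeatedly until it returns None, which mirrors sending only some text in each call of DoPull(). Because a block boundary can fall in the middle of a word, the pieces are returned unstripped and should be concatenated by the caller.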




