Re: bugs.gnome.org.tar.gz ?



Martin Baulig <martin home-of-linux org> writes:

> Hi Miguel,
> 
> can you please send me a .tar.gz of all bugs from bugs.gnome.org
> or put it somewhere on cvs.gnome.org where I can find it ?
> 
> Ideally, I need all html files matching
> 
>         http://bugs.gnome.org/db/[\d+]/[\d+]\.html
> 
> I wgetted about 1000 or so of them this afternoon
> and now I have a script which can parse them so we can resubmit them
> into Bugzilla.
> 
> The script is a very strict parser (ie. it discards any bug report which
> is not in the correct format and prints a warning message so we can
> manually look at it), but this only happened with 1 out of 961 bug reports
> so far so this is already pretty much usable.

Hmmm, are you sure you should be doing this from the HTML files?

In fact, I'm afraid you are not going to be able to do this, since
the history we have of closed bugs does not include the HTML
files (in theory, we could feed all the old bugs back into debbugs
to regenerate the HTML, but that sounds a bit scary and backwards.)

The opt-debbugs-spool.tar.gz file in my home directory on canvas.gnome.org
has all the bug reports up to the middle of August in their internal
storage format. There are three files for each bug:

 nnn.log
 nnn.report
 nnn.status

The status file is easily parseable, though you might have
to look at the debbugs sources (urgh) to figure out the format.

The report file is the original bug report, as a mail message.

The log file is all mail traffic on the subject, separated
by some funny control characters. (Working out what they mean
will probably again require examination of the debbugs sources.)
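
As a rough sketch of the first step, here is one way to pair up the three
files for each bug once the tarball is unpacked. It is written in Python
rather than Perl purely for illustration. The function names are made up,
and the directory layout inside the spool is an assumption (the whole tree
is simply walked); only the nnn.log / nnn.report / nnn.status naming comes
from the description above.

```python
import os
import re
from collections import defaultdict

# Assumption: spool files are named <number>.<kind>. The directory layout
# inside the tarball is not known here, so the whole tree is walked.
SPOOL_NAME = re.compile(r"^(\d+)\.(log|report|status)$")


def collect_bugs(spool_root):
    """Map bug number -> {"log": path, "report": path, "status": path}."""
    bugs = defaultdict(dict)
    for dirpath, _dirnames, filenames in os.walk(spool_root):
        for name in filenames:
            match = SPOOL_NAME.match(name)
            if match:
                number, kind = int(match.group(1)), match.group(2)
                bugs[number][kind] = os.path.join(dirpath, name)
    return dict(bugs)


def incomplete_bugs(bugs):
    """Return the bug numbers that are missing one of the three files."""
    return sorted(n for n, files in bugs.items() if len(files) < 3)
```

Listing the incomplete bugs first should show whether the spool really has
all three files for every report before any parsing is attempted.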

So, basically, the conversion requires

 - a parser that understands bug submission format
   (including mime type handling to get attachments, if possible)
 - a parser that handles control messages [ should be easy ]
 - some basic glue that knows the funny control characters in
   the log file.
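
To give an idea of the first item, here is a minimal sketch of reading a
.report file as a mail message and pulling out any attachments. It uses
Python's standard email package instead of the MailTools Perl modules, and
the function name and the fields it returns are hypothetical choices, not
anything debbugs or Bugzilla defines. The control messages and the log file
separators are not covered, since their format has to come from the debbugs
sources.

```python
from email import message_from_bytes, policy


def parse_report(raw_bytes):
    """Parse one bug report (a mail message) into a plain dictionary.

    The keys returned here are illustrative only.
    """
    msg = message_from_bytes(raw_bytes, policy=policy.default)

    # Prefer the text/plain body; fall back to an empty string if none exists.
    body_part = msg.get_body(preferencelist=("plain",))
    body = body_part.get_content() if body_part is not None else ""

    # Each attachment becomes (filename, MIME type, decoded bytes).
    attachments = [
        (part.get_filename(), part.get_content_type(),
         part.get_payload(decode=True))
        for part in msg.iter_attachments()
    ]

    return {
        "subject": msg["Subject"],
        "from": msg["From"],
        "body": body,
        "attachments": attachments,
    }
```

It would be called with the raw contents of one report file, read in binary
mode, for example parse_report(open(path, "rb").read()).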

It doesn't sound all that bad, especially starting with MailTools
Perl modules (the ones that are used in my MhonArc wrapper
scripts, if you've looked at them.)

A secondary benefit of working from the logs is that it
will give us most of a debbugs compatibility layer.

Regards,
                                        Owen



