Re: [xml] semantics of XML_PARSE_NONET when reusing parser contexts

Hi Daniel,

Daniel Veillard wrote:
On Fri, Jul 27, 2007 at 11:06:36AM +0200, Stefan Behnel wrote:
one of the lxml users noticed that libxml2 changes behaviour when you set the
NONET option for xmlCtxtReadFile() and then call it twice on a network URL.
The first time, it parses the external document. The second time, it refuses
to parse it.

The problem lies in the handling of the parser options, which are only set
*after* the first call to xmlLoadExternalEntity(), in the following call to
xmlDoRead(). I think this is ok in general as it allows users to parse from a
URL by passing it in while still avoiding additional network access when loading
external entities transitively (DTDs etc.). Is this the intended semantics of
the NONET option?

  Hum, no. The NONET semantics are that any access outside the local filesystem
should generate an error. Note that if you have a catalog remapping external
resources to local ones, then they should proceed without failure.

Sounds like a bug then. But I actually find that behaviour useful. You can
check yourself if the URL you want to parse is a network URL, but you can't
easily check if external entities in the respective document come from the
network. So the current behaviour allows you to be more selective in what you
want to restrict.
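
To illustrate the "check yourself" part, here is a minimal sketch of such a
pre-check on the caller's side. The helper name is_network_url() is made up
for this example and is not part of libxml2, and the list of schemes is an
assumption that you would adapt to whatever your application considers
network access.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not a libxml2 function: case-insensitive test
 * whether `s` begins with `prefix`. */
static bool has_prefix_nocase(const char *s, const char *prefix)
{
    size_t i;
    size_t n = strlen(prefix);

    for (i = 0; i < n; i++) {
        /* Stops at the end of `s` too, since '\0' never matches. */
        if (tolower((unsigned char)s[i]) != tolower((unsigned char)prefix[i]))
            return false;
    }
    return true;
}

/* Hypothetical helper: returns true if `url` starts with one of the
 * schemes that this sketch assumes to mean "network access". */
static bool is_network_url(const char *url)
{
    /* Assumed scheme list, adjust as needed. */
    static const char *const schemes[] = { "http://", "https://", "ftp://" };
    size_t i;

    if (url == NULL)
        return false;
    for (i = 0; i < sizeof(schemes) / sizeof(schemes[0]); i++) {
        if (has_prefix_nocase(url, schemes[i]))
            return true;
    }
    return false;
}
```

A caller could run this on the URL before handing it to xmlCtxtReadFile() and
decide there whether to proceed. It says nothing about the external entities
the document itself references, which is exactly the part that is hard to
check from the outside.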

Depending on how contexts are reused in an application, this can lead to
unpredictable behaviour. In lxml, we can work around this by resetting the
context options after parsing, but I would like to see the intended semantics
of the NONET option cleared up and see reliable behaviour here.

  In general you should always reset the parsing context, like the xmlCtxtRead*
functions do.

Right, they already do that. So the problem is not resetting the context; the
problem is that the behaviour differs depending on whether the options were
already set on the context. It just leaks state.

