Re: [xml] performance of parsing docbook with xincludes

From: Stefan Sauer <ensonic hora-obscura de>
To: "Eric S. Eberhard" <eric vicsmba com>
Cc: xml gnome org
Subject: Re: [xml] performance of parsing docbook with xincludes
Date: Thu, 7 Jun 2018 09:22:54 +0200

On 06/07/2018 12:54 AM, Eric S. Eberhard wrote:

I know I am the oddball here but -- why use DTDs at all?

I gave reasons above. I am working on a tool. How people use the tool is not under my control. Maybe we can focus on the opportunity to improve libxml2 a bit here.

I supply software to a lot of companies (thousands through dealers). Many exchange millions of XML docs per day. I've used this since it was libxml. Even have some patches in there. My application is proprietary (meaning XML to get an order or tell a customer our availability is simply XML I designed and documented and give to my customer's customers (via download from a Web page)). Once they get it working it pretty much always works. They write software to create orders and send them to us -- it is consistent (I know, not everyone has this luxury so this may not apply to everyone). So why check them?

I also found that I was getting a gazillion support tickets because of DTDs ... simple things like a date ... seem to escape people -- take June 7, 2018

In our date fields we will take:
    Jun 7 2018
    June 7 2018
    the above with commas and any case (upper/lower/mixed)
    6/7/18
    6/7/2018
    2018/6/7
    20180607
    180607
    06-07-18

And actually many many more. Anything that is a date goes through this one routine and if there is any way in the world to extract a date, we do.
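
A rough Python sketch of what one lenient date routine of this kind might look like. The function name, the exact set of patterns, and the rules that two-digit years mean 20xx, that slash or dash dates are month-first unless they start with a four-digit year, and that six-digit compact dates are YYMMDD are all assumptions made for illustration; this is not the actual routine described above.

```python
import re
from datetime import date

# Month lookup by three-letter prefix, so "Jun" and "June" both work.
MONTHS = {name: i for i, name in enumerate(
    ["jan", "feb", "mar", "apr", "may", "jun",
     "jul", "aug", "sep", "oct", "nov", "dec"], start=1)}

def _year(y):
    # Assumption: two-digit years mean 20xx.
    return y + 2000 if y < 100 else y

def parse_lenient_date(text):
    """Return a date if one can be extracted from text, else None."""
    s = text.strip().lower().replace(",", " ")
    try:
        # Month-name forms: "Jun 7 2018", "JUNE 7, 2018"
        m = re.fullmatch(r"([a-z]{3,9})\.?\s+(\d{1,2})\s+(\d{4}|\d{2})", s)
        if m and m.group(1)[:3] in MONTHS:
            return date(_year(int(m.group(3))), MONTHS[m.group(1)[:3]],
                        int(m.group(2)))
        # Year-first: "2018/6/7"
        m = re.fullmatch(r"(\d{4})[/-](\d{1,2})[/-](\d{1,2})", s)
        if m:
            return date(int(m.group(1)), int(m.group(2)), int(m.group(3)))
        # Month-first: "6/7/18", "6/7/2018", "06-07-18"
        m = re.fullmatch(r"(\d{1,2})[/-](\d{1,2})[/-](\d{4}|\d{2})", s)
        if m:
            return date(_year(int(m.group(3))), int(m.group(1)),
                        int(m.group(2)))
        # Compact: "20180607" (YYYYMMDD) or "180607" (YYMMDD)
        m = re.fullmatch(r"(\d{4}|\d{2})(\d{2})(\d{2})", s)
        if m:
            return date(_year(int(m.group(1))), int(m.group(2)),
                        int(m.group(3)))
    except ValueError:
        # Digits matched a pattern but do not form a real calendar date.
        return None
    return None
```

The point of the design is that every date field goes through this one function, so a new input format seen in a support ticket becomes one more pattern here rather than a schema change.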

Ditto money -- say $1,245.56

We accept:
    $1,245.56
    1245.56
    124556        (decimal is implied at 2 places if no decimal is found)
    1,245.56

And many more - same thing, one routine reads it and if we can possibly get a reasonable number, we do.
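
A similar sketch for the money case, again in Python and again only an illustration: the function name is hypothetical, and the only rules implemented are the ones listed above (strip the dollar sign and commas, and imply two decimal places when no decimal point is present).

```python
import re
from decimal import Decimal

def parse_lenient_money(text):
    """Return a Decimal amount if one can be extracted from text, else None."""
    # Drop the currency symbol, thousands separators, and whitespace.
    s = re.sub(r"[$,\s]", "", text)
    if re.fullmatch(r"\d+\.\d{1,2}", s):
        # An explicit decimal point was supplied.
        return Decimal(s)
    if re.fullmatch(r"\d+", s):
        # No decimal point: the last two digits are taken as cents.
        return Decimal(s) / 100
    return None
```

Decimal is used instead of float so that amounts such as 1245.56 are represented exactly.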

This, in turn, reduced our CONSTANT support tickets for silly things like a format of something to ZERO. Which I like.

Even sicker -- we ignore case on tags. All of our XML is designed to not use duplicate names with different cases (a stupid thing to do anyway -- expecting orderNumber and OrderNumber to both be used, as different things).

As long as the customer is consistent and the XML is well formed we scan the tree and compare tags without regard to case. A WHOLE LOT more support tickets gone.
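
To illustrate the idea of scanning the tree and comparing tags without regard to case, here is a small Python sketch using the standard xml.etree.ElementTree module. The helper name find_ci and the sample element names are made up for this example; the real application is not shown in this thread.

```python
import xml.etree.ElementTree as ET

def find_ci(parent, name):
    """Return the first child of parent whose tag equals name, ignoring case."""
    wanted = name.lower()
    for child in parent:
        if child.tag.lower() == wanted:
            return child
    return None

# The sender only has to be well formed and self-consistent in its casing.
doc = ET.fromstring(
    "<Order><ORDERNUMBER>42</ORDERNUMBER><Qty>3</Qty></Order>")
order_number = find_ci(doc, "orderNumber")
```

This only works because the vocabulary never uses two names that differ solely in case, as noted above.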

A lot of the people we deal with are not sophisticated. As the receiver of XML we decided it was much better to be as flexible as possible and take what we can if at all possible. After all -- a DTD can indeed tell you if an address comes in without a city name. And reject it and usually generate a support ticket. Since we use an on-line AVS system (more XML) and if we have the zip and the address otherwise matches ... we don't need the city and state ... the AVS system provides it. And if it fails they will get an error back from us (from the application) anyway. So why use a DTD to see if the city or state were sent? A LOT MORE support calls removed.

And, of course, performance without the DTDs is much better.

As a result we are able to give documentation to new customers and they are able to get it up and running with little to no help. Any serious errors we cannot fix are clearly explained in the responses BY THE APPLICATION and not by a DTD.

Being flexible on our end reduces support tickets, which is all I care about. I would rather code for all the mistakes I can think of that an end user would make (and we add new ones when they crop up) than be strict and do a lot of support. We don't think DTDs are flexible enough. And I hate making them :-)

We do offer a page with DTDs they can use manually to check their document if they like -- or they can send it to our test system. Once they are running they seem to do just fine.

As programmers it is hard to believe, but sometimes it is better for us to make slightly less efficient code in order to make the human aspect much more efficient. I once had someone send me a link to a "contest" which was a convoluted C statement, asking what the result would be. My response -- "fire the programmer!"

If it takes hundreds of competent C programmers to work out what one line of code does (and only a small percentage got it right) -- it is bad code. And for people's information, modern computers read ahead and pre-execute code based on all kinds of weird logic. Simple C code is easy for them to handle ... but convoluted code ends up stopping the pre-execution and is actually slower -- it may have fewer lines of code -- but it will be slower. I see nothing wrong with short clear clean code with as little craziness as possible. This is the same with XML -- one can go overboard easily, K.I.S.S. :-)

Not being so strict and not using DTDs has had other benefits -- say EDI (from old IBMs) -- we have a cheap program that maps EDI to XML and back. So we can handle EDI -- and we don't need new software (after the conversion). We accept the EDI, convert to XML, run our standard application, create an XML response, which is converted to EDI. The package we use is low cost and no, it won't work too well with DTDs as EDI has its own problems.

I could go on but most of you have probably skipped this post by now :-)

E

On 6/6/2018 3:00 PM, Stefan Sauer wrote:
On 05/17/2018 06:01 PM, Stefan Sauer wrote:
  
On 05/17/2018 04:18 PM, Nick Wellnhofer wrote:
    
On 16/05/2018 21:51, Stefan Sauer wrote:
      
So one solution could be another flag to enable this?
        
Yes, but it would be rather ugly.
      
In which sense? I guess because it is something that no one should need
to know about or have to care about?
    
Thanks, reading the code. Need to figure out where we could cache external
subsets and what a suitable key is (ExternalID ?).
        
Note that I'm currently not planning to review and integrate larger
patches from other developers. I only took over some libxml2
maintenance duties because no one else did. So even if you write a
high-quality patch, it might never get merged.
      
Thanks for making this clear upfront. This is how I ended up becoming
the gtkdoc maintainer :)

    
Caching external subsets for XIncludes certainly sounds like a nice
feature but I would prefer to find a simpler solution. For example,
can't you just omit the external DTD from included documents?
      
Yeah, right now, the benefit of having the DTD is that one can validate
fragments. I'll do some research (aka grepping over existing projects)
to see what the doc-type headers in use today look like. If all that
people do is use an entity to inject the version, I'll write a
migration tool.

We have a test that validates the doc, but I think I can change this to
just resolve all xincludes and check through the top-level doctype.
    
Just to add to this, I am assuming a lot of people follow this book
http://www.sagehill.net/docbookxsl/ModularDoc.html#UsingXinclude

and using a DOCTYPE is part of the examples.
  
You wrote:

      
and gtk-doc will replicate this for the fragments (replacing 'book' with
e.g. 'refentry'). This way one can e.g. inject things like a version.
        
What do you mean by "inject things like a version"? Why exactly do
your included documents have to reference an external DTD?
      
The documentation consists of a handwritten master doc (type book) that
includes more handwritten parts (e.g. tutorials, guides) and includes
generated reference docs. When gtkdoc generates the reference docs, it
takes the doctype header of the master doc as a template and
uses that for the generated reference docs. If the master doc has
entities declared, those can be expanded in the reference fragments.
That's the part where I will check how widely it is actually used.

Stefan

    
Another idea is to stop loading external DTDs for XIncludes without an
XPointer expression. This would still change the behavior for some
users but it's much less likely to cause problems.
      
Change the behaviour, as in we would not catch validation errors?
Too bad that xmlXIncludeParseFile() does not get the parent parserCtx;
if it did, we could apply the same flags.
  
Nick
      
I definitely don't know enough about the implications here. I was mostly
thinking to see if we can stick a dictionary of <dtd-identifier,
xmlDtdPtr> into the Parser Context and before actually loading a dtd,
check if we did already and reuse. Somehow the dict needs to be stored
in the top-level doc, when parsing is done (do we need the dtds once the
doc has been parsed?). We only free the dtds with the top-level doc. But
I agree, it is not going to be a two liner.
    
It seems that xmldict only handles keys and values that are strings,
right? So we'll even need our own cache data structure. I'd say it
would need to be on the _xmlXIncludeCtxt level. Global is easier, but
then we can't ever free it :/
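
The shape of the cache being discussed can be sketched in a few lines. This is Python rather than libxml2 C, and nothing in it is libxml2 API: IncludeContext, load_dtd, get_dtd and close are hypothetical names, and keying on the (external ID, system ID) pair is an assumption about what a suitable key might be.

```python
class IncludeContext:
    """Hypothetical stand-in for a per-run XInclude processing context."""

    def __init__(self, load_dtd):
        self._load_dtd = load_dtd   # callable that actually parses a DTD
        self._dtd_cache = {}        # (external_id, system_id) -> parsed DTD

    def get_dtd(self, external_id, system_id):
        # Parse each distinct DTD once and reuse it for later includes.
        key = (external_id, system_id)
        if key not in self._dtd_cache:
            self._dtd_cache[key] = self._load_dtd(external_id, system_id)
        return self._dtd_cache[key]

    def close(self):
        # The cache lives and dies with the context, so it can be freed,
        # which a global cache would not allow.
        self._dtd_cache.clear()
```

Tying the cache to the context object is what makes the lifetime question tractable: the cached DTDs are released when the context is, instead of lingering in a global.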

Stefan
  
Stefan


_______________________________________________
xml mailing list, project page  http://xmlsoft.org/
xml gnome org
https://mail.gnome.org/mailman/listinfo/xml
    

  
-- 
Eric S. Eberhard
VICS
2933 W Middle Verde Road
Camp Verde, AZ  86322

928-567-3727  work                      928-301-7537  cell

http://www.vicsmba.com/index.html             (our work)
http://www.vicsmba.com/ourpics/index.html     (fun pictures)
