[gtk-doc] TODO: more comments wrt. performance

From: Stefan Sauer <stefkost src gnome org>
To: commits-list gnome org
Cc:
Subject: [gtk-doc] TODO: more comments wrt. performance
Date: Sun, 20 May 2018 11:54:36 +0000 (UTC)
commit 8f355c4f432eee45720c3dc7bc42f6061b9879a6
Author: Stefan Sauer <ensonic users sf net>
Date:   Sun May 20 13:51:43 2018 +0200

    TODO: more comments wrt. performance

 TODO              |   35 +++++++++++++++++++++++++++--------
 gtkdoc/mkhtml2.py |   20 ++++++++++++++++----
 2 files changed, 43 insertions(+), 12 deletions(-)
---
diff --git a/TODO b/TODO
index 9fe99fd..7504973 100644
--- a/TODO
+++ b/TODO
@@ -414,10 +414,6 @@ grep "gst_caps_is_always_compatible" tags
         0m33.282s   0m29.266s  0m4.012s
       - removing the gentext calls for nav-bar alt tags does not help
 
-
-  - try plain docbook xslt to see if maybe we have bad xslt templates in the
-    customisation layer (gtk-doc.xsl)
-
   - we could do the xinlcude processing once and use it for both html and pdf
     time /usr/bin/xsltproc 2>../xslt4.log --path /home/ensonic/projects/gnome/gtk-doc/gtk-doc/tests/gobject/docs --nonet --xinclude --stringparam gtkdoc.bookname tester --stringparam gtkdoc.version 1.14 /home/ensonic/projects/gnome/gtk-doc/gtk-doc/gtk-doc.xsl ../tester-docs.xml
     real        user       sys
@@ -454,12 +450,35 @@ grep "gst_caps_is_always_compatible" tags
     - unfortunately there is no way to ask xsltproc to pre-transform an xslt, that could
       - strip comments
       - process xsl:import and xsl:include
-  - compile xslt
-    http://sourceforge.net/projects/xsltc/
-    http://www.xmlhack.com/read.php?item=618
   - extra xsltproc options:
     --novalid: saves ~ 0.12 sec
-
+    
+  - strip DOCTYPES on xincludes
+    - there is a performance bottleneck in libxml, where it parses DTD for each fragment
+    - we're using the dtd to
+      - validate fragments
+      - inject package name/version etc.
+    - 1) we could provide a flag to gtkdoc-mkdb to not put any doctype on
+         generated files and manually resolve entities in generated files (and
+         expand_content_files)
+      - to get a list of entities:
+        - we could parse entities from the main doc-file header
+          - tricky as with xml/gtkdocentities.ent, they are in an extra file
+        - we could pass entities as args for gtkdoc-mkdb
+        - if the flag is used, we should warn if entities are in the header
+    - 2) if the doctype on the main doc does not contain entities, skip adding
+         the doctype to generated output
+    - 3) if the doctype on the main doc contains entities, only add the doctype
+         if the generated content contains entities ([&%][_a-zA-Z]*;)
+    - a quick check on the gnome modules showed:
+      - quite a few don't use entities
+      - those that use version.xml
+        - seem to mostly use it in the main doc
+        - but a few use it for man pages
+          find . -name "*.xml" -exec grep -Hn "&version;" {} \; | grep -v "\-docs.xml"
+
+find . -name "*.xml" -exec egrep --color -Hn '&[_a-zA-Z]*;' {} \; | egrep -v '&(amp|lt|gt|quot|apos|nbsp);' | egrep --color '&[_a-zA-Z]*;'
+find . -name "*.xml" -exec egrep -o '&[_a-zA-Z]*;' {} \; | sort | uniq -c | sort -n
 
 = python =
 - consider swithcing to this markdown parser
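
Option 3 in the "strip DOCTYPES on xincludes" note above (only add the doctype if the generated content contains entities) could be sketched in Python roughly as follows. This is not code from gtk-doc: the names ENTITY_RE, PREDEFINED, custom_entities() and needs_doctype() are hypothetical. The character class is the one from the note, but with '+' instead of '*' so that a bare '&;' is not counted, and the five entities predefined by XML are ignored because they resolve without a DTD; both choices are assumptions.

```python
import re

# Character class from the TODO note; '+' instead of '*' is an assumption
# so that an empty name such as '&;' is not reported.
ENTITY_RE = re.compile(r'[&%]([_a-zA-Z]+);')

# Entities every XML parser resolves without a DTD.
PREDEFINED = frozenset(['amp', 'lt', 'gt', 'quot', 'apos'])


def custom_entities(content):
    """Return the sorted names of non-predefined entities used in content."""
    names = set(ENTITY_RE.findall(content))
    return sorted(names - PREDEFINED)


def needs_doctype(content):
    """True if content references an entity that only a DTD can resolve."""
    return bool(custom_entities(content))
```

A caller that generates a fragment would then emit the DOCTYPE only when needs_doctype(fragment) is true, which skips the per-fragment DTD parse for all fragments that use no custom entities.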
diff --git a/gtkdoc/mkhtml2.py b/gtkdoc/mkhtml2.py
index 6256129..fef4876 100644
--- a/gtkdoc/mkhtml2.py
+++ b/gtkdoc/mkhtml2.py
@@ -43,15 +43,25 @@ TODO:
   - convert_{figure,table} need counters.
 - check each docbook tag if it can contain #PCDATA, if not don't check for
   xml.text/xml.tail and add a comment (# no PCDATA allowed here)
-- consider some perf-warnings flag
-  - see 'No "id" attribute on'
 - find a better way to print context for warnings
   - we use 'xml.sourceline', but this all does not help a lot due to xi:include
 - consolidate title handling:
   - always use the titles-dict
+    - convert_title(): uses titles.get(tid)['title']
+    - convert_xref(): uses titles[tid]['tag'], ['title'] and ['xml']
+    - create_devhelp2_refsect2_keyword(): uses titles[tid]['title']
   - there only store what we have (xml, tag, ...)
   - when chunking generate 'id's and add entries to titles-dict
   - add accessors for title and raw_title that lazily get them
+  - see if any of the other ~10 places that call convert_title() could use this
+    cache
+- performance
+  - consider some perf-warnings flag
+    - see 'No "id" attribute on'
+  - xinclude processing in libxml2 is slow
+    - if we disable it, we get '{http://www.w3.org/2003/XInclude}include' tags
+      and we could try handling them ourself, in some cases those are subtrees
+      that we extract for chunking anyway
 
 DIFFERENCES:
 - titles
@@ -1761,11 +1771,13 @@ def main(module, index_file, out_dir, uninstalled, src_lang, paths):
     # 1) load the docuemnt
     _t = timer()
     # does not seem to be faster
-    # parser = etree.XMLParser(collect_ids=False)
+    # parser = etree.XMLParser(dtd_validation=False, collect_ids=False)
     # tree = etree.parse(index_file, parser)
     tree = etree.parse(index_file)
+    logging.warning("1a: %7.3lf: load doc", timer() - _t)
+    _t = timer()
     tree.xinclude()
-    logging.warning("1: %7.3lf: load doc", timer() - _t)
+    logging.warning("1b: %7.3lf: xinclude doc", timer() - _t)
 
     # 2) copy datafiles
     _t = timer()
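
The per-phase timing that this change introduces ("1a" for loading, "1b" for xinclude) repeats the same `_t = timer()` / `logging.warning(...)` pair for every step. A minimal sketch of how that pattern could be wrapped in a context manager is shown below. The helper name timed() and its results argument are hypothetical and not part of mkhtml2.py. To keep the sketch runnable without lxml, it uses the standard library's xml.etree.ElementTree and ElementInclude as stand-ins for the `etree.parse()` and `tree.xinclude()` calls in the patch.

```python
import io
import logging
import xml.etree.ElementTree as ET
from contextlib import contextmanager
from timeit import default_timer as timer
from xml.etree import ElementInclude


@contextmanager
def timed(step, what, results=None):
    """Log the wall-clock time of the with-block, e.g. '1a:   0.012: load doc'."""
    start = timer()
    try:
        yield
    finally:
        elapsed = timer() - start
        if results is not None:
            results[step] = elapsed  # keep the numbers for later comparison
        logging.warning("%s: %7.3f: %s", step, elapsed, what)


# Stand-in document; mkhtml2.py would parse index_file with lxml instead.
doc = io.BytesIO(b"<book><title>tester</title></book>")
results = {}

with timed("1a", "load doc", results):
    tree = ET.parse(doc)

with timed("1b", "xinclude doc", results):
    # Stdlib counterpart of lxml's tree.xinclude(); a no-op for this document.
    ElementInclude.include(tree.getroot())
```

Collecting the elapsed times in a dict makes it easy to compare the parse and xinclude phases across runs, which is the comparison the new "1a"/"1b" log lines are meant to enable.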