[gi-docgen: 1/2] Ensure DevHelp sections & index.json have stable ordering

commit cc21241d4386d4f78dbcb087fd9a92899935cb5c
Author: Alexandre Macabies <web+oss zopieux com>
Date:   Wed Apr 7 02:54:22 2021 +0200

    Ensure DevHelp sections & index.json have stable ordering
    
    It was discovered through the NixOS reproducible builds initiative[0]
    that gi-docgen introduces non-determinism in the ordering of some
    DevHelp files[1] and index.json. It turns out this is caused by the
    concurrent generation of the various symbol sections, a performance
    optimization that inserts into a 'sections' dict as threaded workers
    complete, in an order that is by nature not reproducible. In the case
    of the index, I was not able to find the source of the randomness, but
    it's likely caused by file enumerations.
    
    This commit adds a final sort on the relevant dicts and lists to restore
    determinism in how these structures are iterated over between runs of
    the program. The exact iteration order does *not* matter, only the fact
    that it is stable given the same input. Since Python 3.7, dict
    iteration order is guaranteed to be insertion order[2], so a dict
    rebuilt from its sorted items iterates in sorted key order.
    
    This introduces virtually no performance penalty since Python does not
    copy the dict items, which are (str, list) tuples, and sorts lists
    in-place.
    
    Fixes #73.
    
    [0] https://r13y.com/
    [1] https://r13y.com/diff/af78aa6744b6df28036f25d6e6cbc4c5dac475a1e91d9c3c2b6532815d66b590-e22063a648e9a7c712ca6aa8b8bb9599ccabeb9f32b45cb24b1dc477af99b882.html
    [2] https://docs.python.org/3.7/library/stdtypes.html#typesmapping

 gidocgen/gdgenerate.py   | 4 ++++
 gidocgen/gdgenindices.py | 5 +++++
 2 files changed, 9 insertions(+)
---
diff --git a/gidocgen/gdgenerate.py b/gidocgen/gdgenerate.py
index d432209..1cabaad 100644
--- a/gidocgen/gdgenerate.py
+++ b/gidocgen/gdgenerate.py
@@ -2616,6 +2616,10 @@ def gen_reference(config, options, repository, templates_dir, theme_config, cont
             else:
                 template_symbols[section] = res
 
+    # The concurrent processing introduces non-determinism. Ensure iteration order is reproducible
+    # by sorting by key. This has virtually no overhead since the values are not copied.
+    template_symbols = dict(sorted(template_symbols.items()))
+
     ns_tmpl = jinja_env.get_template(theme_config.namespace_template)
     ns_file = os.path.join(ns_dir, "index.html")
     log.info(f"Creating namespace index file for {namespace.name}-{namespace.version}: {ns_file}")
diff --git a/gidocgen/gdgenindices.py b/gidocgen/gdgenindices.py
index 09005f8..ad1a5da 100644
--- a/gidocgen/gdgenindices.py
+++ b/gidocgen/gdgenindices.py
@@ -746,6 +746,11 @@ def gen_indices(config, repository, content_dir, output_dir):
         log.debug(f"Generating symbols for section {section}")
         generator(config, stemmer, index, repository, s)
 
+    # Ensure iteration order is reproducible by sorting symbols by type/name,
+    # and terms by key. This has no overhead since values are not copied.
+    index["symbols"].sort(key=lambda s: (s["type"], s["name"]))
+    index["terms"] = dict(sorted(index["terms"].items()))
+
     data = json.dumps(index, separators=(',', ':'))
     index_file = os.path.join(output_dir, "index.json")
     log.info(f"Creating index file for {namespace.name}-{namespace.version}: {index_file}")
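
For illustration, the effect of the two hunks above can be reproduced outside gi-docgen with a small standalone sketch. It is not gi-docgen code: build_section, generate and stable_index are hypothetical names, and the random sleep only stands in for workers finishing in an unpredictable order. The sorting calls are the same ones the patch adds.

```python
import json
import random
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def build_section(name):
    # Hypothetical worker: the random sleep makes completion order vary per run.
    time.sleep(random.uniform(0, 0.01))
    return name, [f"{name}_symbol"]


def generate(section_names):
    sections = {}
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(build_section, n) for n in section_names]
        for future in as_completed(futures):
            name, symbols = future.result()
            # Insertion order follows completion order, so it is not stable.
            sections[name] = symbols
    # Rebuild the dict in key order; keys are unique, so the list values
    # are never compared, and they are referenced rather than copied.
    return dict(sorted(sections.items()))


def stable_index(symbols, terms):
    index = {"symbols": list(symbols), "terms": dict(terms)}
    # Same calls as in the gdgenindices.py hunk: sort symbols by type/name
    # in place, and rebuild the terms dict in key order.
    index["symbols"].sort(key=lambda s: (s["type"], s["name"]))
    index["terms"] = dict(sorted(index["terms"].items()))
    return json.dumps(index, separators=(",", ":"))
```

Calling generate() repeatedly with the same section names should yield the same key order every time, and stable_index() should serialize to the same JSON string regardless of the order in which symbols and terms were collected.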

