Re: [xml] Constraint validation for huge documents
- From: Stefan de Konink <stefan konink de>
- To: Nick Wellnhofer <wellnhofer aevum de>
- Cc: <xml gnome org>
- Date: Tue, 05 Jan 2021 19:12:10 +0100
Hi Nick,
Thanks for your reply. It does have a noticeable impact; although I
compiled libxml2-git yesterday, I overlooked it.
With the single constraint file:
libxml2-2.9.10
User time (seconds): 90.81
Elapsed (wall clock) time (h:mm:ss or m:ss): 1:31.60
libxml2-git
User time (seconds): 49.57
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:50.57
With the full constraint file:
libxml2-2.9.10
Not completed after 1 hour 30 min
libxml2-git
User time (seconds): 900.60
Elapsed (wall clock) time (h:mm:ss or m:ss): 15:02.87
Yesterday I wrote a custom validator in lxml for key/keyref and unique
constraints. It first validates the document syntactically using the normal
libxml2 code, then fetches all constraints (this might be a shortcut) and
builds a hash set per constraint. This step can be executed in parallel, one
task per constraint. If the number of elements per constraint is taken into
account (by heuristics, if the same XSD is used over time), the work can be
scheduled so that parallelism is sustained for a longer part of the run.
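To make the idea concrete, here is a minimal sketch of the hash-set approach. It is not my actual lxml code: it uses only the Python standard library, and the element names, attribute names and the way constraints are described (selector path plus field attribute) are hypothetical. A real validator would read the selectors and fields from the xs:key, xs:unique and xs:keyref definitions in the XSD.

```python
from concurrent.futures import ThreadPoolExecutor
import xml.etree.ElementTree as ET


def collect_values(root, selector, field):
    """Return the values of attribute `field` on all elements matching `selector`."""
    return [el.get(field) for el in root.iterfind(selector)
            if el.get(field) is not None]


def check_unique(root, selector, field):
    """Return the set of values that occur more than once (unique/key violations)."""
    seen, duplicates = set(), set()
    for value in collect_values(root, selector, field):
        if value in seen:
            duplicates.add(value)
        seen.add(value)
    return duplicates


def check_keyref(root, key, ref):
    """Return referenced values that have no matching key (keyref violations)."""
    keys = set(collect_values(root, *key))
    return {v for v in collect_values(root, *ref) if v not in keys}


def validate(root, uniques, keyrefs, workers=8):
    """Run one task per constraint; the tree is only read, never modified."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {}
        for name, (selector, field) in uniques.items():
            futures[name] = pool.submit(check_unique, root, selector, field)
        for name, (key, ref) in keyrefs.items():
            futures[name] = pool.submit(check_keyref, root, key, ref)
        return {name: future.result() for name, future in futures.items()}


# Hypothetical document: one duplicated stop id and one dangling reference.
DOC = '<net><stop id="a"/><stop id="b"/><stop id="a"/><link to="b"/><link to="c"/></net>'
root = ET.fromstring(DOC)
result = validate(
    root,
    uniques={"stopKey": (".//stop", "id")},
    keyrefs={"linkRef": ((".//stop", "id"), (".//link", "to"))},
)
print(result)
```

The threads here only illustrate the split into one task per constraint. How much real parallelism you get depends on the interpreter and on whether the XML library releases the interpreter lock while it works.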
With multithreading (8):
User time (seconds): 1136.37
Percent of CPU this job got: 388%
Elapsed (wall clock) time (h:mm:ss or m:ss): 4:57.09
Without multithreading:
User time (seconds): 709.82
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 11:52.15
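As a rough reading of the two runs above, the following sketch derives the wall-clock speedup and the extra CPU time spent by the multithreaded run. The numbers are copied from the timings above; the helper name to_seconds is only illustrative.

```python
def to_seconds(clock):
    """Convert an 'h:mm:ss' or 'm:ss' wall-clock string to seconds."""
    seconds = 0.0
    for part in clock.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


wall_multi = to_seconds("4:57.09")    # with multithreading
wall_single = to_seconds("11:52.15")  # without multithreading

# Wall-clock speedup gained from running the constraints in parallel.
speedup = wall_single / wall_multi

# Ratio of total user CPU time: how much more CPU work the threaded run did.
cpu_ratio = 1136.37 / 709.82

print(f"speedup: {speedup:.2f}x, user time ratio: {cpu_ratio:.2f}x")
```

So the threaded run finishes in well under half the time, but it spends noticeably more total CPU time doing so.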
I assume that the optimisation currently present in git is a serious
improvement. Sure, it is still not 'perfect', but I think that doing the
validation in parallel might be worth exploring.
--
Stefan