Re: [xml] Constraint validation for huge documents



Hi Nick,


Thanks for your reply. It does have a noticeable impact; although I compiled libxml2-git yesterday, I overlooked it.


With the single constraint file:
libxml2-2.9.10
User time (seconds): 90.81
Elapsed (wall clock) time (h:mm:ss or m:ss): 1:31.60

libxml2-git
User time (seconds): 49.57
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:50.57

With the full constraint file:
libxml2-2.9.10
Not completed after 1 hour 30 min

libxml2-git
User time (seconds): 900.60
Elapsed (wall clock) time (h:mm:ss or m:ss): 15:02.87


Yesterday I wrote a custom validator in lxml for key/keyref and unique constraints. It first validates the document syntactically using the normal libxml2 code, then fetches all constraints (this might be a shortcut) and builds a hash set per constraint. Each constraint can be processed in parallel. If the number of elements per constraint is taken into account (by heuristics, if the same XSD is used over time), the work can be scheduled so that the threads stay busy in parallel for a longer part of the run.

With multithreading (8):
User time (seconds): 1136.37
Percent of CPU this job got: 388%
Elapsed (wall clock) time (h:mm:ss or m:ss): 4:57.09

Without multithreading:
User time (seconds): 709.82
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 11:52.15
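
The per-constraint hash-set approach described above can be sketched roughly as follows. This is not the actual validator, only a minimal illustration under several assumptions: it uses the standard library's xml.etree.ElementTree instead of lxml, the names Constraint, collect and validate are hypothetical, fields are restricted to attributes, and every constraint is scoped to the whole document rather than to the element that declares it. Pure-Python threads are also limited by the interpreter lock, so this sketch shows the structure of the approach and should not be expected to reproduce the timings above.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
import xml.etree.ElementTree as ET


@dataclass(frozen=True)
class Constraint:          # hypothetical description of one identity constraint
    name: str
    kind: str              # "unique", "key" or "keyref"
    selector: str          # ElementTree path, evaluated from the root
    fields: tuple          # attribute names that form the value tuple
    refer: str = ""        # for keyref: name of the referenced key/unique


def collect(root, c):
    """Build the hash set for one constraint; return (set, errors)."""
    seen, errors = set(), []
    for elem in root.findall(c.selector):
        value = tuple(elem.get(f) for f in c.fields)
        if None in value:
            # A key requires every field; unique and keyref skip
            # elements with an incomplete tuple.
            if c.kind == "key":
                errors.append(f"{c.name}: missing field in <{elem.tag}>")
            continue
        if c.kind != "keyref" and value in seen:
            errors.append(f"{c.name}: duplicate {value}")
        seen.add(value)
    return seen, errors


def validate(root, constraints, workers=8):
    """Collect all constraints in parallel, then resolve keyrefs."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda c: collect(root, c), constraints))
    tables = {c.name: seen for c, (seen, _) in zip(constraints, results)}
    errors = [e for _, errs in results for e in errs]
    for c in constraints:
        if c.kind == "keyref":
            # Every referring tuple must exist in the referenced table.
            for value in sorted(tables[c.name] - tables[c.refer]):
                errors.append(f"{c.name}: no match for {value} in {c.refer}")
    return errors
```

Because each constraint only reads the tree and writes to its own set, the collection step needs no locking; only the keyref resolution has to wait until all tables are complete.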



I assume that the optimisation currently present in git is a serious improvement. It is still not 'perfect', but I think that doing the validation in parallel might be worth exploring.

--
Stefan

