Using the low-level API from Python / Please advise for WeasyPrint



Hi,

I’m a developer of WeasyPrint, an open-source layout and pagination engine for HTML/CSS written in Python: http://weasyprint.org

We render to cairo, with Pango and PangoCairo for text. (cairo can then export to various file formats, in particular PDF.)


*Warning*: long message ahead. Short version: how can we use the low-level Pango API from Python? Or should we even skip Pango and only use HarfBuzz, as suggested in http://behdad.org/text/ ?


In full detail:

We use Pango not only to actually render the text, but also in the earlier layout step to find line breaks and get various sizing information about the text (advance width, leading, baseline, …).

This is currently done with the high-level Pango API (Layout, LayoutLine, FontDescription, …) because that is all that is available from Python, either through PyGTK or through PyGObject3-introspection.

However, we cannot give whole paragraphs at a time to Pango since we need a lot of inline-level control: one line could contain multiple elements with images, vertical alignment, relative positioning, …

Therefore we have a horrible hack where uninterrupted chunks of text (stopping at the next HTML tag) are passed to a PangoLayout with the available width until the end of the current line. Then, only the first LayoutLine is kept and the rest thrown away. The process is repeated for every line of the remaining text.
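
To make the loop concrete, here is a rough sketch in Python. It is not the actual WeasyPrint code: first_line() is a hypothetical stand-in for "put the text in a PangoLayout and keep only the first LayoutLine", and it assumes every character is one unit wide so that it runs without Pango.

```python
def first_line(text, max_width):
    """Hypothetical stand-in for a PangoLayout whose first line is kept.

    Assumption: every character is exactly 1 unit wide.
    Returns (first line, remaining text).
    """
    words = text.split(" ")
    line = words[0]  # always place one word, even if it overflows
    for word in words[1:]:
        candidate = line + " " + word
        if len(candidate) > max_width:
            break
        line = candidate
    # Everything after the first line is thrown away and laid out again.
    return line, text[len(line):].lstrip(" ")


def split_lines(text, max_width):
    """Repeat the process for every line of the remaining text."""
    lines = []
    while text:
        line, text = first_line(text, max_width)
        lines.append(line)
    return lines


print(split_lines("the quick brown fox", 10))
```

Each call to first_line() looks at all of the remaining text again, which is where the inefficiency comes from.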

If you’re interested, the relevant code is in text.py and layout/inlines.py :
https://github.com/Kozea/WeasyPrint/tree/master/weasyprint

In addition to being obviously inefficient, this design has some limitations:

* The 'font-family' CSS property is just passed to FontDescription, so
  there is no conforming font matching algorithm[1] or @font-face[2]
* No way to add hyphenation (or did I miss something?)
* No control over line breaks. For example, when breaking at a space
  character, PangoLayout leaves the space at the end of the line and
  requires width for it, but does not report this width in the
  LayoutLine. If the available width is just enough for two
  words but not for two words plus a space, the break will be after the
  first word. This causes the "shrink-to-fit" algorithm to give
  incorrect results, as well as some CSS tests to fail.

[1] http://www.w3.org/TR/CSS21/fonts.html#algorithm
[2] Downloading fonts from the web:
http://www.w3.org/TR/css3-fonts/#font-resources
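
The line-break limitation in the list above can be illustrated with a toy model. This is only a sketch of the behaviour as I describe it, not Pango's actual algorithm: words_that_fit() is a made-up helper, and every character is assumed to be one unit wide.

```python
def words_that_fit(words, max_width, count_trailing_space):
    """Return how many leading words fit on a line of max_width units.

    Toy model, assuming each character is 1 unit wide. When
    count_trailing_space is true, a break after a word also needs room
    for the space that follows it (the behaviour described above).
    """
    fit = 1  # at least one word is always placed
    for count in range(1, len(words) + 1):
        width = len(" ".join(words[:count]))
        if count_trailing_space and count < len(words):
            width += 1  # the space left at the end of the line
        if width > max_width:
            break
        fit = count
    return fit


words = ["aa", "bb", "cc"]
# "aa bb" is 5 units wide, "aa bb " would be 6.
print(words_that_fit(words, 5, count_trailing_space=True))
print(words_that_fit(words, 5, count_trailing_space=False))
```

With an available width of 5, the first call places only "aa" while the second places "aa bb". Shrink-to-fit computes a width from the reported line sizes, so the text may then break differently when laid out in exactly that width.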

As the team working on WeasyPrint is very small, we’ve cut some corners and made design choices for ease of development rather than (for example) run-time speed. Using Pango like this has worked well enough (many thanks to all of you who worked on it!) but we’ll want to change that at some point in the future.

I think that the way forward is to switch to the low-level API. Could some introspection data be added to make it available from Python? Or should we write C and skip PyGObject? What about HarfBuzz, how is it relevant for this use case?


Thanks in advance for your advice.

Regards,
--
Simon Sapin

