Building structured documents¶
Any structured documents can be viewed as a tree of higher-level
elements (such as chapters or sections) that contains smaller elements
(like subsections or lists) that each in turn contains even smaller
elements (like paragraphs or list items). When using ferenda, you can
create documents by creating such trees of elements. The
ferenda.elements
module contains classes for such elements.
Most of the classes can be used like python lists (and are, in fact,
subclasses of list
). Unlike the aproach used by
xml.etree.ElementTree
and BeautifulSoup
, where all
objects are of a specific class, and a object property determines the
type of element, the element objects are of different classes if the
elements are different. This means that elements representing a
paragraph are ferenda.elements.Paragraph
, and elements
representing a document section are
ferenda.elements.Section
and so on. The core
ferenda.elements
module contains around 15 classes that
covers many basic document elements, and the submodule
ferenda.elements.html
contains classes that correspond to
all HTML tags. There is some functional overlap between these two
module, but ferenda.elements
contains several constructs
which aren’t directly expressible as HTML elements
(eg. Page
,
:~py:class:ferenda.elements.SectionalElement and
:~py:class:ferenda.elements.Footnote)
Each element constructor (or at least those derived from
CompoundElement
) takes a list as an
argument (same as list
), but also any number of keyword
arguments. This enables you to construct a simple document like this:
from ferenda.elements import Body, Heading, Paragraph, Footnote
doc = Body([Heading(["About Doc 43/2012 and it's interpretation"],predicate="dcterms:title"),
Paragraph(["According to Doc 43/2012",
Footnote(["Available at http://example.org/xyz"]),
" the bizbaz should be frobnicated"])
])
Note
Since CompoundElement
works like
list
, which is initialized with any iterable, you
should normalliy initialize it with a single-element list of
strings. If you initialize it directly with a string, the
constructor will treat that string as an iterable and create one
child element for every character in the string.
Creating your own element classes¶
The exact structure of documents differ greatly. A general document format such as XHTML or ODF cannot contain special constructs for preamble recitals of EC directives or patent claims of US patents. But your own code can create new classes for this. Example:
from ferenda.elements import CompoundElement, OrdinalElement
class Preamble(CompoundElement): pass
class PreambleRecital(CompoundElement,OrdinalElement):
tagname = "div"
rdftype = "eurlex:PreambleRecital"
doc = Preamble([PreambleRecital("Un",ordinal=1)],
[PreambleRecital("Deux",ordinal=2)],
[PreambleRecital("Trois",ordinal=3)])
Mixin classes¶
As the above example shows, it’s possible and even recommended to use multiple inheritance to compose objects by subclassing two classes – one main class who’s semantics you’re extending, and one mixin class that contains particular properties. The following classes are useful as mixins:
OrdinalElement
: for representing elements with some sort of ordinal numbering. An ordinal element has anordinal
property, and different ordinal objects can be compared or sorted. The sort is based on the ordinal property. The ordinal property is a string, but comparisons/sorts are done in a natural way, i.e. “2” < “2 a” < “10”.TemporalElement
: for representing things that has a start and/or a end date. A temporal element has anin_effect
method which takes a date (or uses today’s date if none given) and returns true if that date falls between the start and end date.
Rendering to XHTML¶
The built-in classes are rendered as XHTML by the built-in method
render_xhtml()
, which first
creates a <head>
section containing all document-level metadata
(i.e. the data you have specified in your documents meta
property), and then calls the as_xhtml
method on the root body
element. The method is called with doc.uri
as a single argument,
which is then used as the RDF subject for all triples in the document
(except for those sub-elements which themselves have a uri
property)
All built-in element classes derive from
AbstractElement
, which contains a generic
implementation of as_xhtml()
,
that recursively creates a lxml element tree from itself and it’s
children.
Your own classes can specify how they are to be rendered in XHTML by
overriding the tagname
and
classname
properties, or for
full control, the as_xhtml()
method.
As an example, the class SectionalElement
overrides as_xhtml
to the effect that if you provide
identifier
, ordinal
and title
properties for the object, a
resource URI is automatically constructed and four RDF triples are
created (rdf:type, dcterms:title, dcterms:identifier, and bibo:chapter):
from ferenda.elements import SectionalElement
p = SectionalElement(["Some content"],
ordinal = "1a",
identifier = "Doc pt 1(a)",
title="Title or name of the part")
body = Body([p])
from lxml import etree
etree.tostring(body.as_xhtml("http://example.org/doc"))
…which results in:
<body xmlns="http://www.w3.org/1999/xhtml" about="http://example.org/doc">
<div about="http://example.org/doc#S1a"
typeof="bibo:DocumentPart"
property="dcterms:title"
content="Title or name of the part"
class="sectionalelement">
<span href="http://example.org/doc"
rel="dcterms:isPartOf" />
<span about="http://example.org/doc#S1a"
property="dcterms:identifier"
content="Doc pt 1(a)" />
<span about="http://example.org/doc#S1a"
property="bibo:chapter"
content="1a" />
Some content
</div>
</body>
However, this is a convenience method of SectionalElement, amd may not
be appropriate for your needs. The general way of attaching metdata to
document parts, as specified in Metadata about parts of the document, is to
provide each document part with a uri
and meta
property. These
are then automatically serialized as RDFa statements by the default
as_xhtml
implementation.
Convenience methods¶
Your element tree structure can be serialized to well-formed XML using
the serialize()
method. Such a
serialization can be turned back into the same tree using
deserialize()
. This is primarily useful
during debugging.
You might also find the
as_plaintext
method
useful. It works similar to
as_xhtml
, but returns a
plaintext string with the contents of an element, including all
sub-elements
The ferenda.elements.html
module contains the method
elements_from_soup()
which converts a
BeautifulSoup tree into the equivalent tree of element objects.