Creating your own document repositories¶
The next step is to do more substantial adjustments to the download/parse/generate cycle. As the source for our next docrepo we’ll use the collected RFCs, as published by IETF. These documents are mainly available in plain text format (formatted for printing on a line printer), as is the document index itself. This means that we cannot rely on the default implementation of download and parse. Furthermore, RFCs are categorized and refer to each other using varying semantics. This metadata can be captured, queried and used in a number of ways to present the RFC collection in a better way.
Writing your own download implementation¶
The purpose of the download() method is to fetch source documents from a remote source and store them locally, possibly under different filenames but otherwise bit-for-bit identical with how they were stored at the remote source (see File storage for more information about how and where files are stored locally).
The default implementation of download() uses a small number of methods and class variables to do the actual work. By selectively overriding these, you can often avoid rewriting a complete implementation of download().
A simple example¶
We’ll start out by creating a class similar to our W3C class in
First steps. All RFC documents are listed in the index file at
http://www.ietf.org/download/rfc-index.txt, while an individual
document (such as RFC 6725) is available at
http://tools.ietf.org/rfc/rfc6725.txt. Our first attempt will look
like this (save as rfcs.py):
import re
from datetime import datetime, date

import requests

from ferenda import DocumentRepository, TextReader
from ferenda import util
from ferenda.decorators import downloadmax

class RFCs(DocumentRepository):
    alias = "rfc"
    start_url = "http://www.ietf.org/download/rfc-index.txt"
    document_url_template = "http://tools.ietf.org/rfc/rfc%(basefile)s.txt"
    downloaded_suffix = ".txt"
And we’ll enable it and try to run it like before:
$ ./ferenda-build.py rfcs.RFCs enable
$ ./ferenda-build.py rfc download
This doesn’t work! That is because the start page contains no actual HTML
links – it’s a plain text file. We need to parse the index text file
to find all available basefiles. In order to do that, we must
override download().
def download(self):
    self.log.debug("download: Start at %s" % self.start_url)
    indextext = requests.get(self.start_url).text
    reader = TextReader(string=indextext)  # see TextReader class
    iterator = reader.getiterator(reader.readparagraph)
    if not isinstance(self.config.downloadmax, (int, type(None))):
        self.config.downloadmax = int(self.config.downloadmax)
    for basefile in self.download_get_basefiles(iterator):
        self.download_single(basefile)
@downloadmax
def download_get_basefiles(self, source):
    for p in reversed(list(source)):
        if re.match(r"^(\d{4}) ", p):     # looks like an RFC number
            if "Not Issued." not in p:    # skip RFCs known to not exist
                basefile = str(int(p[:4]))  # eg. '0822' -> '822'
                yield basefile
Since the RFC index is a plain text file, we use the TextReader class, which contains a bunch of functionality to make it easier to work with plain text files. In this case, we iterate through the file one paragraph at a time, and if the paragraph starts with a four-digit number (and the number hasn’t been marked “Not Issued.”) we download it by calling download_single().
Like the default implementation, we offload the main work to download_single(), which checks whether the file already exists on disk and, only if it doesn’t, attempts to download it. If the --refresh parameter is provided, a conditional GET is performed, and the document is re-downloaded only if the server says it has changed.
Note
In many cases, the URL for the downloaded document is not easily constructed from a basefile identifier. download_single() therefore takes an optional url argument. The above could be written more verbosely like:
url = "http://tools.ietf.org/rfc/rfc%s.txt" % basefile
self.download_single(basefile, url)
In other cases, a document to be downloaded may consist of several resources (e.g. an HTML document with images, or a PDF document containing the actual content combined with an HTML document containing the metadata). For these cases, you need to override download_single().
The main flow of the download process¶
The main flow is that the download() method itself does some source-specific setup, which often includes downloading some sort of index or search results page. The location of that index resource is given by the class variable start_url.
download() then calls download_get_basefiles(), which returns an iterator of basefiles.
For each basefile, download_single() is called. This method is responsible for downloading everything related to a single document. Most of the time this is just a single file, but it can occasionally be a set of files (like an HTML document with accompanying images, or a set of PDF files that conceptually form a single document).
The default implementation of download_single() assumes that a document is just a single file, and calculates the URL of that document by calling the remote_url() method.
The default remote_url() method uses the class variable document_url_template. This string template should use string formatting and expect a variable called basefile. The default implementation of remote_url() can, in other words, only be used if the URLs of the remote source are predictable and directly based on the basefile.
Note
In many cases, the URL for the remote version of a document cannot be calculated from the basefile alone, but is readily available from the main index page or search result page. For those cases, download_get_basefiles() should return an iterator that yields (basefile, url) tuples. The default implementation of download() handles this, and uses url as the second, optional argument to download_single().
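A sketch of what such a tuple-yielding implementation might look like. This is a standalone function for illustration (a real implementation would be a method taking self), and the regex and sample link data are made up:

```python
import re

def download_get_basefiles(source):
    # "source" yields (element, attribute, link, pos) tuples, the
    # shape produced by lxml.html.iterlinks
    for element, attribute, link, pos in source:
        m = re.search(r"rfc(\d+)\.txt$", link)
        if m:
            # yield (basefile, url) so that download() can pass the
            # url on to download_single()
            yield str(int(m.group(1))), link

links = [(None, "href", "http://tools.ietf.org/rfc/rfc0822.txt", 0)]
result = list(download_get_basefiles(links))
print(result)  # [('822', 'http://tools.ietf.org/rfc/rfc0822.txt')]
```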
Finally, the actual downloading of individual files is done by the download_if_needed() method. As the name implies, this method tries to avoid downloading anything from the network unless strictly needed. If a file is already in place, a conditional GET is done (using the timestamp of the file for an If-Modified-Since header, and an associated .etag file for an If-None-Match header). This avoids re-downloading the (potentially large) file if it hasn’t changed.
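The header-building half of that logic can be sketched as follows. This is an illustrative standalone function, not ferenda's actual download_if_needed implementation; the .etag sidecar file convention follows the description above:

```python
import email.utils
import os

def conditional_headers(localfile):
    # Build conditional-GET request headers: If-Modified-Since from
    # the local file's timestamp, If-None-Match from an associated
    # .etag file. An empty dict means a plain, unconditional GET.
    headers = {}
    if os.path.exists(localfile):
        mtime = os.path.getmtime(localfile)
        headers["If-Modified-Since"] = email.utils.formatdate(mtime, usegmt=True)
        etagfile = localfile + ".etag"
        if os.path.exists(etagfile):
            with open(etagfile) as fp:
                headers["If-None-Match"] = fp.read().strip()
    return headers
```

Usage would be along the lines of `resp = requests.get(url, headers=conditional_headers(path))`; a 304 response means the local copy is still current.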
To summarize: The main chain of calls looks something like this:
download
    start_url (class variable)
    download_get_basefiles (instance method) - iterator
    download_single (instance method)
        remote_url (instance method)
            document_url_template (class variable)
        download_if_needed (instance method)
These are the methods that you may override, and when you might want to do so:
method | default behaviour | override when
---|---|---
download | Downloads the contents of start_url and extracts all links using lxml.html.iterlinks, which are passed to download_get_basefiles. For each basefile returned, calls download_single. | Your documents are not all linked from a single index page (e.g. paged search results). In these cases, you should override download_get_basefiles as well, and make that method responsible for fetching all pages of search results.
download_get_basefiles | Iterates through the (element, attribute, link, url) tuples from the source and checks whether link matches basefile_regex or url matches document_url_regex. If so, yields a (text, url) tuple. | The basefile/url extraction is more complicated than what the basefile_regex / document_url_regex mechanism can achieve, or you have overridden download to pass something other than a link iterator. Note that your implementation must be an iterator, using the yield statement for each basefile found.
download_single | Calculates the URL of the document to download (or, if a URL is provided, uses that), and calls download_if_needed with it. Afterwards, updates the DocumentEntry of the document to reflect the source URL and download timestamps. | The complete contents of your document are spread over several files. In these cases, start with the main file and call download_if_needed for it, then calculate URLs and file paths (using the attachment parameter to store.downloaded_path) for each additional file, and call download_if_needed for each. Finally, update the DocumentEntry object.
remote_url | Calculates a URL from a basefile using document_url_template. | The rules for producing a URL from a basefile are more complicated than what string formatting can achieve.
download_if_needed | Downloads an individual URL to a local file. Makes sure the local file has the same timestamp as the Last-modified header from the server. If an older version of the file is present, it can either be archived (the default) or overwritten. | You really shouldn’t.
The optional basefile argument¶
During early stages of development, it’s often useful to download just a single document, both to verify that download_single works as it should, and to have sample documents for parse. When using the ferenda-build.py tool, the download command can take a single optional parameter, e.g.:
./ferenda-build.py rfc download 6725
If provided, this parameter is passed to the download method as the optional basefile parameter. The default implementation of download checks if this parameter is provided, and if so, simply calls download_single with that parameter, skipping the full download procedure. If you’re overriding download, you should support this usage by starting your implementation with something like this:
def download(self, basefile=None):
    if basefile:
        return self.download_single(basefile)
    # the rest of your code
The downloadmax() decorator¶
As we saw in Introduction to Ferenda, the built-in docrepos support a downloadmax configuration parameter. The effect of this parameter is simply to interrupt the downloading process after a certain number of documents have been downloaded. This can be useful when doing integration-type testing, or if you just want to make it easy for someone else to try out your docrepo class. The separation between the main download() method and the download_get_basefiles() helper method makes this easy – just add the @downloadmax decorator to the latter. This decorator reads the downloadmax configuration parameter (it also looks for a FERENDA_DOWNLOADMAX environment variable) and, if set, limits the number of basefiles returned by download_get_basefiles().
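Conceptually, such a decorator only needs to wrap the generator and slice it. A simplified sketch (reading only the environment variable, unlike the real decorator, which also consults self.config):

```python
import functools
import itertools
import os

def downloadmax_sketch(f):
    # Wrap a basefile-yielding generator and cut it off after
    # FERENDA_DOWNLOADMAX items, if that variable is set.
    @functools.wraps(f)
    def wrapper(*args, **kwargs):
        it = f(*args, **kwargs)
        maxcount = os.environ.get("FERENDA_DOWNLOADMAX")
        if maxcount:
            it = itertools.islice(it, int(maxcount))
        return it
    return wrapper

@downloadmax_sketch
def basefiles():
    yield from ("1", "2", "3", "4", "5")

os.environ["FERENDA_DOWNLOADMAX"] = "2"
print(list(basefiles()))  # ['1', '2']
```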
Writing your own parse implementation¶
The purpose of the parse() method is to take the downloaded file(s) for a particular document and parse them into a structured document with proper metadata, both for the document as a whole and for individual sections of the document.
# In order to properly handle our RDF data, we need to tell
# ferenda which namespaces we'll be using. These will be available
# as rdflib.Namespace objects in the self.ns dict, which means you
# can state that something is eg. a dcterms:title by using
# self.ns['dcterms'].title. See
# :py:data:`~ferenda.DocumentRepository.namespaces`
namespaces = ('rdf',     # always needed
              'dcterms', # title, identifier, etc
              'bibo',    # Standard and DocumentPart classes, chapter prop
              'xsd',     # datatypes
              'foaf',    # rfcs are foaf:Documents for now
              ('rfc', 'http://example.org/ontology/rfc/'))

from rdflib import Namespace
rdf_type = Namespace('http://example.org/ontology/rfc/').RFC
from ferenda.decorators import managedparsing

@managedparsing
def parse(self, doc):
    # some very simple heuristic rules for determining
    # what an individual paragraph is
    def is_heading(p):
        # If it's on a single line and it isn't indented with spaces
        # it's probably a heading.
        if p.count("\n") == 0 and not p.startswith(" "):
            return True

    def is_pagebreak(p):
        # if it contains a form feed character, it represents a page break
        return "\f" in p

    # Parsing a document consists mainly of two parts:
    # 1: First we parse the body of text and store it in doc.body
    from ferenda.elements import Body, Preformatted, Title, Heading
    from ferenda import Describer
    reader = TextReader(self.store.downloaded_path(doc.basefile))

    # First paragraph of an RFC is always a header block
    header = reader.readparagraph()
    # Preformatted is a ferenda.elements class representing a
    # block of preformatted text. It is derived from the built-in
    # list type, and must thus be initialized with an iterable, in
    # this case a single-element list of strings. (Note: if you
    # try to initialize it with a string, because strings are
    # iterables as well, you'll end up with a list where each
    # character in the string is an element, which is not what you
    # want.)
    preheader = Preformatted([header])
    # doc.body is a ferenda.elements.Body class, which is also
    # derived from list, so it has (amongst others) the append
    # method. We build our document by adding to this root
    # element.
    doc.body.append(preheader)

    # Second paragraph is always the title, and we don't include
    # this in the body of the document, since we'll add it to the
    # metadata -- once is enough
    title = reader.readparagraph()

    # After that, just iterate over the document and guess what
    # everything is. TextReader.getiterator is useful for
    # iterating through a text in other chunks than single lines
    for para in reader.getiterator(reader.readparagraph):
        if is_heading(para):
            # Heading is yet another of these ferenda.elements
            # classes.
            doc.body.append(Heading([para]))
        elif is_pagebreak(para):
            # Just drop these remnants of a page-and-paper-based past
            pass
        else:
            # If we don't know that it's something else, it's a
            # preformatted section (the safest bet for RFC text).
            doc.body.append(Preformatted([para]))

    # 2: Then we create metadata for the document and store it in
    # doc.meta (in this case using the convenience
    # ferenda.Describer class).
    desc = Describer(doc.meta, doc.uri)

    # Set the rdf:type of the document
    desc.rdftype(self.rdf_type)

    # Set the title we've captured as the dcterms:title of the
    # document and specify that it is in English
    desc.value(self.ns['dcterms'].title, util.normalize_space(title), lang="en")

    # Construct the dcterms:identifier (eg "RFC 6991") for this
    # document from the basefile
    desc.value(self.ns['dcterms'].identifier, "RFC " + doc.basefile)

    # find and convert the publication date in the header to a
    # datetime object, and set it as the dcterms:issued date for
    # the document
    re_date = re.compile(r"(January|February|March|April|May|June|July|August|September|October|November|December) (\d{4})").search
    dt_match = re_date(header)
    if dt_match:
        # util.c_locale() is a context manager that temporarily
        # sets the system locale to the "C" locale in order to be
        # able to use strptime with a string of the form "August
        # 2013", even though the system may use another locale.
        with util.c_locale():
            dt = datetime.strptime(dt_match.group(0), "%B %Y")
        pubdate = date(dt.year, dt.month, dt.day)
        # Note that using some python types (cf. datetime.date)
        # results in a datatyped RDF literal, ie in this case
        #   <http://localhost:8000/res/rfc/6994> dcterms:issued "2013-08-01"^^xsd:date
        desc.value(self.ns['dcterms'].issued, pubdate)

    # find any older RFCs that this document updates or obsoletes
    obsoletes = re.search(r"^Obsoletes: ([\d+, ]+)", header, re.MULTILINE)
    updates = re.search(r"^Updates: ([\d+, ]+)", header, re.MULTILINE)

    # Find the category of this RFC, store it as dcterms:subject
    cat_match = re.search(r"^Category: ([\w ]+?)( |$)", header, re.MULTILINE)
    if cat_match:
        desc.value(self.ns['dcterms'].subject, cat_match.group(1))

    for predicate, matches in ((self.ns['rfc'].updates, updates),
                               (self.ns['rfc'].obsoletes, obsoletes)):
        if matches is None:
            continue
        # add references between this document and these older
        # rfcs, using either rfc:updates or rfc:obsoletes
        for match in matches.group(1).strip().split(", "):
            uri = self.canonical_uri(match)
            # Note that this uses our own unofficial
            # namespace/vocabulary http://example.org/ontology/rfc/
            desc.rel(predicate, uri)

    # And now we're done. We don't need to return anything, as
    # we've modified the Document object that was passed to us.
    # The calling code will serialize this modified object to
    # XHTML and RDF and store it on disk
This implementation builds a very simple object model of an RFC document, which is serialized to an XHTML1.1+RDFa document by the managedparsing() decorator. If you run it (by calling ferenda-build.py rfc parse --all) after having downloaded the RFC documents, the result will be a set of documents in data/rfc/parsed, and a set of RDF files in data/rfc/distilled. Take a look at them! The above might appear to be a lot of code, but it also accomplishes much. Furthermore, it should be obvious how to extend it, for instance to create more metadata from the fields in the header (such as capturing the RFC category, the publishing party, the authors etc.) and a better semantic representation of the body (such as marking up regular paragraphs, line drawings, bulleted lists, definition lists, EBNF definitions and so on).
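For example, capturing the publishing organisation could be one more regex over the header block. This is a hedged sketch: the sample header is made up, real RFC headers vary, and the pattern below is only a starting point:

```python
import re

header = ("Internet Engineering Task Force (IETF)                S. Hartman\n"
          "Request for Comments: 6631                   Painless Security\n")

# The left-hand column of the first header line usually names the
# publishing stream/organisation: everything up to the first run of
# two or more spaces.
m = re.match(r"([A-Za-z ()]+?)\s{2,}", header)
org = m.group(1)
print(org)  # Internet Engineering Task Force (IETF)
```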
Next up, we’ll extend this implementation in two ways: first by representing the nested nature of the sections and subsections in the documents, and secondly by finding and linking citations/references, both to other parts of the text and to other RFCs.
Note
How does ./ferenda-build.py rfc parse --all work? It calls list_basefiles_for() with the argument parse, which lists all downloaded files and extracts the basefile for each of them, then calls parse for each in turn.
Handling document structure¶
The main text of an RFC is structured into sections, which may contain subsections, which in turn can contain subsubsections. The start of each section is easy to identify, which means we can build a model of this structure by extending our parse method with relatively few lines:
from ferenda.elements import Section, Subsection, Subsubsection

# More heuristic rules: Section headers start at the beginning
# of a line and are numbered. Subsections and subsubsections
# have dotted numbers, optionally with a trailing period, ie
# '9.2.' or '11.3.1'
def is_section(p):
    return re.match(r"\d+\.? +[A-Z]", p)

def is_subsection(p):
    return re.match(r"\d+\.\d+\.? +[A-Z]", p)

def is_subsubsection(p):
    return re.match(r"\d+\.\d+\.\d+\.? +[A-Z]", p)

def split_sectionheader(p):
    # returns a tuple of title, ordinal, identifier
    ordinal, title = p.split(" ", 1)
    ordinal = ordinal.strip(".")
    return title.strip(), ordinal, "RFC %s, section %s" % (doc.basefile, ordinal)

# Use a list as a simple stack to keep track of the nesting
# depth of a document. Every time we create a Section,
# Subsection or Subsubsection object, we push it onto the
# stack (and clear the stack down to the appropriate nesting
# depth). Every time we create some other object, we append it
# to whatever object is at the top of the stack. As your rules
# for representing the nesting of structure become more
# complicated, you might want to use the
# :class:`~ferenda.FSMParser` class, which lets you define
# heuristic rules (recognizers), states and transitions, and
# takes care of putting your structure together.
stack = [doc.body]

for para in reader.getiterator(reader.readparagraph):
    if is_section(para):
        title, ordinal, identifier = split_sectionheader(para)
        s = Section(title=title, ordinal=ordinal, identifier=identifier)
        stack[1:] = []      # clear all but bottom element
        stack[0].append(s)  # add new section to body
        stack.append(s)     # push new section on top of stack
    elif is_subsection(para):
        title, ordinal, identifier = split_sectionheader(para)
        s = Subsection(title=title, ordinal=ordinal, identifier=identifier)
        stack[2:] = []      # clear all but bottom two elements
        stack[1].append(s)  # add new subsection to current section
        stack.append(s)
    elif is_subsubsection(para):
        title, ordinal, identifier = split_sectionheader(para)
        s = Subsubsection(title=title, ordinal=ordinal, identifier=identifier)
        stack[3:] = []       # clear all but bottom three
        stack[-1].append(s)  # add new subsubsection to current subsection
        stack.append(s)
    elif is_heading(para):
        stack[-1].append(Heading([para]))
    elif is_pagebreak(para):
        pass
    else:
        pre = Preformatted([para])
        stack[-1].append(pre)
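The slice-assignment idiom that keeps the stack at the right depth is worth seeing in isolation:

```python
# stack[n:] = [] truncates the list in place to depth n, so after
# meeting a new Section we drop everything above the body, after a
# new Subsection everything above the current Section, and so on.
stack = ["body", "section", "subsection", "subsubsection"]
stack[2:] = []  # as when a new subsection is encountered
print(stack)  # ['body', 'section']
```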
This enhances parse so that instead of outputting a single long list of elements directly under body:
<h1>2. Overview</h1>
<h1>2.1. Date, Location, and Participants</h1>
<pre>
The second ForCES interoperability test meeting was held by the IETF
ForCES Working Group on February 24-25, 2011...
</pre>
<h1>2.2. Testbed Configuration</h1>
<h1>2.2.1. Participants' Access</h1>
<pre>
NTT and ZJSU were physically present for the testing at the Internet
Technology Lab (ITL) at Zhejiang Gongshang University in China.
</pre>
…we have a properly nested element structure, as well as much more metadata represented in RDFa form:
<div class="section" property="dcterms:title" content=" Overview"
     typeof="bibo:DocumentPart" about="http://localhost:8000/res/rfc/6984#S2.">
  <span property="bibo:chapter" content="2."
        about="http://localhost:8000/res/rfc/6984#S2."/>
  <div class="subsection" property="dcterms:title" content=" Date, Location, and Participants"
       typeof="bibo:DocumentPart" about="http://localhost:8000/res/rfc/6984#S2.1.">
    <span property="bibo:chapter" content="2.1."
          about="http://localhost:8000/res/rfc/6984#S2.1."/>
    <pre>
The second ForCES interoperability test meeting was held by the
IETF ForCES Working Group on February 24-25, 2011...
    </pre>
  </div>
  <div class="subsection" property="dcterms:title" content=" Testbed Configuration"
       typeof="bibo:DocumentPart" about="http://localhost:8000/res/rfc/6984#S2.2.">
    <span property="bibo:chapter" content="2.2."
          about="http://localhost:8000/res/rfc/6984#S2.2."/>
    <div class="subsubsection" property="dcterms:title" content=" Participants' Access"
         typeof="bibo:DocumentPart" about="http://localhost:8000/res/rfc/6984#S2.2.1.">
      <span content="2.2.1." about="http://localhost:8000/res/rfc/6984#S2.2.1."
            property="bibo:chapter"/>
      <pre>
NTT and ZJSU were physically present for the testing at the
Internet Technology Lab (ITL) at Zhejiang Gongshang
University in China...
      </pre>
    </div>
  </div>
</div>
Note in particular that every section and subsection now has a defined URI (in the @about attribute). This will be useful later.
Handling citations in text¶
References/citations in RFC text are often of the form "are to be
interpreted as described in [RFC2119]" (for citations of other RFCs
as a whole), "as described in Section 7.1" (for citations of other
parts of the current document) or "Section 2.4 of [RFC2045] says"
(for citations of a specific part of another document). We can define
a simple grammar for these citations using pyparsing:
from pyparsing import Word, CaselessLiteral, nums

section_citation = (CaselessLiteral("section") +
                    Word(nums + ".").setResultsName("Sec")).setResultsName("SecRef")
rfc_citation = ("[RFC" +
                Word(nums).setResultsName("RFC") + "]").setResultsName("RFCRef")
section_rfc_citation = (section_citation + "of" +
                        rfc_citation).setResultsName("SecRFCRef")
The above productions have named results for different parts of the citation, i.e. a citation of the form "Section 2.4 of [RFC2045] says" will result in the named matches Sec = "2.4" and RFC = "2045". The CitationParser class can be used to extract these matches into a dict, which is then passed to a URI formatter function like:
def rfc_uriformatter(parts):
    uri = ""
    if 'RFC' in parts:
        uri += self.canonical_uri(parts['RFC'].lstrip("0"))
    if 'Sec' in parts:
        uri += "#S" + parts['Sec']
    return uri
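The formatter logic can be exercised on its own if we substitute canonical_uri with simple concatenation onto an assumed base URI (the base below matches the tutorial's localhost setup and is purely illustrative):

```python
def rfc_uriformatter_demo(parts, base="http://localhost:8000/res/rfc/"):
    # Same logic as the formatter above, but standalone:
    # canonical_uri is replaced by prefixing an assumed base URI.
    uri = ""
    if 'RFC' in parts:
        uri += base + parts['RFC'].lstrip("0")
    if 'Sec' in parts:
        uri += "#S" + parts['Sec']
    return uri

print(rfc_uriformatter_demo({'RFC': '2045', 'Sec': '2.4'}))
# http://localhost:8000/res/rfc/2045#S2.4
print(rfc_uriformatter_demo({'Sec': '7.1'}))  # #S7.1
```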
To initialize a citation parser and have it run over the entire structured text, finding citations and formatting them into URIs as we go along, just use:
from ferenda import CitationParser, URIFormatter

citparser = CitationParser(section_rfc_citation,
                           section_citation,
                           rfc_citation)
citparser.set_formatter(URIFormatter(("SecRFCRef", rfc_uriformatter),
                                     ("SecRef", rfc_uriformatter),
                                     ("RFCRef", rfc_uriformatter)))
citparser.parse_recursive(doc.body)
The result of these lines is that the following block of plain text:
<pre>
The behavior recommended in Section 2.5 is in line with generic error
treatment during the IKE_SA_INIT exchange, per Section 2.21.1 of
[RFC5996].
</pre>
…is transformed into this hyperlinked text:
<pre>
The behavior recommended in <a href="#S2.5"
rel="dcterms:references">Section 2.5</a> is in line with generic
error treatment during the IKE_SA_INIT exchange, per <a
href="http://localhost:8000/res/rfc/5996#S2.21.1"
rel="dcterms:references">Section 2.21.1 of [RFC5996]</a>.
</pre>
Note
The URI formatting function uses canonical_uri() to create the base URI for each external reference. Proper design of the URIs you’ll be using is a big topic, and you should think through what URIs you want to use for your documents and their parts. Ferenda provides a default implementation to create URIs from document properties, but you might want to override this.
The parse step is probably the part of your application which you’ll spend the most time developing. You can start simple (like above) and then incrementally improve the end result by processing more metadata, model the semantic document structure better, and handle in-line references in text more correctly. See also Building structured documents, Parsing document structure and Citation parsing.
Calling relate()¶
The purpose of the relate() method is to make sure that all document data and metadata are properly stored and indexed, so that they can be easily retrieved in later steps. This consists of three steps: loading all RDF metadata into a triplestore, loading all document content into a full text index, and making note of how documents refer to each other.
Since the output of parse consists of well-structured XHTML+RDFa documents that, on the surface level, do not differ much from docrepo to docrepo, you should not have to change anything about this step.
Note
You might want to configure whether to load everything into a fulltext index – this operation takes a lot of time, and the index is not even used if you are creating a static site. You do this by setting fulltextindex to False, either in ferenda.ini or on the command line:
./ferenda-build.py rfc relate --all --fulltextindex=False
Calling makeresources()¶
This method needs to run at some point before generate and the rest of the methods. Unlike the other methods described above and below, which are run for one docrepo at a time, this method is run for the project as a whole (which is why it is a function in ferenda.manager instead of a DocumentRepository method). It constructs a set of site-wide resources such as minified js and css files, and configuration for the site-wide XSLT template. It is easy to run using the command-line tool:
$ ./ferenda-build.py all makeresources
If you use the API, you need to provide a list of instances of the docrepos that you’re using, and the path to where generated resources should be stored:
from ferenda.manager import makeresources

config = {'datadir': 'mydata'}
myrepos = [RFCs(**config), W3C(**config)]
makeresources(myrepos, 'mydata/myresources')
Customizing generate()¶
The purpose of the generate() method is to create new browser-ready HTML files from the structured XHTML+RDFa files created by parse(). Unlike the files created by parse(), these files will contain site-branded headers, footers, navigation menus and such. They will also contain related content not directly found in the parsed files themselves: sectioned documents will have an automatically generated table of contents, and other documents that refer to a particular document will be listed in a sidebar in that document. If the references are made to individual sections, there will be sidebars for all such referenced sections.
The default implementation does this in two steps. In the first, prep_annotation_file() fetches metadata about other documents that relate to the document to be generated into an annotation file. In the second, Transformer runs an XSLT transformation on the source file (which sources the annotation file and a configuration file created by makeresources()) in order to create the browser-ready HTML file.
You should not need to override the general generate() method, but you might want to control how the annotation file and the XSLT transformation are done.
Getting annotations¶
The prep_annotation_file() step is driven by a SPARQL CONSTRUCT query. The default query fetches metadata about every other document that refers to the document (or sections thereof) you’re generating, using the dcterms:references predicate. By setting the class variable sparql_annotations to the file name of a SPARQL query file of your choice, you can override this query.
Since our metadata contains more specialized statements on how documents refer to each other, in the form of rfc:updates and rfc:obsoletes statements, we want a query that fetches this metadata as well. When we query for metadata about a particular document, we want to know if there is any other document that updates or obsoletes it. Using a CONSTRUCT query, we create rfc:isUpdatedBy and rfc:isObsoletedBy references to such documents. These queries are stored alongside the rest of the project in separate .rq files.
sparql_annotations = "rfc-annotations.rq"
The contents of the resource rfc-annotations.rq, which should be placed in a subdirectory named res in the current directory, should be:
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX bibo: <http://purl.org/ontology/bibo/>
PREFIX rfc: <http://example.org/ontology/rfc/>
CONSTRUCT {?s ?p ?o .
<%(uri)s> rfc:isObsoletedBy ?obsoleter .
<%(uri)s> rfc:isUpdatedBy ?updater .
<%(uri)s> dcterms:isReferencedBy ?referencer .
}
WHERE
{
# get all literal metadata where the document is the subject
{ ?s ?p ?o .
# FILTER(strstarts(str(?s), "%(uri)s"))
FILTER(?s = <%(uri)s> && !isUri(?o))
}
UNION
# get all metadata (except unrelated dcterms:references) about
# resources that dcterms:references the document or any of its
# sub-resources.
{ ?s dcterms:references+ <%(uri)s> ;
?p ?o .
BIND(?s as ?referencer)
FILTER(?p != dcterms:references || strstarts(str(?o), "%(uri)s"))
}
UNION
# get all metadata (except dcterms:references) about any resource that
# rfc:updates or rfc:obsoletes the document
{ ?s ?x <%(uri)s> ;
?p ?o .
FILTER(?x in (rfc:updates, rfc:obsoletes) && ?p != dcterms:references)
}
# finally, bind obsoleting and updating resources to new variables for
# use in the CONSTRUCT clause
UNION { ?obsoleter rfc:obsoletes <%(uri)s> . }
UNION { ?updater rfc:updates <%(uri)s> . }
}
Note that %(uri)s will be replaced with the URI of the document we’re querying about.
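That substitution is plain %-style string formatting. Assuming the query text has been read into a string, it amounts to (an illustrative fragment, not the full query from rfc-annotations.rq):

```python
query_template = """CONSTRUCT { <%(uri)s> ?p ?o . }
WHERE { <%(uri)s> ?p ?o . }"""

# Fill in the placeholder with the document's URI
query = query_template % {"uri": "http://localhost:8000/res/rfc/6021"}
print(query)
```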
Now, when querying the triplestore for metadata about RFC 6021, the (abbreviated) result is:
<graph xmlns:dcterms="http://purl.org/dc/terms/"
xmlns:rfc="http://example.org/ontology/rfc/"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<resource uri="http://localhost:8000/res/rfc/6021">
<rfc:isObsoletedBy ref="http://localhost:8000/res/rfc/6991"/>
<dcterms:published fmt="datatype">
<date xmlns="http://www.w3.org/2001/XMLSchema#">2010-10-01</date>
</dcterms:published>
<dcterms:title xml:lang="en">Common YANG Data Types</dcterms:title>
</resource>
<resource uri="http://localhost:8000/res/rfc/6991">
<a><rfc:RFC/></a>
<rfc:obsoletes ref="http://localhost:8000/res/rfc/6021"/>
<dcterms:published fmt="datatype">
<date xmlns="http://www.w3.org/2001/XMLSchema#">2013-07-01</date>
</dcterms:published>
<dcterms:title xml:lang="en">Common YANG Data Types</dcterms:title>
</resource>
</graph>
Note
You can find this file in data/rfc/annotations/6021.grit.xml. It’s in the Grit format for easy inclusion in XSLT processing.
Even if you’re not familiar with the format, or with RDF in general, you can see that it contains information about two resources: first the document we’ve queried about (RFC 6021), then the document that obsoletes the same document (RFC 6991).
Note
If you’re coming from a relational database/SQL background, it can be a little difficult to come to grips with graph databases and SPARQL. The book “Learning SPARQL” by Bob DuCharme is highly recommended.
Transforming to HTML¶
The Transformer step is driven by an XSLT stylesheet. The default stylesheet uses a site-wide configuration file (created by makeresources()) for things like site name and top-level navigation, and lists the document content, section by section, alongside other documents that contain references (in the form of dcterms:references) to each section. The SPARQL query and the XSLT stylesheet often go hand in hand – if your stylesheet needs a certain piece of data, the query must be adjusted to fetch it. By setting the class variable xslt_template in the same way as you did for the SPARQL query, you can override the default.
xslt_template = "rfc.xsl"
The contents of the resource rfc.xsl, which should be placed in a subdirectory named res in the current directory, should be:
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0"
xmlns:xhtml="http://www.w3.org/1999/xhtml"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:dcterms="http://purl.org/dc/terms/"
xmlns:rfc="http://example.org/ontology/rfc/"
xml:space="preserve"
exclude-result-prefixes="xhtml rdf">
<xsl:include href="base.xsl"/>
<!-- Implementations of templates called by base.xsl -->
<xsl:template name="headtitle"><xsl:value-of select="//xhtml:title"/> | <xsl:value-of select="$configuration/sitename"/></xsl:template>
<xsl:template name="metarobots"/>
<xsl:template name="linkalternate"/>
<xsl:template name="headmetadata"/>
<xsl:template name="bodyclass">rfc</xsl:template>
<xsl:template name="pagetitle">
<h1><xsl:value-of select="../xhtml:head/xhtml:title"/></h1>
</xsl:template>
<xsl:template match="xhtml:a"><a href="{@href}"><xsl:value-of select="."/></a></xsl:template>
<xsl:template match="xhtml:pre[1]">
<pre><xsl:apply-templates/>
</pre>
<xsl:if test="count(ancestor::*) = 2">
<xsl:call-template name="aside-annotations">
<xsl:with-param name="uri" select="../@about"/>
</xsl:call-template>
</xsl:if>
</xsl:template>
<!-- everything that has an @about attribute, i.e. _is_ something
(with a URI) gets a <section> with an <aside> for inbound links etc -->
<xsl:template match="xhtml:div[@about]">
<div class="section-wrapper" about="{@about}"><!-- needed? -->
<section id="{substring-after(@about,'#')}">
<xsl:variable name="sectionheading"><xsl:if test="xhtml:span[@property='bibo:chapter']/@content"><xsl:value-of select="xhtml:span[@property='bibo:chapter']/@content"/>. </xsl:if><xsl:value-of select="@content"/></xsl:variable>
<xsl:if test="count(ancestor::*) = 2">
<h2><xsl:value-of select="$sectionheading"/></h2>
</xsl:if>
<xsl:if test="count(ancestor::*) = 3">
<h3><xsl:value-of select="$sectionheading"/></h3>
</xsl:if>
<xsl:if test="count(ancestor::*) = 4">
<h4><xsl:value-of select="$sectionheading"/></h4>
</xsl:if>
<xsl:apply-templates select="*[not(@about)]"/>
</section>
<xsl:call-template name="aside-annotations">
<xsl:with-param name="uri" select="@about"/>
</xsl:call-template>
</div>
<xsl:apply-templates select="xhtml:div[@about]"/>
</xsl:template>
<!-- remove spans whose only purpose is to contain RDFa data -->
<xsl:template match="xhtml:span[@property and @content and not(text())]"/>
<!-- construct the side navigation -->
<xsl:template match="xhtml:div[@about]" mode="toc">
<li><a href="#{substring-after(@about,'#')}"><xsl:if test="xhtml:span/@content"><xsl:value-of select="xhtml:span[@property='bibo:chapter']/@content"/>. </xsl:if><xsl:value-of select="@content"/></a><xsl:if test="xhtml:div[@about]">
<ul><xsl:apply-templates mode="toc"/></ul>
</xsl:if></li>
</xsl:template>
<!-- named template called from other templates which match
xhtml:div[@about] and pre[1] above, and which creates -->
<xsl:template name="aside-annotations">
<xsl:param name="uri"/>
<xsl:if test="$annotations/resource[@uri=$uri]/dcterms:isReferencedBy">
<aside class="annotations">
<h2>References to <xsl:value-of select="$annotations/resource[@uri=$uri]/dcterms:identifier"/></h2>
<xsl:for-each select="$annotations/resource[@uri=$uri]/rfc:isObsoletedBy">
<xsl:variable name="referencing" select="@ref"/>
Obsoleted by
<a href="{@ref}">
<xsl:value-of select="$annotations/resource[@uri=$referencing]/dcterms:identifier"/>
</a><br/>
</xsl:for-each>
<xsl:for-each select="$annotations/resource[@uri=$uri]/rfc:isUpdatedBy">
<xsl:variable name="referencing" select="@ref"/>
Updated by
<a href="{@ref}">
<xsl:value-of select="$annotations/resource[@uri=$referencing]/dcterms:identifier"/>
</a><br/>
</xsl:for-each>
<xsl:for-each select="$annotations/resource[@uri=$uri]/dcterms:isReferencedBy">
<xsl:variable name="referencing" select="@ref"/>
Referenced by
<a href="{@ref}">
<xsl:value-of select="$annotations/resource[@uri=$referencing]/dcterms:identifier"/>
</a><br/>
</xsl:for-each>
</aside>
</xsl:if>
</xsl:template>
<!-- default template: translate everything from whatever namespace
it's in (usually the XHTML1.1 NS) into the default namespace
-->
<xsl:template match="*"><xsl:element name="{local-name(.)}"><xsl:apply-templates select="node()"/></xsl:element></xsl:template>
<!-- default template for toc handling: do nothing -->
<xsl:template match="@*|node()" mode="toc"/>
</xsl:stylesheet>
This XSLT stylesheet depends on base.xsl (which resides in ferenda/res/xsl in the source distribution of ferenda – take a look if you want to know how everything fits together). The main responsibility of this stylesheet is to format individual elements of the document body.
base.xsl takes care of the main chrome of the page, and it has a default implementation (that basically transforms everything from XHTML1.1 to HTML5, and removes some RDFa-only elements). It also loads and provides the annotation file in the global variable $annotations. The above XSLT stylesheet uses this to fetch information about referencing documents. In particular, when processing an older document, it lists whether later documents have updated or obsoleted it (see the named template aside-annotations).
You might notice that this XSLT template flattens the nested structure of sections that we spent so much effort to create in the parse step. This makes it easier to place the aside boxes next to each part of the document, independent of nesting level.
Note
While both the SPARQL query and the XSLT stylesheet might look complicated (and unless you're an RDF/XSL expert, they are…), most of the time you can get a good result using the default generic query and stylesheet.
Customizing toc()¶
The purpose of the toc() method is to create a set of pages that act as tables of contents for all documents in your docrepo. For large document collections there are often several different ways of creating such tables, e.g. sorted by title, publication date, document status, author and similar. The pages use the same site branding, headers, footers, navigation menus etc. used by generate().
The default implementation is generic enough to handle most cases, but you'll have to override other methods which it calls, primarily facets() and toc_item(). These methods depend on the metadata created by your parse implementation, but in the simplest cases it's enough to specify that you want one set of pages organized by the dcterms:title of each document (alphabetically sorted) and another by dcterms:issued (numerically/calendarically sorted). The default implementation does exactly this.
In our case, we wish to create four kinds of sorting: By identifier
(RFC number), by date of issue, by title and by category. These map
directly to four kinds of metadata that we’ve stored about each and
every document. By overriding
facets()
we can specify these four
facets, aspects of documents used for grouping and sorting.
def facets(self):
    from ferenda import Facet
    return [Facet(self.ns['dcterms'].title),
            Facet(self.ns['dcterms'].issued),
            Facet(self.ns['dcterms'].subject),
            Facet(self.ns['dcterms'].identifier)]
After running toc with this change, you can see that three sets of
index pages are created. By default, the dcterms:identifier
predicate isn’t used for the TOC pages, as it’s often derived from the
document title. Furthermore, you’ll get some error messages along the
lines of “Best Current Practice does not look like a valid URI”, which
is because the dcterms:subject
predicate normally should have URIs
as values, and we are using plain string literals.
We can fix both of these problems by customizing our facet objects a little. We specify that we wish to use dcterms:identifier as a TOC facet, and provide a simple method to group RFCs by their identifier in groups of 100, i.e. one page for RFC 1-99, another for RFC 100-199, and so on. We also specify that we expect our dcterms:subject values to be plain strings.
def facets(self):
    def select_rfcnum(row, binding, resource_graph):
        # "RFC 6998" -> "6900"
        return row[binding][4:-2] + "00"
    from ferenda import Facet
    return [Facet(self.ns['dcterms'].title),
            Facet(self.ns['dcterms'].issued),
            Facet(self.ns['dcterms'].subject,
                  selector=Facet.defaultselector,
                  identificator=Facet.defaultidentificator,
                  key=Facet.defaultselector),
            Facet(self.ns['dcterms'].identifier,
                  use_for_toc=True,
                  selector=select_rfcnum,
                  pagetitle="RFC %(selected)s00-%(selected)s99")]
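To make the grouping concrete, here is select_rfcnum run standalone on a few hypothetical result rows (the row dicts and binding name are made up for illustration):

```python
def select_rfcnum(row, binding, resource_graph):
    # "RFC 6998" -> "6900": strip the "RFC " prefix, keep the hundreds
    return row[binding][4:-2] + "00"

# Hypothetical rows of the kind the TOC machinery passes in
rows = [{"dcterms_identifier": "RFC 6998"},
        {"dcterms_identifier": "RFC 6021"},
        {"dcterms_identifier": "RFC 0793"}]
keys = [select_rfcnum(r, "dcterms_identifier", None) for r in rows]
print(keys)  # ['6900', '6000', '0700']
```

All documents whose identifiers select to the same key end up on the same TOC page.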
The above code gives some examples of how Facet objects can be configured. However, a Facet object does not control how each individual document is listed on a toc page. The default formatting just lists the title of the document, linked to the document in question. For RFCs, which are mainly referenced by their RFC number rather than their title, we'd like to add the RFC number to this display. This is done by overriding toc_item().
def toc_item(self, binding, row):
    from ferenda.elements import Link
    return [row['dcterms_identifier'] + ": ",
            Link(row['dcterms_title'],
                 uri=row['uri'])]
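To see the shape of the data involved, here is toc_item exercised standalone with a hypothetical result row and a simple stand-in for ferenda.elements.Link (the real class is serialized to a proper anchor element by the TOC machinery):

```python
class Link(str):
    """Stand-in for ferenda.elements.Link, for illustration only."""
    def __new__(cls, text, uri=None):
        obj = super().__new__(cls, text)
        obj.uri = uri
        return obj

def toc_item(binding, row):
    # Same body as the method above, minus self
    return [row['dcterms_identifier'] + ": ",
            Link(row['dcterms_title'], uri=row['uri'])]

# Hypothetical result row
row = {"dcterms_identifier": "RFC 6991",
       "dcterms_title": "Common YANG Data Types",
       "uri": "http://localhost:8000/res/rfc/6991"}
item = toc_item("dcterms_identifier", row)
print(item[0] + item[1])  # RFC 6991: Common YANG Data Types
```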
See also Customizing the table(s) of content and Grouping documents with facets.
Customizing news()¶
The purpose of news(), the next-to-final step, is to provide a set of news feeds for your document repository.
The default implementation gives you one single news feed for all documents in your docrepo, and creates both browser-ready HTML (using the same headers, footers, navigation menus etc. used by generate()) and Atom syndication format files.
The facets you've defined for your docrepo are re-used to create news feeds for e.g. all documents published by a particular entity, or all documents of a certain type. Only facet objects which have the use_for_feed property set to a truthy value are used to construct news feeds.
In this example, we adjust the facet based on dcterms:subject
so
that it can be used for newsfeed generation.
def facets(self):
    def select_rfcnum(row, binding, resource_graph):
        # "RFC 6998" -> "6900"
        return row[binding][4:-2] + "00"
    from ferenda import Facet
    return [Facet(self.ns['dcterms'].title),
            Facet(self.ns['dcterms'].issued),
            Facet(self.ns['dcterms'].subject,
                  selector=Facet.defaultselector,
                  identificator=Facet.defaultidentificator,
                  key=Facet.defaultselector,
                  use_for_feed=True),
            Facet(self.ns['dcterms'].identifier,
                  use_for_toc=True,
                  selector=select_rfcnum,
                  pagetitle="RFC %(selected)s00-%(selected)s99")]
When running news, this will create five different Atom feeds (which are mirrored as HTML pages) under data/rfc/news: one containing all documents, and four others that contain documents in a particular category (e.g. having a particular dcterms:subject value).
Note
As you can see, the resulting HTML pages are a little rough around the edges. Also, there isn’t currently any way of discovering the Atom feeds or HTML pages from the main site – you need to know the URLs. This will all be fixed in due time.
See also Customizing the news feeds.
Customizing frontpage()¶
Finally, frontpage() creates a front page for your entire site with content from the different docrepos. Each docrepo's frontpage_content() method will be called, and should return an XHTML fragment with information about the repository and its content. Below is a simple example that uses functionality we've used in other contexts to create a list of the five latest documents, as well as a total count of documents.
def frontpage_content(self, primary=False):
    from rdflib import URIRef, Graph
    from itertools import islice
    items = ""
    for entry in islice(self.news_entries(), 5):
        graph = Graph()
        with self.store.open_distilled(entry.basefile) as fp:
            graph.parse(data=fp.read())
        data = {'identifier': graph.value(URIRef(entry.id), self.ns['dcterms'].identifier).toPython(),
                'uri': entry.id,
                'title': entry.title}
        items += '<li>%(identifier)s <a href="%(uri)s">%(title)s</a></li>' % data
    return ("""<h2><a href="%(uri)s">Request for comments</a></h2>
<p>A complete archive of RFCs in Linked Data form. Contains %(doccount)s documents.</p>
<p>Latest 5 documents:</p>
<ul>
%(items)s
</ul>""" % {'uri': self.dataset_uri(),
            'items': items,
            'doccount': len(list(self.store.list_basefiles_for("_postgenerate")))})
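To get a feel for the output, the same %-interpolation can be run standalone with made-up sample data (the dataset URI, document URI and title below are hypothetical):

```python
# Build the list items the same way frontpage_content does,
# but from hard-coded sample data instead of news_entries()
items = ""
for identifier, uri, title in [
        ("RFC 6991", "http://localhost:8000/res/rfc/6991",
         "Common YANG Data Types")]:
    items += '<li>%s <a href="%s">%s</a></li>' % (identifier, uri, title)

fragment = """<h2><a href="%(uri)s">Request for comments</a></h2>
<p>A complete archive of RFCs in Linked Data form. Contains %(doccount)s documents.</p>
<p>Latest 5 documents:</p>
<ul>
%(items)s
</ul>""" % {"uri": "http://localhost:8000/dataset/rfc",  # hypothetical dataset URI
            "items": items,
            "doccount": 1}
print(fragment)
```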
Next steps¶
When you have written code and customized downloading, parsing and all the other steps, you'll want to run all these steps for all your docrepos in a single command by using the special value all for docrepo, and again all for action:
./ferenda-build.py all all
By now, you should have a basic idea about the key concepts of ferenda. In the next section, Key concepts, we’ll explore them further.