Creating your own document repositories¶
The next step is to do more substantial adjustments to the download/parse/generate cycle. As the source for our next docrepo we’ll use the collected RFCs, as published by IETF. These documents are mainly available in plain text format (formatted for printing on a line printer), as is the document index itself. This means that we cannot rely on the default implementation of download and parse. Furthermore, RFCs are categorized and refer to each other using varying semantics. This metadata can be captured, queried and used in a number of ways to present the RFC collection in a better way.
Writing your own download implementation¶
The purpose of the download() method is to fetch source documents from a remote source and store them locally, possibly under different filenames but otherwise bit-for-bit identical with how they were stored at the remote source (see File storage for more information about how and where files are stored locally).
The default implementation of download() uses a small number of methods and class variables to do the actual work. By selectively overriding these, you can often avoid rewriting a complete implementation of download().
A simple example¶
We’ll start out by creating a class similar to our W3C class in
First steps. All RFC documents are listed in the index file at
http://www.ietf.org/download/rfc-index.txt, while an individual
document (such as RFC 6725) is available at
http://tools.ietf.org/rfc/rfc6725.txt. Our first attempt will look
like this (save as rfcs.py):
import re
from datetime import datetime, date

import requests

from ferenda import DocumentRepository, TextReader
from ferenda import util
from ferenda.decorators import downloadmax

class RFCs(DocumentRepository):
    alias = "rfc"
    start_url = "http://www.ietf.org/download/rfc-index.txt"
    document_url_template = "http://tools.ietf.org/rfc/rfc%(basefile)s.txt"
    downloaded_suffix = ".txt"
And we’ll enable it and try to run it like before:
$ ./ferenda-build.py rfcs.RFCs enable
$ ./ferenda-build.py rfc download
This doesn’t work! That is because the start page contains no actual HTML
links – it’s a plain text file. We need to parse the index text file
to find all available basefiles. In order to do that, we must
override download().
def download(self):
    self.log.debug("download: Start at %s" % self.start_url)
    indextext = requests.get(self.start_url).text
    reader = TextReader(string=indextext)  # see TextReader class
    iterator = reader.getiterator(reader.readparagraph)
    if not isinstance(self.config.downloadmax, (int, type(None))):
        self.config.downloadmax = int(self.config.downloadmax)
    for basefile in self.download_get_basefiles(iterator):
        self.download_single(basefile)
@downloadmax
def download_get_basefiles(self, source):
    for p in reversed(list(source)):
        if re.match(r"^(\d{4}) ", p):     # looks like an RFC number
            if "Not Issued." not in p:    # skip RFCs known to not exist
                basefile = str(int(p[:4]))  # eg. '0822' -> '822'
                yield basefile
Since the RFC index is a plain text file, we use the TextReader class, which contains a bunch of functionality to make it easier to work with plain text files. In this case, we iterate through the file one paragraph at a time, and if the paragraph starts with a four-digit number (and the number hasn’t been marked “Not Issued.”) we download it by calling download_single().
Like the default implementation, we offload the main work to download_single(), which checks whether the file already exists on disk and, only if it doesn’t, attempts to download it. If the --refresh parameter is provided, a conditional GET is performed, and the document is re-downloaded only if the server says it has changed.
Note
In many cases, the URL for the downloaded document is not easily constructed from a basefile identifier. download_single() therefore takes an optional url argument. The above could be written more verbosely like:
url = "http://tools.ietf.org/rfc/rfc%s.txt" % basefile
self.download_single(basefile, url)
In other cases, a document to be downloaded may consist of several resources (e.g. an HTML document with images, or a PDF document containing the actual content combined with an HTML document containing the metadata). For these cases, you need to override download_single().
The main flow of the download process¶
The main flow is that the download() method itself does some source-specific setup, which often includes downloading some sort of index or search results page. The location of that index resource is given by the class variable start_url.
download() then calls download_get_basefiles(), which returns an iterator of basefiles.
For each basefile, download_single() is called. This method is responsible for downloading everything related to a single document. Most of the time this is just a single file, but it can occasionally be a set of files (like an HTML document with accompanying images, or a set of PDF files that conceptually form a single document).
The default implementation of download_single() assumes that a document is just a single file, and calculates the URL of that document by calling the remote_url() method.
The default remote_url() method uses the class variable document_url_template. This string template should use string formatting and expect a variable called basefile. The default implementation of remote_url() can, in other words, only be used if the URLs of the remote source are predictable and directly based on the basefile.
Note
In many cases, the URL for the remote version of a document cannot be calculated from the basefile alone, but is readily available from the main index page or search result page. For those cases, download_get_basefiles() should return an iterator that yields (basefile, url) tuples. The default implementation of download() handles this, and uses url as the second, optional argument to download_single().
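A sketch of what such a tuple-yielding implementation might look like. This is a standalone function for illustration (a real implementation would be a method taking self), and the regex and sample link data are made up:

```python
import re

def download_get_basefiles(source):
    # "source" yields (element, attribute, link, pos) tuples, the
    # shape produced by lxml.html.iterlinks
    for element, attribute, link, pos in source:
        m = re.search(r"rfc(\d+)\.txt$", link)
        if m:
            # yield (basefile, url) so that download() can pass the
            # url on to download_single()
            yield str(int(m.group(1))), link

links = [(None, "href", "http://tools.ietf.org/rfc/rfc0822.txt", 0)]
result = list(download_get_basefiles(links))
print(result)  # [('822', 'http://tools.ietf.org/rfc/rfc0822.txt')]
```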
Finally, the actual downloading of individual files is done by the download_if_needed() method. As the name implies, this method tries to avoid downloading anything from the network unless strictly needed. If a file is already in place, a conditional GET is done (using the timestamp of the file for an If-Modified-Since header, and an associated .etag file for an If-None-Match header). This avoids re-downloading the (potentially large) file if it hasn’t changed.
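The header-building half of that logic can be sketched as follows. This is an illustrative standalone function, not ferenda's actual download_if_needed implementation; the .etag sidecar file convention follows the description above:

```python
import email.utils
import os

def conditional_headers(localfile):
    # Build conditional-GET request headers: If-Modified-Since from
    # the local file's timestamp, If-None-Match from an associated
    # .etag file. An empty dict means a plain, unconditional GET.
    headers = {}
    if os.path.exists(localfile):
        mtime = os.path.getmtime(localfile)
        headers["If-Modified-Since"] = email.utils.formatdate(mtime, usegmt=True)
        etagfile = localfile + ".etag"
        if os.path.exists(etagfile):
            with open(etagfile) as fp:
                headers["If-None-Match"] = fp.read().strip()
    return headers
```

Usage would be along the lines of `resp = requests.get(url, headers=conditional_headers(path))`; a 304 response means the local copy is still current.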
To summarize: The main chain of calls looks something like this:
download
    start_url (class variable)
    download_get_basefiles (instance method) - iterator
    download_single (instance method)
        remote_url (instance method)
            document_url_template (class variable)
        download_if_needed (instance method)
These are the methods that you may override, and when you might want to do so:
method | default behaviour | override when
---|---|---
download | Downloads the contents of start_url and extracts all links using lxml.html.iterlinks, which are passed to download_get_basefiles. For each basefile returned, calls download_single. | Your documents are not all linked from a single index page (e.g. paged search results). In these cases, you should override download_get_basefiles as well, and make that method responsible for fetching all pages of search results.
download_get_basefiles | Iterates through the (element, attribute, link, url) tuples from the source and checks whether link matches basefile_regex or url matches document_url_regex. If so, yields a (text, url) tuple. | The basefile/url extraction is more complicated than what the basefile_regex / document_url_regex mechanism can achieve, or you have overridden download to pass something other than a link iterator. Note that your implementation must be an iterator, using the yield statement for each basefile found.
download_single | Calculates the URL of the document to download (or, if a URL is provided, uses that), and calls download_if_needed with it. Afterwards, updates the DocumentEntry of the document to reflect the source URL and download timestamps. | The complete contents of your document are spread over several files. In these cases, start with the main file and call download_if_needed for it, then calculate URLs and file paths (using the attachment parameter to store.downloaded_path) for each additional file, and call download_if_needed for each. Finally, update the DocumentEntry object.
remote_url | Calculates a URL from a basefile using document_url_template. | The rules for producing a URL from a basefile are more complicated than what string formatting can achieve.
download_if_needed | Downloads an individual URL to a local file. Makes sure the local file has the same timestamp as the Last-modified header from the server. If an older version of the file is present, it can either be archived (the default) or overwritten. | You really shouldn’t.
The optional basefile argument¶
During early stages of development, it’s often useful to download just a single document, both to verify that download_single works as it should, and to have sample documents for parse. When using the ferenda-build.py tool, the download command can take a single optional parameter, e.g.:
./ferenda-build.py rfc download 6725
If provided, this parameter is passed to the download method as the optional basefile parameter. The default implementation of download checks if this parameter is provided, and if so, simply calls download_single with that parameter, skipping the full download procedure. If you’re overriding download, you should support this usage by starting your implementation with something like this:
def download(self, basefile=None):
    if basefile:
        return self.download_single(basefile)
    # the rest of your code
The downloadmax() decorator¶
As we saw in Introduction to Ferenda, the built-in docrepos support a downloadmax configuration parameter. The effect of this parameter is simply to interrupt the downloading process after a certain number of documents have been downloaded. This can be useful when doing integration-type testing, or if you just want to make it easy for someone else to try out your docrepo class. The separation between the main download() method and the download_get_basefiles() helper method makes this easy – just add the @downloadmax decorator to the latter. This decorator reads the downloadmax configuration parameter (it also looks for a FERENDA_DOWNLOADMAX environment variable) and, if set, limits the number of basefiles returned by download_get_basefiles().
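Conceptually, such a decorator only needs to wrap the generator and slice it. A simplified sketch (reading only the environment variable, unlike the real decorator, which also consults self.config):

```python
import functools
import itertools
import os

def downloadmax_sketch(f):
    # Wrap a basefile-yielding generator and cut it off after
    # FERENDA_DOWNLOADMAX items, if that variable is set.
    @functools.wraps(f)
    def wrapper(*args, **kwargs):
        it = f(*args, **kwargs)
        maxcount = os.environ.get("FERENDA_DOWNLOADMAX")
        if maxcount:
            it = itertools.islice(it, int(maxcount))
        return it
    return wrapper

@downloadmax_sketch
def basefiles():
    yield from ("1", "2", "3", "4", "5")

os.environ["FERENDA_DOWNLOADMAX"] = "2"
print(list(basefiles()))  # ['1', '2']
```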
Writing your own parse implementation¶
The purpose of the parse() method is to take the downloaded file(s) for a particular document and parse them into a structured document with proper metadata, both for the document as a whole and for individual sections of the document.
# In order to properly handle our RDF data, we need to tell
# ferenda which namespaces we'll be using. These will be available
# as rdflib.Namespace objects in the self.ns dict, which means you
# can state that something is eg. a dcterms:title by using
# self.ns['dcterms'].title. See
# :py:data:`~ferenda.DocumentRepository.namespaces`
namespaces = ('rdf',     # always needed
              'dcterms', # title, identifier, etc
              'bibo',    # Standard and DocumentPart classes, chapter prop
              'xsd',     # datatypes
              'foaf',    # rfcs are foaf:Documents for now
              ('rfc', 'http://example.org/ontology/rfc/'))

from rdflib import Namespace
rdf_type = Namespace('http://example.org/ontology/rfc/').RFC
from ferenda.decorators import managedparsing

@managedparsing
def parse(self, doc):
    # some very simple heuristic rules for determining
    # what an individual paragraph is
    def is_heading(p):
        # If it's on a single line and it isn't indented with spaces
        # it's probably a heading.
        if p.count("\n") == 0 and not p.startswith(" "):
            return True

    def is_pagebreak(p):
        # if it contains a form feed character, it represents a page break
        return "\f" in p

    # Parsing a document consists mainly of two parts:
    # 1: First we parse the body of text and store it in doc.body
    from ferenda.elements import Body, Preformatted, Title, Heading
    from ferenda import Describer
    reader = TextReader(self.store.downloaded_path(doc.basefile))

    # First paragraph of an RFC is always a header block
    header = reader.readparagraph()
    # Preformatted is a ferenda.elements class representing a
    # block of preformatted text. It is derived from the built-in
    # list type, and must thus be initialized with an iterable, in
    # this case a single-element list of strings. (Note: if you
    # try to initialize it with a string, because strings are
    # iterables as well, you'll end up with a list where each
    # character in the string is an element, which is not what you
    # want.)
    preheader = Preformatted([header])
    # doc.body is a ferenda.elements.Body class, which is also
    # derived from list, so it has (amongst others) the append
    # method. We build our document by adding to this root
    # element.
    doc.body.append(preheader)

    # Second paragraph is always the title, and we don't include
    # this in the body of the document, since we'll add it to the
    # metadata -- once is enough
    title = reader.readparagraph()

    # After that, just iterate over the document and guess what
    # everything is. TextReader.getiterator is useful for
    # iterating through a text in other chunks than single lines
    for para in reader.getiterator(reader.readparagraph):
        if is_heading(para):
            # Heading is yet another of these ferenda.elements
            # classes.
            doc.body.append(Heading([para]))
        elif is_pagebreak(para):
            # Just drop these remnants of a page-and-paper-based past
            pass
        else:
            # If we don't know that it's something else, it's a
            # preformatted section (the safest bet for RFC text).
            doc.body.append(Preformatted([para]))

    # 2: Then we create metadata for the document and store it in
    # doc.meta (in this case using the convenience
    # ferenda.Describer class).
    desc = Describer(doc.meta, doc.uri)

    # Set the rdf:type of the document
    desc.rdftype(self.rdf_type)

    # Set the title we've captured as the dcterms:title of the
    # document and specify that it is in English
    desc.value(self.ns['dcterms'].title, util.normalize_space(title), lang="en")

    # Construct the dcterms:identifier (eg "RFC 6991") for this
    # document from the basefile
    desc.value(self.ns['dcterms'].identifier, "RFC " + doc.basefile)

    # find and convert the publication date in the header to a
    # datetime object, and set it as the dcterms:issued date for
    # the document
    re_date = re.compile(r"(January|February|March|April|May|June|July|August|September|October|November|December) (\d{4})").search
    dt_match = re_date(header)
    if dt_match:
        # util.c_locale() is a context manager that temporarily
        # sets the system locale to the "C" locale in order to be
        # able to use strptime with a string of the form "August
        # 2013", even though the system may use another locale.
        with util.c_locale():
            dt = datetime.strptime(dt_match.group(0), "%B %Y")
        pubdate = date(dt.year, dt.month, dt.day)
        # Note that using some python types (cf. datetime.date)
        # results in a datatyped RDF literal, ie in this case
        #   <http://localhost:8000/res/rfc/6994> dcterms:issued "2013-08-01"^^xsd:date
        desc.value(self.ns['dcterms'].issued, pubdate)

    # find any older RFCs that this document updates or obsoletes
    obsoletes = re.search(r"^Obsoletes: ([\d+, ]+)", header, re.MULTILINE)
    updates = re.search(r"^Updates: ([\d+, ]+)", header, re.MULTILINE)

    # Find the category of this RFC, store it as dcterms:subject
    cat_match = re.search(r"^Category: ([\w ]+?)( |$)", header, re.MULTILINE)
    if cat_match:
        desc.value(self.ns['dcterms'].subject, cat_match.group(1))

    for predicate, matches in ((self.ns['rfc'].updates, updates),
                               (self.ns['rfc'].obsoletes, obsoletes)):
        if matches is None:
            continue
        # add references between this document and these older
        # rfcs, using either rfc:updates or rfc:obsoletes
        for match in matches.group(1).strip().split(", "):
            uri = self.canonical_uri(match)
            # Note that this uses our own unofficial
            # namespace/vocabulary http://example.org/ontology/rfc/
            desc.rel(predicate, uri)

    # And now we're done. We don't need to return anything, as
    # we've modified the Document object that was passed to us.
    # The calling code will serialize this modified object to
    # XHTML and RDF and store it on disk
This implementation builds a very simple object model of an RFC document, which is serialized to an XHTML1.1+RDFa document by the managedparsing() decorator. If you run it (by calling ferenda-build.py rfc parse --all) after having downloaded the RFC documents, the result will be a set of documents in data/rfc/parsed, and a set of RDF files in data/rfc/distilled. Take a look at them! The above might appear to be a lot of code, but it also accomplishes much. Furthermore, it should be obvious how to extend it, for instance to create more metadata from the fields in the header (such as capturing the RFC category, the publishing party, the authors etc.) and a better semantic representation of the body (such as marking up regular paragraphs, line drawings, bulleted lists, definition lists, EBNF definitions and so on).
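For example, capturing the publishing organisation could be one more regex over the header block. This is a hedged sketch: the sample header is made up, real RFC headers vary, and the pattern below is only a starting point:

```python
import re

header = ("Internet Engineering Task Force (IETF)                S. Hartman\n"
          "Request for Comments: 6631                   Painless Security\n")

# The left-hand column of the first header line usually names the
# publishing stream/organisation: everything up to the first run of
# two or more spaces.
m = re.match(r"([A-Za-z ()]+?)\s{2,}", header)
org = m.group(1)
print(org)  # Internet Engineering Task Force (IETF)
```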
Next up, we’ll extend this implementation in two ways: first by representing the nested nature of the sections and subsections in the documents, and secondly by finding and linking citations/references, both to other parts of the text and to other RFCs.
Note
How does ./ferenda-build.py rfc parse --all work? It calls list_basefiles_for() with the argument parse, which lists all downloaded files and extracts the basefile for each of them, then calls parse for each in turn.
Handling document structure¶
The main text of an RFC is structured into sections, which may contain subsections, which in turn can contain subsubsections. The start of each section is easy to identify, which means we can build a model of this structure by extending our parse method with relatively few lines:
from ferenda.elements import Section, Subsection, Subsubsection

# More heuristic rules: Section headers start at the beginning
# of a line and are numbered. Subsections and subsubsections
# have dotted numbers, optionally with a trailing period, ie
# '9.2.' or '11.3.1'
def is_section(p):
    return re.match(r"\d+\.? +[A-Z]", p)

def is_subsection(p):
    return re.match(r"\d+\.\d+\.? +[A-Z]", p)

def is_subsubsection(p):
    return re.match(r"\d+\.\d+\.\d+\.? +[A-Z]", p)

def split_sectionheader(p):
    # returns a tuple of title, ordinal, identifier
    ordinal, title = p.split(" ", 1)
    ordinal = ordinal.strip(".")
    return title.strip(), ordinal, "RFC %s, section %s" % (doc.basefile, ordinal)

# Use a list as a simple stack to keep track of the nesting
# depth of a document. Every time we create a Section,
# Subsection or Subsubsection object, we push it onto the
# stack (and clear the stack down to the appropriate nesting
# depth). Every time we create some other object, we append it
# to whatever object is at the top of the stack. As your rules
# for representing the nesting of structure become more
# complicated, you might want to use the
# :class:`~ferenda.FSMParser` class, which lets you define
# heuristic rules (recognizers), states and transitions, and
# takes care of putting your structure together.
stack = [doc.body]

for para in reader.getiterator(reader.readparagraph):
    if is_section(para):
        title, ordinal, identifier = split_sectionheader(para)
        s = Section(title=title, ordinal=ordinal, identifier=identifier)
        stack[1:] = []      # clear all but bottom element
        stack[0].append(s)  # add new section to body
        stack.append(s)     # push new section on top of stack
    elif is_subsection(para):
        title, ordinal, identifier = split_sectionheader(para)
        s = Subsection(title=title, ordinal=ordinal, identifier=identifier)
        stack[2:] = []      # clear all but bottom two elements
        stack[1].append(s)  # add new subsection to current section
        stack.append(s)
    elif is_subsubsection(para):
        title, ordinal, identifier = split_sectionheader(para)
        s = Subsubsection(title=title, ordinal=ordinal, identifier=identifier)
        stack[3:] = []       # clear all but bottom three
        stack[-1].append(s)  # add new subsubsection to current subsection
        stack.append(s)
    elif is_heading(para):
        stack[-1].append(Heading([para]))
    elif is_pagebreak(para):
        pass
    else:
        pre = Preformatted([para])
        stack[-1].append(pre)
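The slice-assignment idiom that keeps the stack at the right depth is worth seeing in isolation:

```python
# stack[n:] = [] truncates the list in place to depth n, so after
# meeting a new Section we drop everything above the body, after a
# new Subsection everything above the current Section, and so on.
stack = ["body", "section", "subsection", "subsubsection"]
stack[2:] = []  # as when a new subsection is encountered
print(stack)  # ['body', 'section']
```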
This enhances parse so that instead of outputting a single long list of elements directly under body:
<h1>2. Overview</h1>
<h1>2.1. Date, Location, and Participants</h1>
<pre>
The second ForCES interoperability test meeting was held by the IETF
ForCES Working Group on February 24-25, 2011...
</pre>
<h1>2.2. Testbed Configuration</h1>
<h1>2.2.1. Participants' Access</h1>
<pre>
NTT and ZJSU were physically present for the testing at the Internet
Technology Lab (ITL) at Zhejiang Gongshang University in China.
</pre>
…we have a properly nested element structure, as well as much more metadata represented in RDFa form:
<div class="section" property="dcterms:title" content=" Overview"
     typeof="bibo:DocumentPart" about="http://localhost:8000/res/rfc/6984#S2.">
  <span property="bibo:chapter" content="2."
        about="http://localhost:8000/res/rfc/6984#S2."/>
  <div class="subsection" property="dcterms:title" content=" Date, Location, and Participants"
       typeof="bibo:DocumentPart" about="http://localhost:8000/res/rfc/6984#S2.1.">
    <span property="bibo:chapter" content="2.1."
          about="http://localhost:8000/res/rfc/6984#S2.1."/>
    <pre>
The second ForCES interoperability test meeting was held by the
IETF ForCES Working Group on February 24-25, 2011...
    </pre>
  </div>
  <div class="subsection" property="dcterms:title" content=" Testbed Configuration"
       typeof="bibo:DocumentPart" about="http://localhost:8000/res/rfc/6984#S2.2.">
    <span property="bibo:chapter" content="2.2."
          about="http://localhost:8000/res/rfc/6984#S2.2."/>
    <div class="subsubsection" property="dcterms:title" content=" Participants' Access"
         typeof="bibo:DocumentPart" about="http://localhost:8000/res/rfc/6984#S2.2.1.">
      <span content="2.2.1." about="http://localhost:8000/res/rfc/6984#S2.2.1."
            property="bibo:chapter"/>
      <pre>
NTT and ZJSU were physically present for the testing at the
Internet Technology Lab (ITL) at Zhejiang Gongshang
University in China...
      </pre>
    </div>
  </div>
</div>
Note in particular that every section and subsection now has a defined URI (in the @about attribute). This will be useful later.
Handling citations in text¶
References/citations in RFC text are often of the form "are to be
interpreted as described in [RFC2119]" (for citations of other RFCs
as a whole), "as described in Section 7.1" (for citations of other
parts of the current document) or "Section 2.4 of [RFC2045] says"
(for citations of a specific part of another document). We can define
a simple grammar for these citations using pyparsing:
from pyparsing import Word, CaselessLiteral, nums

section_citation = (CaselessLiteral("section") +
                    Word(nums + ".").setResultsName("Sec")).setResultsName("SecRef")
rfc_citation = ("[RFC" +
                Word(nums).setResultsName("RFC") + "]").setResultsName("RFCRef")
section_rfc_citation = (section_citation + "of" +
                        rfc_citation).setResultsName("SecRFCRef")
The above productions have named results for different parts of the citation, i.e. a citation of the form "Section 2.4 of [RFC2045] says" will result in the named matches Sec = "2.4" and RFC = "2045". The CitationParser class can be used to extract these matches into a dict, which is then passed to a URI formatter function like:
def rfc_uriformatter(parts):
    uri = ""
    if 'RFC' in parts:
        uri += self.canonical_uri(parts['RFC'].lstrip("0"))
    if 'Sec' in parts:
        uri += "#S" + parts['Sec']
    return uri
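The formatter logic can be exercised on its own if we substitute canonical_uri with simple concatenation onto an assumed base URI (the base below matches the tutorial's localhost setup and is purely illustrative):

```python
def rfc_uriformatter_demo(parts, base="http://localhost:8000/res/rfc/"):
    # Same logic as the formatter above, but standalone:
    # canonical_uri is replaced by prefixing an assumed base URI.
    uri = ""
    if 'RFC' in parts:
        uri += base + parts['RFC'].lstrip("0")
    if 'Sec' in parts:
        uri += "#S" + parts['Sec']
    return uri

print(rfc_uriformatter_demo({'RFC': '2045', 'Sec': '2.4'}))
# http://localhost:8000/res/rfc/2045#S2.4
print(rfc_uriformatter_demo({'Sec': '7.1'}))  # #S7.1
```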
To initialize a citation parser and have it run over the entire structured text, finding citations and formatting them into URIs as we go along, just use:
from ferenda import CitationParser, URIFormatter

citparser = CitationParser(section_rfc_citation,
                           section_citation,
                           rfc_citation)
citparser.set_formatter(URIFormatter(("SecRFCRef", rfc_uriformatter),
                                     ("SecRef", rfc_uriformatter),
                                     ("RFCRef", rfc_uriformatter)))
citparser.parse_recursive(doc.body)
The result of these lines is that the following block of plain text:
<pre>
The behavior recommended in Section 2.5 is in line with generic error
treatment during the IKE_SA_INIT exchange, per Section 2.21.1 of
[RFC5996].
</pre>
…is transformed into this hyperlinked text:
<pre>
The behavior recommended in <a href="#S2.5"
rel="dcterms:references">Section 2.5</a> is in line with generic
error treatment during the IKE_SA_INIT exchange, per <a
href="http://localhost:8000/res/rfc/5996#S2.21.1"
rel="dcterms:references">Section 2.21.1 of [RFC5996]</a>.
</pre>
Note
The URI formatting function uses canonical_uri() to create the base URI for each external reference. Proper design of the URIs you’ll be using is a big topic, and you should think through what URIs you want to use for your documents and their parts. Ferenda provides a default implementation to create URIs from document properties, but you might want to override this.
The parse step is probably the part of your application which you’ll spend the most time developing. You can start simple (like above) and then incrementally improve the end result by processing more metadata, model the semantic document structure better, and handle in-line references in text more correctly. See also Building structured documents, Parsing document structure and Citation parsing.
Calling relate()¶
The purpose of the relate() method is to make sure that all document data and metadata are properly stored and indexed, so that they can be easily retrieved in later steps. This consists of three steps: loading all RDF metadata into a triplestore, loading all document content into a full text index, and making note of how documents refer to each other.
Since the output of parse consists of well-structured XHTML+RDFa documents that, on the surface level, do not differ much from docrepo to docrepo, you should not have to change anything about this step.
Note
You might want to configure whether to load everything into a fulltext index – this operation takes a lot of time, and the index is not even used if you are creating a static site. You do this by setting fulltextindex to False, either in ferenda.ini or on the command line:
./ferenda-build.py rfc relate --all --fulltextindex=False
Calling makeresources()¶
This method needs to run at some point before generate and the rest of the methods. Unlike the other methods described above and below, which are run for one docrepo at a time, this method is run for the project as a whole (which is why it is a function in ferenda.manager instead of a DocumentRepository method). It constructs a set of site-wide resources such as minified js and css files, and configuration for the site-wide XSLT template. It is easy to run using the command-line tool:
$ ./ferenda-build.py all makeresources
If you use the API, you need to provide a list of instances of the docrepos that you’re using, and the path to where generated resources should be stored:
from ferenda.manager import makeresources

config = {'datadir': 'mydata'}
myrepos = [RFCs(**config), W3C(**config)]
makeresources(myrepos, 'mydata/myresources')
Customizing generate()¶
The purpose of the generate() method is to create new browser-ready HTML files from the structured XHTML+RDFa files created by parse(). Unlike the files created by parse(), these files will contain site-branded headers, footers, navigation menus and such. They will also contain related content not directly found in the parsed files themselves: sectioned documents will have an automatically generated table of contents, and other documents that refer to a particular document will be listed in a sidebar in that document. If the references are made to individual sections, there will be sidebars for all such referenced sections.
The default implementation does this in two steps. In the first, prep_annotation_file() fetches metadata about other documents that relate to the document to be generated into an annotation file. In the second, Transformer runs an XSLT transformation on the source file (which sources the annotation file and a configuration file created by makeresources()) in order to create the browser-ready HTML file.
You should not need to override the general generate() method, but you might want to control how the annotation file and the XSLT transformation are done.
Getting annotations¶
The prep_annotation_file() step is driven by a SPARQL CONSTRUCT query. The default query fetches metadata about every other document that refers to the document (or sections thereof) you’re generating, using the dcterms:references predicate. By setting the class variable sparql_annotations to the file name of a SPARQL query file of your choice, you can override this query.
Since our metadata contains more specialized statements on how documents refer to each other, in the form of rfc:updates and rfc:obsoletes statements, we want a query that fetches this metadata as well. When we query for metadata about a particular document, we want to know if there is any other document that updates or obsoletes it. Using a CONSTRUCT query, we create rfc:isUpdatedBy and rfc:isObsoletedBy references to such documents. These queries are stored alongside the rest of the project in separate .rq files.
sparql_annotations = "rfc-annotations.rq"
The contents of the resource rfc-annotations.rq, which should be placed in a subdirectory named res in the current directory, should be:
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX bibo: <http://purl.org/ontology/bibo/>
PREFIX rfc: <http://example.org/ontology/rfc/>
CONSTRUCT {?s ?p ?o .
<%(uri)s> rfc:isObsoletedBy ?obsoleter .
<%(uri)s> rfc:isUpdatedBy ?updater .
<%(uri)s> dcterms:isReferencedBy ?referencer .
}
WHERE
{
# get all literal metadata where the document is the subject
{ ?s ?p ?o .
# FILTER(strstarts(str(?s), "%(uri)s"))
FILTER(?s = <%(uri)s> && !isUri(?o))
}
UNION
# get all metadata (except unrelated dcterms:references) about
# resources that dcterms:references the document or any of its
# sub-resources.
{ ?s dcterms:references+ <%(uri)s> ;
?p ?o .
BIND(?s as ?referencer)
FILTER(?p != dcterms:references || strstarts(str(?o), "%(uri)s"))
}
UNION
# get all metadata (except dcterms:references) about any resource that
# rfc:updates or rfc:obsoletes the document
{ ?s ?x <%(uri)s> ;
?p ?o .
FILTER(?x in (rfc:updates, rfc:obsoletes) && ?p != dcterms:references)
}
# finally, bind obsoleting and updating resources to new variables for
# use in the CONSTRUCT clause
UNION { ?obsoleter rfc:obsoletes <%(uri)s> . }
UNION { ?updater rfc:updates <%(uri)s> . }
}
Note that %(uri)s will be replaced with the URI of the document we’re querying about.
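That substitution is plain %-style string formatting. Assuming the query text has been read into a string, it amounts to (an illustrative fragment, not the full query from rfc-annotations.rq):

```python
query_template = """CONSTRUCT { <%(uri)s> ?p ?o . }
WHERE { <%(uri)s> ?p ?o . }"""

# Fill in the placeholder with the document's URI
query = query_template % {"uri": "http://localhost:8000/res/rfc/6021"}
print(query)
```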
Now, when querying the triplestore for metadata about RFC 6021, the (abbreviated) result is:
<graph xmlns:dcterms="http://purl.org/dc/terms/"
xmlns:rfc="http://example.org/ontology/rfc/"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<resource uri="http://localhost:8000/res/rfc/6021">
<rfc:isObsoletedBy ref="http://localhost:8000/res/rfc/6991"/>
<dcterms:published fmt="datatype">
<date xmlns="http://www.w3.org/2001/XMLSchema#">2010-10-01</date>
</dcterms:published>
<dcterms:title xml:lang="en">Common YANG Data Types</dcterms:title>
</resource>
<resource uri="http://localhost:8000/res/rfc/6991">
<a><rfc:RFC/></a>
<rfc:obsoletes ref="http://localhost:8000/res/rfc/6021"/>
<dcterms:published fmt="datatype">
<date xmlns="http://www.w3.org/2001/XMLSchema#">2013-07-01</date>
</dcterms:published>
<dcterms:title xml:lang="en">Common YANG Data Types</dcterms:title>
</resource>
</graph>
Note
You can find this file in data/rfc/annotations/6021.grit.xml. It’s in the Grit format for easy inclusion in XSLT processing.
Even if you’re not familiar with the format, or with RDF in general, you can see that it contains information about two resources: first the document we’ve queried about (RFC 6021), then the document that obsoletes the same document (RFC 6991).
Note
If you’re coming from a relational database/SQL background, it can be a little difficult to come to grips with graph databases and SPARQL. The book “Learning SPARQL” by Bob DuCharme is highly recommended.
Transforming to HTML¶
The Transformer step is driven by an XSLT stylesheet. The default stylesheet uses a site-wide configuration file (created by makeresources()) for things like site name and top-level navigation, and lists the document content, section by section, alongside other documents that contain references (in the form of dcterms:references) to each section. The SPARQL query and the XSLT stylesheet often go hand in hand – if your stylesheet needs a certain piece of data, the query must be adjusted to fetch it. By setting the class variable xslt_template in the same way as you did for the SPARQL query, you can override the default.
xslt_template = "rfc.xsl"
The contents of the resource rfc.xsl, which should be placed in a subdirectory named res in the current directory, should be:
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0"
xmlns:xhtml="http://www.w3.org/1999/xhtml"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:dcterms="http://purl.org/dc/terms/"
xmlns:rfc="http://example.org/ontology/rfc/"
xml:space="preserve"
exclude-result-prefixes="xhtml rdf">
<xsl:include href="base.xsl"/>
<!-- Implementations of templates called by base.xsl -->
<xsl:template name="headtitle"><xsl:value-of select="//xhtml:title"/> | <xsl:value-of select="$configuration/sitename"/></xsl:template>
<xsl:template name="metarobots"/>
<xsl:template name="linkalternate"/>
<xsl:template name="headmetadata"/>
<xsl:template name="bodyclass">rfc</xsl:template>
<xsl:template name="pagetitle">
<h1><xsl:value-of select="../xhtml:head/xhtml:title"/></h1>
</xsl:template>
<xsl:template match="xhtml:a"><a href="{@href}"><xsl:value-of select="."/></a></xsl:template>
<xsl:template match="xhtml:pre[1]">
<pre><xsl:apply-templates/>
</pre>
<xsl:if test="count(ancestor::*) = 2">
<xsl:call-template name="aside-annotations">
<xsl:with-param name="uri" select="../@about"/>
</xsl:call-template>
</xsl:if>
</xsl:template>
<!-- everything that has an @about attribute, i.e. _is_ something
(with a URI) gets a <section> with an <aside> for inbound links etc -->
<xsl:template match="xhtml:div[@about]">
<div class="section-wrapper" about="{@about}"><!-- needed? -->
<section id="{substring-after(@about,'#')}">
<xsl:variable name="sectionheading"><xsl:if test="xhtml:span[@property='bibo:chapter']/@content"><xsl:value-of select="xhtml:span[@property='bibo:chapter']/@content"/>. </xsl:if><xsl:value-of select="@content"/></xsl:variable>
<xsl:if test="count(ancestor::*) = 2">
<h2><xsl:value-of select="$sectionheading"/></h2>
</xsl:if>
<xsl:if test="count(ancestor::*) = 3">
<h3><xsl:value-of select="$sectionheading"/></h3>
</xsl:if>
<xsl:if test="count(ancestor::*) = 4">
<h4><xsl:value-of select="$sectionheading"/></h4>
</xsl:if>
<xsl:apply-templates select="*[not(@about)]"/>
</section>
<xsl:call-template name="aside-annotations">
<xsl:with-param name="uri" select="@about"/>
</xsl:call-template>
</div>
<xsl:apply-templates select="xhtml:div[@about]"/>
</xsl:template>
<!-- remove spans whose only purpose is to contain RDFa data -->
<xsl:template match="xhtml:span[@property and @content and not(text())]"/>
<!-- construct the side navigation -->
<xsl:template match="xhtml:div[@about]" mode="toc">
<li><a href="#{substring-after(@about,'#')}"><xsl:if test="xhtml:span/@content"><xsl:value-of select="xhtml:span[@property='bibo:chapter']/@content"/>. </xsl:if><xsl:value-of select="@content"/></a><xsl:if test="xhtml:div[@about]">
<ul><xsl:apply-templates mode="toc"/></ul>
</xsl:if></li>
</xsl:template>
<!-- named template called from other templates which match
xhtml:div[@about] and pre[1] above, and which creates -->
<xsl:template name="aside-annotations">
<xsl:param name="uri"/>
<xsl:if test="$annotations/resource[@uri=$uri]/dcterms:isReferencedBy">
<aside class="annotations">
<h2>References to <xsl:value-of select="$annotations/resource[@uri=$uri]/dcterms:identifier"/></h2>
<xsl:for-each select="$annotations/resource[@uri=$uri]/rfc:isObsoletedBy">
<xsl:variable name="referencing" select="@ref"/>
Obsoleted by
<a href="{@ref}">
<xsl:value-of select="$annotations/resource[@uri=$referencing]/dcterms:identifier"/>
</a><br/>
</xsl:for-each>
<xsl:for-each select="$annotations/resource[@uri=$uri]/rfc:isUpdatedBy">
<xsl:variable name="referencing" select="@ref"/>
Updated by
<a href="{@ref}">
<xsl:value-of select="$annotations/resource[@uri=$referencing]/dcterms:identifier"/>
</a><br/>
</xsl:for-each>
<xsl:for-each select="$annotations/resource[@uri=$uri]/dcterms:isReferencedBy">
<xsl:variable name="referencing" select="@ref"/>
Referenced by
<a href="{@ref}">
<xsl:value-of select="$annotations/resource[@uri=$referencing]/dcterms:identifier"/>
</a><br/>
</xsl:for-each>
</aside>
</xsl:if>
</xsl:template>
<!-- default template: translate everything from whatever namespace
it's in (usually the XHTML1.1 NS) into the default namespace
-->
<xsl:template match="*"><xsl:element name="{local-name(.)}"><xsl:apply-templates select="node()"/></xsl:element></xsl:template>
<!-- default template for toc handling: do nothing -->
<xsl:template match="@*|node()" mode="toc"/>
</xsl:stylesheet>
This XSLT stylesheet depends on base.xsl (which resides in ferenda/res/xsl in the source distribution of ferenda – take a look if you want to know how everything fits together). The main responsibility of this stylesheet is to format individual elements of the document body.
base.xsl takes care of the main chrome of the page, and it has a default implementation (that basically transforms everything from XHTML1.1 to HTML5, and removes some RDFa-only elements). It also loads and provides the annotation file in the global variable $annotations. The above XSLT stylesheet uses this to fetch information about referencing documents. In particular, when processing an older document, it lists whether later documents have updated or obsoleted it (see the named template aside-annotations).
You might notice that this XSLT template flattens the nested structure of sections that we spent so much effort to create in the parse step. This makes it easier to place the aside boxes next to each part of the document, independent of nesting level.
Note
While both the SPARQL query and the XSLT stylesheet might look complicated (and unless you're an RDF/XSL expert, they are…), most of the time you can get a good result using the default generic query and stylesheet.
Customizing toc()¶
The purpose of the toc() method is to create a set of pages that act as tables of contents for all documents in your docrepo. For large document collections there are often several different ways of creating such tables, e.g. sorted by title, publication date, document status, author and similar. The pages use the same site branding, headers, footers, navigation menus etc. used by generate().
The default implementation is generic enough to handle most cases, but you'll have to override other methods which it calls, primarily facets() and toc_item(). These methods depend on the metadata created by your parse implementation, but in the simplest cases it's enough to specify that you want one set of pages organized by the dcterms:title of each document (alphabetically sorted) and another by dcterms:issued (numerically/calendarically sorted). The default implementation does exactly this.
In our case, we wish to create four kinds of sorting: By identifier
(RFC number), by date of issue, by title and by category. These map
directly to four kinds of metadata that we’ve stored about each and
every document. By overriding
facets()
we can specify these four
facets, aspects of documents used for grouping and sorting.
def facets(self):
    from ferenda import Facet
    return [Facet(self.ns['dcterms'].title),
            Facet(self.ns['dcterms'].issued),
            Facet(self.ns['dcterms'].subject),
            Facet(self.ns['dcterms'].identifier)]
After running toc with this change, you can see that three sets of
index pages are created. By default, the dcterms:identifier
predicate isn’t used for the TOC pages, as it’s often derived from the
document title. Furthermore, you’ll get some error messages along the
lines of “Best Current Practice does not look like a valid URI”, which
is because the dcterms:subject
predicate normally should have URIs
as values, and we are using plain string literals.
We can fix both of these problems by customizing our facet objects a little. We specify that we wish to use dcterms:identifier as a TOC facet, and provide a simple method to group RFCs by their identifier in groups of 100, i.e. one page for RFC 1-99, another for RFC 100-199, and so on. We also specify that we expect our dcterms:subject values to be plain strings.
def facets(self):
    def select_rfcnum(row, binding, resource_graph):
        # "RFC 6998" -> "6900"
        return row[binding][4:-2] + "00"
    from ferenda import Facet
    return [Facet(self.ns['dcterms'].title),
            Facet(self.ns['dcterms'].issued),
            Facet(self.ns['dcterms'].subject,
                  selector=Facet.defaultselector,
                  identificator=Facet.defaultidentificator,
                  key=Facet.defaultselector),
            Facet(self.ns['dcterms'].identifier,
                  use_for_toc=True,
                  selector=select_rfcnum,
                  pagetitle="RFC %(selected)s00-%(selected)s99")]
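To make the grouping concrete, here is select_rfcnum run standalone on a few hypothetical result rows (the row dicts and binding name are made up for illustration):

```python
def select_rfcnum(row, binding, resource_graph):
    # "RFC 6998" -> "6900": strip the "RFC " prefix, keep the hundreds
    return row[binding][4:-2] + "00"

# Hypothetical rows of the kind the TOC machinery passes in
rows = [{"dcterms_identifier": "RFC 6998"},
        {"dcterms_identifier": "RFC 6021"},
        {"dcterms_identifier": "RFC 0793"}]
keys = [select_rfcnum(r, "dcterms_identifier", None) for r in rows]
print(keys)  # ['6900', '6000', '0700']
```

All documents whose identifiers select to the same key end up on the same TOC page.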
The above code gives some examples of how Facet objects can be configured. However, a Facet object does not control how each individual document is listed on a toc page. The default formatting just lists the title of the document, linked to the document in question. For RFCs, which are mainly referenced by their RFC number rather than their title, we'd like to add the RFC number to this display. This is done by overriding toc_item().
def toc_item(self, binding, row):
    from ferenda.elements import Link
    return [row['dcterms_identifier'] + ": ",
            Link(row['dcterms_title'],
                 uri=row['uri'])]
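To see the shape of the data involved, here is toc_item exercised standalone with a hypothetical result row and a simple stand-in for ferenda.elements.Link (the real class is serialized to a proper anchor element by the TOC machinery):

```python
class Link(str):
    """Stand-in for ferenda.elements.Link, for illustration only."""
    def __new__(cls, text, uri=None):
        obj = super().__new__(cls, text)
        obj.uri = uri
        return obj

def toc_item(binding, row):
    # Same body as the method above, minus self
    return [row['dcterms_identifier'] + ": ",
            Link(row['dcterms_title'], uri=row['uri'])]

# Hypothetical result row
row = {"dcterms_identifier": "RFC 6991",
       "dcterms_title": "Common YANG Data Types",
       "uri": "http://localhost:8000/res/rfc/6991"}
item = toc_item("dcterms_identifier", row)
print(item[0] + item[1])  # RFC 6991: Common YANG Data Types
```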
See also Customizing the table(s) of content and Grouping documents with facets.
Customizing news()¶
The purpose of news(), the next-to-final step, is to provide a set of news feeds for your document repository.
The default implementation gives you one single news feed for all documents in your docrepo, and creates both browser-ready HTML (using the same headers, footers, navigation menus etc. used by generate()) and Atom syndication format files.
The facets you've defined for your docrepo are re-used to create news feeds for e.g. all documents published by a particular entity, or all documents of a certain type. Only facet objects which have the use_for_feed property set to a truthy value are used to construct news feeds.
In this example, we adjust the facet based on dcterms:subject
so
that it can be used for newsfeed generation.
def facets(self):
    def select_rfcnum(row, binding, resource_graph):
        # "RFC 6998" -> "6900"
        return row[binding][4:-2] + "00"
    from ferenda import Facet
    return [Facet(self.ns['dcterms'].title),
            Facet(self.ns['dcterms'].issued),
            Facet(self.ns['dcterms'].subject,
                  selector=Facet.defaultselector,
                  identificator=Facet.defaultidentificator,
                  key=Facet.defaultselector,
                  use_for_feed=True),
            Facet(self.ns['dcterms'].identifier,
                  use_for_toc=True,
                  selector=select_rfcnum,
                  pagetitle="RFC %(selected)s00-%(selected)s99")]
When running news, this will create five different Atom feeds (which are mirrored as HTML pages) under data/rfc/news: one containing all documents, and four others that contain documents in a particular category (e.g. having a particular dcterms:subject value).
Note
As you can see, the resulting HTML pages are a little rough around the edges. Also, there isn’t currently any way of discovering the Atom feeds or HTML pages from the main site – you need to know the URLs. This will all be fixed in due time.
See also Customizing the news feeds.
Customizing frontpage()¶
Finally, frontpage() creates a front page for your entire site with content from the different docrepos. Each docrepo's frontpage_content() method will be called, and should return an XHTML fragment with information about the repository and its content. Below is a simple example that uses functionality we've used in other contexts to create a list of the five latest documents, as well as a total count of documents.
def frontpage_content(self, primary=False):
    from rdflib import URIRef, Graph
    from itertools import islice
    items = ""
    for entry in islice(self.news_entries(), 5):
        graph = Graph()
        with self.store.open_distilled(entry.basefile) as fp:
            graph.parse(data=fp.read())
        data = {'identifier': graph.value(URIRef(entry.id), self.ns['dcterms'].identifier).toPython(),
                'uri': entry.id,
                'title': entry.title}
        items += '<li>%(identifier)s <a href="%(uri)s">%(title)s</a></li>' % data
    return ("""<h2><a href="%(uri)s">Request for comments</a></h2>
<p>A complete archive of RFCs in Linked Data form. Contains %(doccount)s documents.</p>
<p>Latest 5 documents:</p>
<ul>
%(items)s
</ul>""" % {'uri': self.dataset_uri(),
            'items': items,
            'doccount': len(list(self.store.list_basefiles_for("_postgenerate")))})
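To get a feel for the output, the same %-interpolation can be run standalone with made-up sample data (the dataset URI, document URI and title below are hypothetical):

```python
# Build the list items the same way frontpage_content does,
# but from hard-coded sample data instead of news_entries()
items = ""
for identifier, uri, title in [
        ("RFC 6991", "http://localhost:8000/res/rfc/6991",
         "Common YANG Data Types")]:
    items += '<li>%s <a href="%s">%s</a></li>' % (identifier, uri, title)

fragment = """<h2><a href="%(uri)s">Request for comments</a></h2>
<p>A complete archive of RFCs in Linked Data form. Contains %(doccount)s documents.</p>
<p>Latest 5 documents:</p>
<ul>
%(items)s
</ul>""" % {"uri": "http://localhost:8000/dataset/rfc",  # hypothetical dataset URI
            "items": items,
            "doccount": 1}
print(fragment)
```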
Next steps¶
When you have written code and customized downloading, parsing and all the other steps, you'll want to run all these steps for all your docrepos in a single command by using the special value all for docrepo, and again all for action:
./ferenda-build.py all all
By now, you should have a basic idea about the key concepts of ferenda. In the next section, Key concepts, we’ll explore them further.