First steps

Ferenda can be used in a project-like manner with a command-line tool (similar to how projects based on Django, Sphinx and Scrapy are used), or it can be used programmatically through a simple API. In this guide, we’ll primarily be using the command-line tool, and then show how to achieve the same thing using the API.

The first step is to create a project. Let’s make a simple website that contains published standards from W3C and IETF, called “netstandards”. Ferenda installs a system-wide command-line tool called ferenda-setup whose sole purpose is to create projects:

$ ferenda-setup netstandards
Prerequisites ok
Selected SQLITE as triplestore
Selected WHOOSH as search engine
Project created in netstandards
$ cd netstandards
$ ls
ferenda-build.py
ferenda.ini
wsgi.py

The three files created by ferenda-setup are another command-line tool (ferenda-build.py) used to manage the newly created project, a WSGI application (wsgi.py, see The WSGI app) and a configuration file (ferenda.ini). The default configuration file specifies most, but not all, of the available configuration parameters. See Configuration for a full list of the standard configuration parameters.
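For orientation, a freshly generated ferenda.ini looks roughly like the following. This is an illustration rather than a definitive listing — the exact parameters and values may differ between Ferenda versions — but storetype and indextype correspond to the choices reported by ferenda-setup above:

[__root__]
datadir = data
url = http://localhost:8000/
storetype = SQLITE
storelocation = data/ferenda.sqlite
indextype = WHOOSH
indexlocation = data/whooshindex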

Note

When using the API, you don’t create a project or deal with configuration files in the same way. Instead, your client code is responsible for keeping track of which docrepos to use, and providing configuration when calling their methods.
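For example, something like the following sketch passes configuration directly when instantiating a docrepo (using the W3CStandards class we’ll create below; datadir is one of the standard configuration parameters):

from w3cstandards import W3CStandards

# configuration that would otherwise live in ferenda.ini is passed
# as keyword arguments when instantiating the docrepo
repo = W3CStandards(datadir="data")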

Creating a Document repository class

Any document collection is handled by a DocumentRepository class (or docrepo for short), so our first task is to create a docrepo for W3C standards.

A docrepo class is responsible for downloading documents in a specific document collection. These classes can inherit from DocumentRepository, which, among other things, provides the download() method for this. Since the details of how documents are made available on the web differ greatly from collection to collection, you’ll often have to override the default implementation, but in this particular case, it suffices. The default implementation assumes that all documents are available from a single index page, and that the URLs of the documents follow a set pattern.

The W3C standards are set up just like that: All standards are available at http://www.w3.org/TR/tr-status-all. There are a lot of links to documents on that page, and not all of them are links to recommended standards. A simple way to find only the recommended standards is to see if the link follows the pattern http://www.w3.org/TR/<year>/REC-<standardid>-<date>.

Creating a docrepo that is able to download all web standards is then as simple as creating a subclass and setting three class properties. Create this class in the current directory (or anywhere else on your Python path) and save it as w3cstandards.py:

from ferenda import DocumentRepository

class W3CStandards(DocumentRepository):
    alias = "w3c"
    start_url = "http://www.w3.org/TR/tr-status-all"
    document_url_regex = r"http://www.w3.org/TR/(?P<year>\d{4})/REC-(?P<basefile>.*)-(?P<date>\d+)"

The first property, alias, is required for all docrepos and controls the alias used by the command-line tool for that docrepo, as well as the path where files are stored, among other things. If your project has a large collection of docrepos, it’s important that they all have unique aliases.

The other two properties are parameters which the default implementation of download() uses in order to find out which documents to download. start_url is a plain URL, while document_url_regex is a standard re regex with named groups. The group named basefile has special meaning, and will be used as a base for stored files and elsewhere as a short identifier for the document. For example, the web standard found at URL http://www.w3.org/TR/2012/REC-rdf-plain-literal-20121211/ will have the basefile rdf-plain-literal.
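To see exactly what the named groups capture, you can exercise the regex on its own with nothing but the standard re module:

import re

document_url_regex = r"http://www.w3.org/TR/(?P<year>\d{4})/REC-(?P<basefile>.*)-(?P<date>\d+)"
m = re.match(document_url_regex,
             "http://www.w3.org/TR/2012/REC-rdf-plain-literal-20121211/")
print(m.group("basefile"))  # rdf-plain-literal
print(m.group("year"))      # 2012
print(m.group("date"))      # 20121211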

Using ferenda-build.py and registering docrepo classes

The next step is to enable our class. Like most tasks, this is done using the command-line tool present in your project directory. To register the class (together with a short alias) in your ferenda.ini configuration file, run the following:

$ ./ferenda-build.py w3cstandards.W3CStandards enable
22:16:26 root INFO Enabled class w3cstandards.W3CStandards (alias 'w3c')

This creates a new section in ferenda.ini that just looks like the following:

[w3c]
class = w3cstandards.W3CStandards

From this point on, you can use the class name or the alias “w3c” interchangeably:

$ ./ferenda-build.py w3cstandards.W3CStandards status # verbose
22:16:27 root INFO w3cstandards.W3CStandards status finished in 0.010 sec
Status for document repository 'w3c' (w3cstandards.W3CStandards)
 download: None.
 parse: None.
 generated: None.

$ ./ferenda-build.py w3c status # terse, exactly the same result

Note

When using the API, there is no need (nor possibility) to register docrepo classes. Your client code directly instantiates the class(es) it uses and calls methods on them.

Downloading

To test the downloading capabilities of our class, you can run the download command from the command-line tool:

$ ./ferenda-build.py w3c download 
22:16:31 w3c INFO Downloading max 3 documents
22:16:32 w3c INFO emotionml: downloaded from http://www.w3.org/TR/2014/REC-emotionml-20140522/
22:16:33 w3c INFO MathML3: downloaded from http://www.w3.org/TR/2014/REC-MathML3-20140410/
22:16:33 w3c INFO xml-entity-names: downloaded from http://www.w3.org/TR/2014/REC-xml-entity-names-20140410/
# and so on...

After a few minutes of downloading, the result is a bunch of files in data/w3c/downloaded:

$ ls -1 data/w3c/downloaded
MathML3.html
MathML3.html.etag
emotionml.html
emotionml.html.etag
xml-entity-names.html
xml-entity-names.html.etag

Note

The .etag files are created in order to support Conditional GET, so that we don’t waste our time or remote server bandwidth by re-downloading documents that haven’t changed. They can be ignored and might go away in future versions of Ferenda.
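Conditional GET itself is plain HTTP; the following sketch using the requests library illustrates the mechanism (not Ferenda’s internal implementation):

import requests

url = "http://www.w3.org/TR/tr-status-all"
resp = requests.get(url)
etag = resp.headers.get("ETag")  # this is what the .etag file stores,
                                 # assuming the server sends an ETag header

# on a later run, send the stored ETag back to the server;
# a 304 response means the document hasn't changed
resp = requests.get(url, headers={"If-None-Match": etag})
if resp.status_code == 304:
    print("unchanged, no need to re-download")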

We can get an overview of the status of our docrepo using the status command:

$ ./ferenda-build.py w3c status
Status for document repository 'w3c' (w3cstandards.W3CStandards)
 download: xml-entity-names, rdfa-core, emotionml... (1 more)
 parse: None. Todo: xml-entity-names, rdfa-core, emotionml... (1 more)
 generated: None.

Note

To do the same using the API:

from w3cstandards import W3CStandards
repo = W3CStandards()
repo.download()  
repo.status()
# or use repo.get_status() to get all status information in a nested dict

Finally, if the logging information scrolls by too quickly and you want to read it again, take a look in the data/logs directory. Each invocation of ferenda-build.py creates a new log file containing the same information that is written to stdout.

Parsing

Let’s try the next step in the workflow: parsing one of the documents we’ve downloaded.

$ ./ferenda-build.py w3c parse rdfa-core
22:16:45 w3c INFO rdfa-core: parse OK (4.863 sec)
22:16:45 root INFO w3c parse finished in 4.935 sec

By now, you might have realized that our command-line tool is generally invoked in the following manner:

$ ./ferenda-build.py <docrepo> <command> [argument(s)]

The parse command resulted in one new file being created in data/w3c/parsed:

$ ls -1 data/w3c/parsed
rdfa-core.xhtml

And we can again use the status command to get a comprehensive overview of our document repository.

$ ./ferenda-build.py w3c status
22:16:47 root INFO w3c status finished in 0.032 sec
Status for document repository 'w3c' (w3cstandards.W3CStandards)
 download: xml-entity-names, rdfa-core, emotionml... (1 more)
 parse: rdfa-core. Todo: xml-entity-names, emotionml, MathML3.
 generated: None. Todo: rdfa-core.

Note that by default, subsequent invocations of parse won’t actually parse documents that don’t need parsing.

$ ./ferenda-build.py w3c parse rdfa-core
22:16:50 root INFO w3c parse finished in 0.019 sec

But during development, when you change the parsing code frequently, you’ll need to override this through the --force flag (or set the force parameter in ferenda.ini).

$ ./ferenda-build.py w3c parse rdfa-core --force
22:16:56 w3c INFO rdfa-core: parse OK (5.123 sec)
22:16:56 root INFO w3c parse finished in 5.166 sec

Note

To do the same using the API:

from w3cstandards import W3CStandards
repo = W3CStandards(force=True)
repo.parse("rdfa-core")

Note also that you can parse all downloaded documents with the --all flag, and control logging verbosity with the --loglevel flag.

$ ./ferenda-build.py w3c parse --all --loglevel=DEBUG
22:16:59 w3c DEBUG xml-entity-names: Starting
22:16:59 w3c DEBUG xml-entity-names: Created data/w3c/parsed/xml-entity-names.xhtml
22:17:00 w3c DEBUG xml-entity-names: 6 triples extracted to data/w3c/distilled/xml-entity-names.rdf
22:17:00 w3c INFO xml-entity-names: parse OK (0.717 sec)
22:17:00 w3c DEBUG emotionml: Starting
22:17:00 w3c DEBUG emotionml: Created data/w3c/parsed/emotionml.xhtml
22:17:01 w3c DEBUG emotionml: 11 triples extracted to data/w3c/distilled/emotionml.rdf
22:17:01 w3c INFO emotionml: parse OK (1.174 sec)
22:17:01 w3c DEBUG MathML3: Starting
22:17:01 w3c DEBUG MathML3: Created data/w3c/parsed/MathML3.xhtml
22:17:01 w3c DEBUG MathML3: 8 triples extracted to data/w3c/distilled/MathML3.rdf
22:17:01 w3c INFO MathML3: parse OK (0.332 sec)
22:17:01 root INFO w3c parse finished in 2.247 sec

Note

To do the same using the API:

import logging
from w3cstandards import W3CStandards
# client code is responsible for setting the effective log level -- ferenda
# just emits log messages, and depends on the caller to set up the logging
# subsystem in an appropriate way
logging.getLogger().setLevel(logging.INFO)
repo = W3CStandards()
for basefile in repo.store.list_basefiles_for("parse"):
    # you might want to wrap this in a try/except block to catch
    # ferenda.errors.ParseError or any of its subclasses
    repo.parse(basefile)

Note that the API makes you explicitly list and iterate over any available files. This is so that client code has the opportunity to parallelize this work in an appropriate way.
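For example, here is a minimal sketch that parallelizes parsing with a thread pool (assuming a single repo instance can safely be shared between threads; if that turns out not to hold, instantiate one repo per worker instead):

from concurrent.futures import ThreadPoolExecutor

from ferenda.errors import ParseError
from w3cstandards import W3CStandards

repo = W3CStandards()

def parse_one(basefile):
    try:
        repo.parse(basefile)
    except ParseError as e:
        print("%s failed: %s" % (basefile, e))

with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(parse_one, repo.store.list_basefiles_for("parse")))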

If we take a look at the files created in data/w3c/distilled, we see some metadata for each document. This metadata has been automatically extracted from RDFa statements in the XHTML documents, but is so far very spartan.

Now take a look at the files created in data/w3c/parsed. The default implementation of parse() processes the DOM of the main body of the document, but some tags and attributes that are used only for formatting are stripped, such as <style> and <script>.

These documents have quite a lot of “boilerplate” text, such as tables of contents and links to latest and previous versions, which we’d like to remove so that just the actual text is left (problem 1). And we’d like to explicitly extract some parts of the document and represent these as metadata for the document – for example the title, the publication date, the authors/editors of the document and its abstract, if available (problem 2).

Just like the default implementation of download() allowed for some customization using class variables, we can solve problem 1 by setting two additional class variables:

    parse_content_selector="body"
    parse_filter_selectors=["div.toc", "div.head"]

The parse_content_selector member specifies, using CSS selector syntax, the part of the document which contains our main text. It defaults to "body", and can often be set to ".content" (the first element that has a class="content" attribute), "#main-text" (any element with the id "main-text"), "article" (the first <article> element) or similar. The parse_filter_selectors is a list of similar selectors, with the difference that all matching elements are removed from the tree. In this case, we use it to remove some boilerplate sections that often appear within the content specified by parse_content_selector, but which we don’t want to appear in the final result.
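With these two additions, the complete docrepo class now looks like this:

from ferenda import DocumentRepository

class W3CStandards(DocumentRepository):
    alias = "w3c"
    start_url = "http://www.w3.org/TR/tr-status-all"
    document_url_regex = r"http://www.w3.org/TR/(?P<year>\d{4})/REC-(?P<basefile>.*)-(?P<date>\d+)"
    parse_content_selector = "body"
    parse_filter_selectors = ["div.toc", "div.head"]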

In order to solve problem 2, we can override one of the methods that the default implementation of parse() calls:

    def parse_metadata_from_soup(self, soup, doc):
        from rdflib import Namespace
        from ferenda import Describer
        from ferenda import util
        import re
        DCTERMS = Namespace("http://purl.org/dc/terms/")
        FOAF = Namespace("http://xmlns.com/foaf/0.1/")
        d = Describer(doc.meta, doc.uri)
        d.rdftype(FOAF.Document)
        d.value(DCTERMS.title, soup.find("title").text, lang=doc.lang)
        d.value(DCTERMS.abstract, soup.find(True, "abstract"), lang=doc.lang)
        # find the issued date -- assume it's the first thing that looks
        # like a date of the form "22 August 2013"
        re_date = re.compile(r'(\d+ \w+ \d{4})')
        datenode = soup.find(text=re_date)
        datestr = re_date.search(datenode).group(1)
        d.value(DCTERMS.issued, util.strptime(datestr, "%d %B %Y"))
        editors = soup.find("dt", text=re.compile("Editors?:"))
        for editor in editors.find_next_siblings("dd"):
            editor_name = editor.text.strip().split(", ")[0]
            d.value(DCTERMS.editor, editor_name)

parse_metadata_from_soup() is called with a document object and the parsed HTML document in the form of a BeautifulSoup object. It is the responsibility of parse_metadata_from_soup() to add document-level metadata for this document, such as its title, publication date, and similar. Note that parse_metadata_from_soup() is run before the parse_content_selector and parse_filter_selectors are applied, so the BeautifulSoup object passed into it contains the entire document.

Note

The selectors are passed to BeautifulSoup.select(), which supports a subset of the CSS selector syntax. If you stick with simple tag, id and class-based selectors you should be fine.
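A quick illustration, runnable on its own, of how select() behaves with the selectors used above:

from bs4 import BeautifulSoup

html = '<body><div class="toc">Table of contents</div><p>Actual content</p></body>'
soup = BeautifulSoup(html, "html.parser")
print(soup.select("div.toc"))  # [<div class="toc">Table of contents</div>]
print(soup.select("body"))     # the whole body, including the div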

Now, if you run parse --force again, both documents and metadata are in better shape. Further down the line the value of properly extracted metadata will become more obvious.

Republishing the parsed content

The parsed XHTML documents contain metadata in RDFa format, which means you can extract all that metadata and put it into a triple store. The relate command does this, as well as creating a full text index of all textual content:

$ ./ferenda-build.py w3c relate --all
22:17:03 w3c INFO xml-entity-names: relate OK (0.618 sec)
22:17:04 w3c INFO rdfa-core: relate OK (1.542 sec)
22:17:06 w3c INFO emotionml: relate OK (1.647 sec)
22:17:08 w3c INFO MathML3: relate OK (1.604 sec)
22:17:08 w3c INFO Dumped 34 triples from context http://localhost:8000/dataset/w3c to data/w3c/distilled/dump.nt (0.007 sec)
22:17:08 root INFO w3c relate finished in 5.555 sec
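If you want to inspect the extracted metadata outside of Ferenda, the N-Triples dump can be loaded with plain rdflib (a sketch, not part of the Ferenda API):

import rdflib

DCTERMS = rdflib.Namespace("http://purl.org/dc/terms/")
g = rdflib.Graph()
g.parse("data/w3c/distilled/dump.nt", format="nt")
for doc, title in g.subject_objects(DCTERMS.title):
    print(doc, title)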

The next step is to create a number of resource files (placed under data/rsrc). These resource files include CSS and JavaScript files for the new website we’re creating, as well as an XML configuration file used by the XSLT transformation done by generate below:

$ ./ferenda-build.py w3c makeresources
22:17:08 ferenda.resources INFO Wrote data/rsrc/resources.xml
$ find data/rsrc -print
data/rsrc
data/rsrc/api
data/rsrc/api/common.json
data/rsrc/api/context.json
data/rsrc/api/terms.json
data/rsrc/css
data/rsrc/css/ferenda.css
data/rsrc/css/main.css
data/rsrc/css/normalize-1.1.3.css
data/rsrc/img
data/rsrc/img/navmenu-small-black.png
data/rsrc/img/navmenu.png
data/rsrc/img/search.png
data/rsrc/js
data/rsrc/js/ferenda.js
data/rsrc/js/jquery-1.10.2.js
data/rsrc/js/modernizr-2.6.3.js
data/rsrc/js/respond-1.3.0.js
data/rsrc/resources.xml

Note

It is possible to combine and minify both javascript and css files using the combineresources option in the configuration file.
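For example, assuming the global [__root__] section name used by the generated configuration file, this would be a one-line change in ferenda.ini:

[__root__]
combineresources = True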

Running makeresources is needed for the final few steps.

$ ./ferenda-build.py w3c generate --all
22:17:14 w3c INFO xml-entity-names: generate OK (1.728 sec)
22:17:14 w3c INFO rdfa-core: generate OK (0.242 sec)
22:17:14 w3c INFO emotionml: generate OK (0.336 sec)
22:17:14 w3c INFO MathML3: generate OK (0.216 sec)
22:17:14 root INFO w3c generate finished in 2.535 sec

The generate command creates browser-ready HTML5 documents from our structured XHTML documents, using our site’s navigation.

$ ./ferenda-build.py w3c toc
22:17:17 w3c INFO Created data/w3c/toc/dcterms_issued/2014.html
22:17:17 w3c INFO Created data/w3c/toc/dcterms_title/m.html
22:17:17 w3c INFO Created data/w3c/toc/dcterms_title/r.html
22:17:17 w3c INFO Created data/w3c/toc/dcterms_title/x.html
22:17:18 w3c INFO Created data/w3c/toc/index.html
22:17:18 root INFO w3c toc finished in 2.059 sec
$ ./ferenda-build.py w3c news
22:17:19 w3c INFO feed type/document: 4 entries
22:17:19 w3c INFO feed main: 4 entries
22:17:19 root INFO w3c news finished in 0.115 sec
$ ./ferenda-build.py w3c frontpage
22:17:21 root INFO frontpage: wrote data/index.html (0.112 sec)

The toc and news commands create static files for general indexes/tables of contents of all documents in our docrepo, as well as Atom feeds, and the frontpage command creates a suitable frontpage for the site as a whole.

Note

To do all of the above using the API:

from ferenda import manager
from w3cstandards import W3CStandards
repo = W3CStandards()
for basefile in repo.store.list_basefiles_for("relate"):
    repo.relate(basefile)
manager.makeresources([repo], sitename="Standards", sitedescription="W3C standards, in a new form")
for basefile in repo.store.list_basefiles_for("generate"):
    repo.generate(basefile)
repo.toc()
repo.news()
manager.frontpage([repo])

Finally, to start a development web server and check out the finished result:

$ ./ferenda-build.py w3c runserver
$ open http://localhost:8080/

Now you’ve created your own web site with structured documents. It contains listings of all documents, feeds with updated documents (in both HTML and Atom flavors), full text search, and an API. In order to deploy your site, you can run it under Apache+mod_wsgi, nginx+uWSGI, Gunicorn or just about any WSGI-capable web server, see The WSGI app.

Note

Using runserver() from the API does not really make any sense. If your environment supports running WSGI applications, see the above link for information about how to get the ferenda WSGI application. Otherwise, the app can be run by any standard WSGI host.
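For example, assuming the generated wsgi.py exposes a module-level application object (the usual WSGI convention), something like the following should work with Gunicorn:

$ gunicorn wsgi:application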

To keep it up-to-date whenever the W3C issues new standards, use the following command:

$ ./ferenda-build.py w3c all
22:17:25 w3c INFO Downloading max 3 documents
22:17:25 root INFO w3cstandards.W3CStandards download finished in 2.648 sec
22:17:25 root INFO w3cstandards.W3CStandards parse finished in 0.019 sec
22:17:25 root INFO w3cstandards.W3CStandards relate: Nothing to do!
22:17:25 root INFO w3cstandards.W3CStandards relate finished in 0.025 sec
22:17:25 ferenda.resources INFO Wrote data/rsrc/resources.xml
22:17:29 root INFO w3cstandards.W3CStandards generate finished in 0.006 sec
22:17:32 root INFO w3cstandards.W3CStandards toc finished in 3.376 sec
22:17:34 w3c INFO feed type/document: 4 entries
22:17:32 w3c INFO feed main: 4 entries
22:17:32 root INFO w3cstandards.W3CStandards news finished in 0.063 sec
22:17:32 root INFO frontpage: wrote data/index.html (0.017 sec)

The “all” command is an alias that runs download, parse --all, relate --all, makeresources, generate --all, toc and news in sequence.

Note

The API doesn’t have any corresponding method. Just run all of the above code again. As long as you don’t pass the force=True parameter when creating the docrepo instance, Ferenda’s dependency management should make sure that documents aren’t needlessly re-parsed, etc.

This 20-line docrepo example took a lot of shortcuts by depending on the default implementation of the download() and parse() methods. Ferenda tries to make it really easy to get something up and running quickly, and then to improve each step incrementally.

In the next section Creating your own document repositories we will take a closer look at each of the six main steps (download, parse, relate, generate, toc and news), including how to completely replace the built-in methods. You can also take a look at the source code for ferenda.sources.tech.W3Standards, which contains a more complete (and substantially longer) implementation of download(), parse() and the others.