First steps¶
Ferenda can be used in a project-like manner with a command-line tool (similar to how projects based on Django, Sphinx and Scrapy are used), or it can be used programmatically through a simple API. In this guide, we’ll primarily be using the command-line tool, and then show how to achieve the same thing using the API.
The first step is to create a project. Let’s make a simple website that contains published standards from W3C and IETF, called “netstandards”. Ferenda installs a system-wide command-line tool called ferenda-setup whose sole purpose is to create projects:
$ ferenda-setup netstandards
Prerequisites ok
Selected SQLITE as triplestore
Selected WHOOSH as search engine
Project created in netstandards
$ cd netstandards
$ ls
ferenda-build.py
ferenda.ini
wsgi.py
ferenda-setup creates three files: another command-line tool (ferenda-build.py) used to manage the newly created project, a WSGI application (wsgi.py, see The WSGI app) and a configuration file (ferenda.ini). The default configuration file specifies most, but not all, of the available configuration parameters. See Configuration for a full list of the standard configuration parameters.
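As a rough illustration, a ferenda.ini for this project might look something like the sketch below. This is a hypothetical example, not the exact file ferenda-setup writes: the section name [__root__] and the datadir and sitename parameters are documented configuration options, but the actual generated defaults may differ, so check your own file and the Configuration chapter.

```ini
; hypothetical sketch -- your generated ferenda.ini may differ
[__root__]
datadir = data
sitename = netstandards
```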
Note
When using the API, you don’t create a project or deal with configuration files in the same way. Instead, your client code is responsible for keeping track of which docrepos to use, and providing configuration when calling their methods.
Creating a Document repository class¶
Any document collection is handled by a DocumentRepository class (or docrepo for short), so our first task is to create a docrepo for W3C standards.
A docrepo class is responsible for downloading documents in a specific document collection. These classes can inherit from DocumentRepository, which, among other things, provides the download() method for this. Since the details of how documents are made available on the web differ greatly from collection to collection, you’ll often have to override the default implementation, but in this particular case, it suffices. The default implementation assumes that all documents are available from a single index page, and that the URLs of the documents follow a set pattern.
The W3C standards are set up just like that: all standards are available at http://www.w3.org/TR/tr-status-all. There are a lot of links to documents on that page, and not all of them are links to recommended standards. A simple way to find only the recommended standards is to see if the link follows the pattern http://www.w3.org/TR/<year>/REC-<standardid>-<date>.
Creating a docrepo that is able to download all web standards is then as simple as creating a subclass and setting three class properties. Create this class in the current directory (or anywhere else on your python path) and save it as w3cstandards.py:
from ferenda import DocumentRepository

class W3CStandards(DocumentRepository):
    alias = "w3c"
    start_url = "http://www.w3.org/TR/tr-status-all"
    document_url_regex = r"http://www.w3.org/TR/(?P<year>\d{4})/REC-(?P<basefile>.*)-(?P<date>\d+)"
The first property, alias, is required for all docrepos and controls the alias used by the command line tool for that docrepo, as well as the path where files are stored, amongst other things. If your project has a large collection of docrepos, it’s important that they all have unique aliases.
The other two properties are parameters which the default implementation of download() uses in order to find out which documents to download. start_url is just a simple regular URL, while document_url_regex is a standard re regex with named groups. The group named basefile has special meaning, and will be used as a base for stored files and elsewhere as a short identifier for the document. For example, the web standard found at the URL http://www.w3.org/TR/2012/REC-rdf-plain-literal-20121211/ will have the basefile rdf-plain-literal.
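To see concretely how the named groups in document_url_regex yield the basefile, here is a small standalone sketch using Python’s re module with the same pattern and the example URL above:

```python
import re

# the same pattern as in the W3CStandards class above
document_url_regex = re.compile(
    r"http://www.w3.org/TR/(?P<year>\d{4})/REC-(?P<basefile>.*)-(?P<date>\d+)")

url = "http://www.w3.org/TR/2012/REC-rdf-plain-literal-20121211/"
m = document_url_regex.match(url)
print(m.group("basefile"))  # -> rdf-plain-literal
print(m.group("year"), m.group("date"))  # -> 2012 20121211
```

Links on the index page that don’t match the pattern (drafts, notes, etc.) simply produce no match and are skipped.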
Using ferenda-build.py and registering docrepo classes¶
The next step is to enable our class. Like most tasks, this is done using the command line tool present in your project directory. To register the class (together with a short alias) in your ferenda.ini configuration file, run the following:
$ ./ferenda-build.py w3cstandards.W3CStandards enable
22:16:26 root INFO Enabled class w3cstandards.W3CStandards (alias 'w3c')
This creates a new section in ferenda.ini that just looks like the following:
[w3c]
class = w3cstandards.W3CStandards
From this point on, you can use the class name or the alias “w3c” interchangeably:
$ ./ferenda-build.py w3cstandards.W3CStandards status # verbose
22:16:27 root INFO w3cstandards.W3CStandards status finished in 0.010 sec
Status for document repository 'w3c' (w3cstandards.W3CStandards)
download: None.
parse: None.
generated: None.
$ ./ferenda-build.py w3c status # terse, exactly the same result
Note
When using the API, there is no need (nor possibility) to register docrepo classes. Your client code directly instantiates the class(es) it uses and calls methods on them.
Downloading¶
To test the downloading capabilities of our class, you can run the download method directly from the command line using the command line tool:
$ ./ferenda-build.py w3c download
22:16:31 w3c INFO Downloading max 3 documents
22:16:32 w3c INFO emotionml: downloaded from http://www.w3.org/TR/2014/REC-emotionml-20140522/
22:16:33 w3c INFO MathML3: downloaded from http://www.w3.org/TR/2014/REC-MathML3-20140410/
22:16:33 w3c INFO xml-entity-names: downloaded from http://www.w3.org/TR/2014/REC-xml-entity-names-20140410/
# and so on...
After a few minutes of downloading, the result is a bunch of files in data/w3c/downloaded:
$ ls -1 data/w3c/downloaded
MathML3.html
MathML3.html.etag
emotionml.html
emotionml.html.etag
xml-entity-names.html
xml-entity-names.html.etag
Note
The .etag files are created in order to support Conditional GET, so that we don’t waste our time or the remote server’s bandwidth by re-downloading documents that haven’t changed. They can be ignored and might go away in future versions of Ferenda.
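The idea behind those .etag files can be sketched in a few lines. The helper below is a hypothetical illustration of the mechanism, not Ferenda’s actual implementation: if we saved the ETag from a previous download, we send it back in an If-None-Match header, and the server can answer 304 Not Modified instead of resending the document.

```python
from pathlib import Path

def conditional_headers(path):
    """Build request headers for a conditional GET, reusing a stored ETag.

    Hypothetical helper: given the path of a downloaded file (e.g.
    emotionml.html), look for a sibling .etag file and, if present,
    return an If-None-Match header with its contents.
    """
    etag_file = Path(str(path) + ".etag")
    if etag_file.exists():
        return {"If-None-Match": etag_file.read_text().strip()}
    return {}
```

A downloader would pass these headers along with the request and skip writing the file when the response status is 304.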
We can get an overview of the status of our docrepo using the status command:
$ ./ferenda-build.py w3cstandards.W3CStandards status # verbose
22:16:35 root INFO w3cstandards.W3CStandards status finished in 0.013 sec
Status for document repository 'w3c' (w3cstandards.W3CStandards)
 download: xml-entity-names, rdfa-core, emotionml... (1 more)
 parse: None. Todo: xml-entity-names, rdfa-core, emotionml... (1 more)
 generated: None.
$ ./ferenda-build.py w3c status # terse, exactly the same result
Note
To do the same using the API:
from w3cstandards import W3CStandards
repo = W3CStandards()
repo.download()
repo.status()
# or use repo.get_status() to get all status information in a nested dict
Finally, if the logging information scrolls by too quickly and you want to read it again, take a look in the data/logs directory. Each invocation of ferenda-build.py creates a new log file containing the same information that is written to stdout.
Parsing¶
Let’s try the next step in the workflow, to parse one of the documents we’ve downloaded.
$ ./ferenda-build.py w3c parse rdfa-core
22:16:45 w3c INFO rdfa-core: parse OK (4.863 sec)
22:16:45 root INFO w3c parse finished in 4.935 sec
By now, you might have realized that our command line tool is generally invoked in the following manner:
$ ./ferenda-build.py <docrepo> <command> [argument(s)]
The parse command resulted in one new file being created in data/w3c/parsed.
$ ls -1 data/w3c/parsed
rdfa-core.xhtml
And we can again use the status command to get a comprehensive overview of our document repository.
$ ./ferenda-build.py w3c status
22:16:47 root INFO w3c status finished in 0.032 sec
Status for document repository 'w3c' (w3cstandards.W3CStandards)
download: xml-entity-names, rdfa-core, emotionml... (1 more)
parse: rdfa-core. Todo: xml-entity-names, emotionml, MathML3.
generated: None. Todo: rdfa-core.
Note that by default, subsequent invocations of parse won’t actually parse documents that don’t need parsing.
$ ./ferenda-build.py w3c parse rdfa-core
22:16:50 root INFO w3c parse finished in 0.019 sec
But during development, when you change the parsing code frequently, you’ll need to override this through the --force flag (or set the force parameter in ferenda.ini).
$ ./ferenda-build.py w3c parse rdfa-core --force
22:16:56 w3c INFO rdfa-core: parse OK (5.123 sec)
22:16:56 root INFO w3c parse finished in 5.166 sec
Note
To do the same using the API:
from w3cstandards import W3CStandards
repo = W3CStandards(force=True)
repo.parse("rdfa-core")
Note also that you can parse all downloaded documents through the --all flag, and control logging verbosity through the --loglevel flag.
$ ./ferenda-build.py w3c parse --all --loglevel=DEBUG
22:16:59 w3c DEBUG xml-entity-names: Starting
22:16:59 w3c DEBUG xml-entity-names: Created data/w3c/parsed/xml-entity-names.xhtml
22:17:00 w3c DEBUG xml-entity-names: 6 triples extracted to data/w3c/distilled/xml-entity-names.rdf
22:17:00 w3c INFO xml-entity-names: parse OK (0.717 sec)
22:17:00 w3c DEBUG emotionml: Starting
22:17:00 w3c DEBUG emotionml: Created data/w3c/parsed/emotionml.xhtml
22:17:01 w3c DEBUG emotionml: 11 triples extracted to data/w3c/distilled/emotionml.rdf
22:17:01 w3c INFO emotionml: parse OK (1.174 sec)
22:17:01 w3c DEBUG MathML3: Starting
22:17:01 w3c DEBUG MathML3: Created data/w3c/parsed/MathML3.xhtml
22:17:01 w3c DEBUG MathML3: 8 triples extracted to data/w3c/distilled/MathML3.rdf
22:17:01 w3c INFO MathML3: parse OK (0.332 sec)
22:17:01 root INFO w3c parse finished in 2.247 sec
Note
To do the same using the API:
import logging
from w3cstandards import W3CStandards
# client code is responsible for setting the effective log level -- ferenda
# just emits log messages, and depends on the caller to set up the logging
# subsystem in an appropriate way
logging.getLogger().setLevel(logging.INFO)
repo = W3CStandards()
for basefile in repo.store.list_basefiles_for("parse"):
    # You might want to try/catch the exception
    # ferenda.errors.ParseError or any of its children here
    repo.parse(basefile)
Note that the API makes you explicitly list and iterate over any available files. This is so that client code has the opportunity to parallelize this work in an appropriate way.
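Since list_basefiles_for() just yields basefiles, parallelizing with the standard library is straightforward. The sketch below uses a thread pool over a plain list; parse_one is a stand-in for repo.parse, since instantiating a real docrepo is outside the scope of this example:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def parse_one(basefile):
    # stand-in for repo.parse(basefile); the real call could raise
    # ferenda.errors.ParseError, which we'd want to catch per-document
    return "%s: parse OK" % basefile

# in real code: basefiles = repo.store.list_basefiles_for("parse")
basefiles = ["xml-entity-names", "emotionml", "MathML3"]

with ThreadPoolExecutor(max_workers=4) as pool:
    futures = {pool.submit(parse_one, b): b for b in basefiles}
    for future in as_completed(futures):
        print(future.result())
```

Whether threads, processes or a job queue is appropriate depends on your parsing workload; that choice is deliberately left to client code.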
If we take a look at the files created in data/w3c/distilled, we see some metadata for each document. This metadata has been automatically extracted from RDFa statements in the XHTML documents, but is so far very spartan.
Now take a look at the files created in data/w3c/parsed. The default implementation of parse() processes the DOM of the main body of the document, but some tags and attributes that are used only for formatting are stripped, such as <style> and <script>.
These documents have quite a lot of “boilerplate” text, such as tables of contents and links to latest and previous versions, which we’d like to remove so that just the actual text is left (problem 1). And we’d like to explicitly extract some parts of the document and represent these as metadata for the document – for example the title, the publication date, the authors/editors of the document and its abstract, if available (problem 2).
Just like the default implementation of download() allowed for some customization using class variables, we can solve problem 1 by setting two additional class variables:
parse_content_selector = "body"
parse_filter_selectors = ["div.toc", "div.head"]
The parse_content_selector member specifies, using CSS selector syntax, the part of the document which contains our main text. It defaults to "body", and can often be set to ".content" (the first element that has a class="content" attribute), "#main-text" (any element with the id "main-text"), "article" (the first <article> element) or similar. The parse_filter_selectors member is a list of similar selectors, with the difference that all matching elements are removed from the tree. In this case, we use it to remove some boilerplate sections that often appear within the content specified by parse_content_selector, but which we don’t want to appear in the final result.
In order to solve problem 2, we can override one of the methods that the default implementation of parse() calls:
def parse_metadata_from_soup(self, soup, doc):
    from rdflib import Namespace
    from ferenda import Describer
    from ferenda import util
    import re
    DCTERMS = Namespace("http://purl.org/dc/terms/")
    FOAF = Namespace("http://xmlns.com/foaf/0.1/")
    d = Describer(doc.meta, doc.uri)
    d.rdftype(FOAF.Document)
    d.value(DCTERMS.title, soup.find("title").text, lang=doc.lang)
    d.value(DCTERMS.abstract, soup.find(True, "abstract"), lang=doc.lang)
    # find the issued date -- assume it's the first thing that looks
    # like a date on the form "22 August 2013"
    re_date = re.compile(r'(\d+ \w+ \d{4})')
    datenode = soup.find(text=re_date)
    datestr = re_date.search(datenode).group(1)
    d.value(DCTERMS.issued, util.strptime(datestr, "%d %B %Y"))
    editors = soup.find("dt", text=re.compile("Editors?:"))
    for editor in editors.find_next_siblings("dd"):
        editor_name = editor.text.strip().split(", ")[0]
        d.value(DCTERMS.editor, editor_name)
parse_metadata_from_soup() is called with a document object and the parsed HTML document in the form of a BeautifulSoup object. It is the responsibility of parse_metadata_from_soup() to add document-level metadata for this document, such as its title, publication date, and similar. Note that parse_metadata_from_soup() is run before the parse_content_selector and parse_filter_selectors are applied, so the BeautifulSoup object passed into it contains the entire document.
Note
The selectors are passed to BeautifulSoup.select(), which supports a subset of the CSS selector syntax. If you stick with simple tag, id and class-based selectors you should be fine.
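The effect of the two selector variables can be demonstrated in isolation with BeautifulSoup. This is a standalone sketch of the mechanism (select the content subtree, then remove filtered elements from it), not Ferenda’s actual parse code; the tiny HTML fragment is made up for illustration:

```python
from bs4 import BeautifulSoup

html = """
<body>
  <div class="head">Boilerplate header</div>
  <div class="toc">Table of contents</div>
  <p>Actual document text.</p>
</body>"""

soup = BeautifulSoup(html, "html.parser")
# keep the subtree matched by parse_content_selector ...
content = soup.select("body")[0]
# ... and remove everything matched by parse_filter_selectors
for selector in ["div.toc", "div.head"]:
    for node in content.select(selector):
        node.decompose()
print(content.get_text(strip=True))  # -> Actual document text.
```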
Now, if you run parse --force again, both documents and metadata are in better shape. Further down the line, the value of properly extracted metadata will become more obvious.
Republishing the parsed content¶
The XHTML contains metadata in RDFa format. As such, you can extract all that metadata and put it into a triple store. The relate command does this, as well as creating a full text index of all textual content:
$ ./ferenda-build.py w3c relate --all
22:17:03 w3c INFO xml-entity-names: relate OK (0.618 sec)
22:17:04 w3c INFO rdfa-core: relate OK (1.542 sec)
22:17:06 w3c INFO emotionml: relate OK (1.647 sec)
22:17:08 w3c INFO MathML3: relate OK (1.604 sec)
22:17:08 w3c INFO Dumped 34 triples from context http://localhost:8000/dataset/w3c to data/w3c/distilled/dump.nt (0.007 sec)
22:17:08 root INFO w3c relate finished in 5.555 sec
The next step is to create a number of resource files (placed under data/rsrc). These resource files include CSS and JavaScript files for the new website we’re creating, as well as an XML configuration file used by the XSLT transformation done by generate below:
$ ./ferenda-build.py w3c makeresources
22:17:08 ferenda.resources INFO Wrote data/rsrc/resources.xml
$ find data/rsrc -print
data/rsrc
data/rsrc/api
data/rsrc/api/common.json
data/rsrc/api/context.json
data/rsrc/api/terms.json
data/rsrc/css
data/rsrc/css/ferenda.css
data/rsrc/css/main.css
data/rsrc/css/normalize-1.1.3.css
data/rsrc/img
data/rsrc/img/navmenu-small-black.png
data/rsrc/img/navmenu.png
data/rsrc/img/search.png
data/rsrc/js
data/rsrc/js/ferenda.js
data/rsrc/js/jquery-1.10.2.js
data/rsrc/js/modernizr-2.6.3.js
data/rsrc/js/respond-1.3.0.js
data/rsrc/resources.xml
Note
It is possible to combine and minify both JavaScript and CSS files using the combineresources option in the configuration file.
Running makeresources is needed for the final few steps.
$ ./ferenda-build.py w3c generate --all
22:17:14 w3c INFO xml-entity-names: generate OK (1.728 sec)
22:17:14 w3c INFO rdfa-core: generate OK (0.242 sec)
22:17:14 w3c INFO emotionml: generate OK (0.336 sec)
22:17:14 w3c INFO MathML3: generate OK (0.216 sec)
22:17:14 root INFO w3c generate finished in 2.535 sec
The generate command creates browser-ready HTML5 documents from our structured XHTML documents, using our site’s navigation.
$ ./ferenda-build.py w3c toc
22:17:17 w3c INFO Created data/w3c/toc/dcterms_issued/2014.html
22:17:17 w3c INFO Created data/w3c/toc/dcterms_title/m.html
22:17:17 w3c INFO Created data/w3c/toc/dcterms_title/r.html
22:17:17 w3c INFO Created data/w3c/toc/dcterms_title/x.html
22:17:18 w3c INFO Created data/w3c/toc/index.html
22:17:18 root INFO w3c toc finished in 2.059 sec
$ ./ferenda-build.py w3c news
21:43:55 w3c INFO feed type/document: 4 entries
22:17:19 w3c INFO feed main: 4 entries
22:17:19 root INFO w3c news finished in 0.115 sec
$ ./ferenda-build.py w3c frontpage
22:17:21 root INFO frontpage: wrote data/index.html (0.112 sec)
The toc and news commands create static files for general indexes/tables of contents of all documents in our docrepo, as well as Atom feeds, and the frontpage command creates a suitable frontpage for the site as a whole.
Note
To do all of the above using the API:
from ferenda import manager
from w3cstandards import W3CStandards
repo = W3CStandards()
for basefile in repo.store.list_basefiles_for("relate"):
repo.relate(basefile)
manager.makeresources([repo], sitename="Standards", sitedescription="W3C standards, in a new form")
for basefile in repo.store.list_basefiles_for("generate"):
repo.generate(basefile)
repo.toc()
repo.news()
manager.frontpage([repo])
Finally, to start a development web server and check out the finished result:
$ ./ferenda-build.py w3c runserver
$ open http://localhost:8080/
Now you’ve created your own web site with structured documents. It contains listings of all documents, feeds with updated documents (in both HTML and Atom flavors), full text search, and an API. In order to deploy your site, you can run it under Apache+mod_wsgi, ngnix+uWSGI, Gunicorn or just about any WSGI capable web server, see The WSGI app.
Note
Using runserver() from the API does not really make any sense. If your environment supports running WSGI applications, see the above link for information about how to get the ferenda WSGI application. Otherwise, the app can be run by any standard WSGI host.
To keep it up-to-date whenever the W3C issues new standards, use the following command:
$ ./ferenda-build.py w3c all
22:17:25 w3c INFO Downloading max 3 documents
22:17:25 root INFO w3cstandards.W3CStandards download finished in 2.648 sec
22:17:25 root INFO w3cstandards.W3CStandards parse finished in 0.019 sec
22:17:25 root INFO w3cstandards.W3CStandards relate: Nothing to do!
22:17:25 root INFO w3cstandards.W3CStandards relate finished in 0.025 sec
22:17:25 ferenda.resources INFO Wrote data/rsrc/resources.xml
22:17:29 root INFO w3cstandards.W3CStandards generate finished in 0.006 sec
22:17:32 root INFO w3cstandards.W3CStandards toc finished in 3.376 sec
22:17:34 w3c INFO feed type/document: 4 entries
22:17:32 w3c INFO feed main: 4 entries
22:17:32 root INFO w3cstandards.W3CStandards news finished in 0.063 sec
22:17:32 root INFO frontpage: wrote data/index.html (0.017 sec)
The “all” command is an alias that runs download, parse --all, relate --all, makeresources, generate --all, toc and news in sequence.
Note
The API doesn’t have any corresponding method. Just run all of the above code again. As long as you don’t pass the force=True parameter when creating the docrepo instance, ferenda’s dependency management should make sure that documents aren’t needlessly re-parsed etc.
This 20-line example of a docrepo took a lot of shortcuts by depending on the default implementations of the download() and parse() methods. Ferenda tries to make it really easy to get something up and running quickly, and then to improve each step incrementally.
In the next section, Creating your own document repositories, we will take a closer look at each of the six main steps (download, parse, relate, generate, toc and news), including how to completely replace the built-in methods. You can also take a look at the source code for ferenda.sources.tech.W3Standards, which contains a more complete (and substantially longer) implementation of download(), parse() and the others.