Introduction to Ferenda

Ferenda is a Python library and framework for transforming unstructured document collections into structured Linked Data. It helps with downloading documents, parsing them to add explicit semantic structure and RDF-based metadata, finding relationships between documents, and republishing the results.

It uses the XHTML and RDFa standards for representing semantic structure, and republishes content using Linked Data principles and a REST-based API.

Ferenda works best for large document collections that have some degree of internal standardization, such as the laws of a particular country, technical standards, or reports published in a series. It is particularly useful for collections that contain explicit references between documents, within or across collections.

It is designed to make it easy to get started with basic downloading, parsing and republishing of documents, and then to improve each step incrementally.

Example

Ferenda can be used either as a library or as a command-line tool. This code uses the Ferenda API to create a website containing all(*) RFCs and W3C recommended standards.

from ferenda.sources.tech import RFC, W3Standards
from ferenda.manager import makeresources, frontpage, runserver, setup_logger
from ferenda.errors import DocumentRemovedError, ParseError, FSMStateError

config = {'datadir':'netstandards/exampledata', 
          'loglevel':'DEBUG',
          'force':False,
          'storetype':'SQLITE',
          'storelocation':'netstandards/exampledata/netstandards.sqlite',
          'storerepository':'netstandards',
          'downloadmax': 50 # remove this to download everything
}
setup_logger(level='DEBUG')

# Set up two document repositories
docrepos = (RFC(**config), W3Standards(**config))

for docrepo in docrepos:
    # Download a bunch of documents
    docrepo.download()
    
    # Parse all downloaded documents
    for basefile in docrepo.store.list_basefiles_for("parse"):
        try:
            docrepo.parse(basefile)
        except ParseError as e:
            pass  # or handle this in an appropriate way

    # Index the text content and metadata of all parsed documents
    for basefile in docrepo.store.list_basefiles_for("relate"):
        docrepo.relate(basefile, docrepos)

# Prepare various assets for web site navigation
makeresources(docrepos,
              resourcedir="netstandards/exampledata/rsrc",
              sitename="Netstandards",
              sitedescription="A repository of internet standard documents")

# Relate for all repos must run before generate for any repo
for docrepo in docrepos:
    # Generate static HTML files from the parsed documents, 
    # with back- and forward links between them, etc.
    for basefile in docrepo.store.list_basefiles_for("generate"):
        docrepo.generate(basefile)
        
    # Generate a table of contents of all available documents
    docrepo.toc()
    # Generate feeds of new and updated documents, in HTML and Atom flavors
    docrepo.news()

# Create a frontpage for the entire site
frontpage(docrepos,path="netstandards/exampledata/index.html")

# Start WSGI app at http://localhost:8000/ with navigation,
# document viewing, search and API
# runserver(docrepos, port=8000, documentroot="netstandards/exampledata")

Alternatively, using the command-line tools and the project framework:

$ ferenda-setup netstandards
$ cd netstandards
$ ./ferenda-build.py ferenda.sources.tech.RFC enable
$ ./ferenda-build.py ferenda.sources.tech.W3Standards enable
$ ./ferenda-build.py all all --downloadmax=50
# $ ./ferenda-build.py all runserver &
# $ open http://localhost:8000/

Note

(*) Actually, it only downloads the 50 most recent documents from each source. Downloading, parsing, indexing and re-generating close to 7000 RFC documents takes several hours. To process all documents, remove the downloadmax configuration parameter/command line option, and be prepared to wait. You should also set up an external triple store (see Triple stores) and an external fulltext search engine (see Fulltext search engines).

Prerequisites

Operating system
Ferenda is tested and works on Unix, Mac OS and Windows.
Python
Version 2.6 or newer is required, 3.4 recommended. The code base is primarily developed with Python 3, and is heavily dependent on the forward compatibility features introduced in Python 2.6. Python 3.0 and 3.1 are not supported.
Third-party libraries
beautifulsoup4, rdflib, html5lib, lxml, requests, whoosh, pyparsing, jsmin, six and their respective requirements. If you install ferenda using easy_install or pip, they should be installed automatically. If you’re working with a clone of the source repository, you can install them with a simple pip install -r requirements.py3.txt (substitute requirements.py2.txt if you’re not yet using Python 3).
Command-line tools

For some functionality, certain executables must be present and in your $PATH:

  • PDFReader requires pdftotext and pdftohtml (from poppler, version 0.21 or newer); see the sketch after this list.
    • The crop() method requires convert (from ImageMagick).
    • The convert_to_pdf parameter to read() requires the soffice binary from either OpenOffice or LibreOffice.
    • The ocr_lang parameter to read() requires tesseract (from tesseract-ocr), convert (see above) and tiffcp (from libtiff).
  • WordReader requires antiword to handle old .doc files.
  • TripleStore can perform some operations (bulk up- and download) much faster if curl is installed.
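
As a rough sketch of how these tools come into play, the following reads a PDF file with PDFReader. The file and working directory names are placeholders, and the exact read() signature may differ between versions; pdftotext and pdftohtml must be on your $PATH for this to work.

from ferenda import PDFReader

pdf = PDFReader()
# read() shells out to pdftotext/pdftohtml behind the scenes.
# It also accepts the convert_to_pdf and ocr_lang parameters mentioned above.
pdf.read("example.pdf", "workdir")

# The result can be treated as a list of pages, each page a list of textboxes
for page in pdf:
    for textbox in page:
        print(str(textbox))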

Once you have a large number of documents and metadata about those documents, you’ll need an RDF triple store, either Sesame (at least version 2.7) or Fuseki (at least version 1.0). For document collections small enough to keep all metadata in memory you can get by with only rdflib, using either a SQLite or a Berkeley DB (aka Sleepycat/bsddb) backend. For further information, see Triple stores.
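
For example, switching from the embedded SQLite backend to a Fuseki server is only a matter of configuration. The endpoint URL and repository name below are examples; see Triple stores for the exact settings your store expects.

config = {'datadir': 'netstandards/exampledata',
          'storetype': 'FUSEKI',
          'storelocation': 'http://localhost:3030/',  # example endpoint
          'storerepository': 'ds'}                    # example dataset name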

Similarly, once you have a large collection of text (either many short documents, or fewer long documents), you’ll need a fulltext search engine to use the search feature (enabled by default). For small document collections the embedded Whoosh library is used. Right now, ElasticSearch is the only supported external fulltext search engine.
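
Switching to ElasticSearch is likewise done through configuration. The option names and URL below are an assumption following the same pattern as the triple store settings; see Fulltext search engines for the authoritative details.

config = {'indextype': 'ELASTICSEARCH',
          'indexlocation': 'http://localhost:9200/netstandards/'}  # example URL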

As a rule of thumb, if your document collection contains over 100 000 RDF triples or 100 000 words, you should start thinking about setting up an external triple store or a fulltext search engine. See Triple stores and Fulltext search engines.

Installing

Ferenda should preferably be installed with pip (in fact, it’s the only method tested):

pip install ferenda

You should definitely consider installing ferenda in a virtualenv.
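
For example, on a Unix-like system (the environment name is arbitrary):

virtualenv netstandards-env
source netstandards-env/bin/activate
pip install ferenda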

Note

If you want to use the Sleepycat/bsddb backend for storing RDF data together with python 3, you need to install the bsddb3 module. Even if you’re using python 2 on Mac OS X, you might need to install this module, as the built-in bsddb module often has problems on this platform. It’s not automatically installed by easy_install/pip as it has requirements of its own and is not essential.
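
In that case, install it separately (this assumes the Berkeley DB C library is already present on your system):

pip install bsddb3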

On Windows, we recommend using a binary distribution of lxml. Unfortunately, at the time of writing, no such official distribution is available for Python 3.3 or later. However, the unofficial distributions available at http://www.lfd.uci.edu/~gohlke/pythonlibs/#lxml have been tested with ferenda on Python 3.3 and later, and seem to work well.

The binary distribution installs lxml into the system Python library path. To make lxml available for your virtualenv, use the --system-site-packages command line switch when creating the virtualenv.
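
For example, when creating the environment:

virtualenv --system-site-packages netstandards-env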

Features

  • Handles downloading, structural parsing and regeneration of large document collections.
  • Contains libraries to make reading of plain text, MS Word and PDF documents (including scanned text) as easy as reading HTML.
  • Uses established information standards like XHTML, XSLT, XML namespaces, RDF and SPARQL as much as possible.
  • Leverages your favourite python libraries: requests, beautifulsoup, rdflib, lxml, pyparsing and whoosh.
  • Handles errors in upstream sources by creating one-off patch files for individual documents.
  • Makes it easy to write reference/citation parsers and run them on document text.
  • Documents in the same and other collections are automatically cross-referenced.
  • Uses caches and dependency management to avoid performing the same work over and over.
  • Once documents are downloaded and structured, you get a usable web site with REST API, Atom feeds and search for free.
  • Web site generation can create a set of static HTML pages for offline use.

Next step

See First steps to set up a project and create your own simple document repository.