Introduction to Ferenda¶
Ferenda is a Python library and framework for transforming unstructured document collections into structured Linked Data. It helps with downloading documents, parsing them to add explicit semantic structure and RDF-based metadata, finding relationships between documents, and republishing the results.
It uses the XHTML and RDFa standards for representing semantic structure, and republishes content using Linked Data principles and a REST-based API.
Ferenda works best for large document collections that have some degree of internal standardization, such as the laws of a particular country, technical standards, or reports published in a series. It is particularly useful for collections that contain explicit references between documents, within or across collections.
It is designed to make it easy to get started with basic downloading, parsing and republishing of documents, and then to improve each step incrementally.
Example¶
Ferenda can be used either as a library or as a command-line tool. This code uses the Ferenda API to create a website containing all(*) RFCs and W3C recommended standards.
from ferenda.sources.tech import RFC, W3Standards
from ferenda.manager import makeresources, frontpage, runserver, setup_logger
from ferenda.errors import DocumentRemovedError, ParseError, FSMStateError

config = {'datadir': 'netstandards/exampledata',
          'loglevel': 'DEBUG',
          'force': False,
          'storetype': 'SQLITE',
          'storelocation': 'netstandards/exampledata/netstandards.sqlite',
          'storerepository': 'netstandards',
          'downloadmax': 50  # remove this to download everything
          }
setup_logger(level='DEBUG')

# Set up two document repositories
docrepos = (RFC(**config), W3Standards(**config))

for docrepo in docrepos:
    # Download a bunch of documents
    docrepo.download()
    # Parse all downloaded documents
    for basefile in docrepo.store.list_basefiles_for("parse"):
        try:
            docrepo.parse(basefile)
        except ParseError as e:
            pass  # or handle this in an appropriate way
    # Index the text content and metadata of all parsed documents
    for basefile in docrepo.store.list_basefiles_for("relate"):
        docrepo.relate(basefile, docrepos)

# Prepare various assets for web site navigation
makeresources(docrepos,
              resourcedir="netstandards/exampledata/rsrc",
              sitename="Netstandards",
              sitedescription="A repository of internet standard documents")

# Relate for all repos must run before generate for any repo
for docrepo in docrepos:
    # Generate static HTML files from the parsed documents,
    # with back- and forward links between them, etc.
    for basefile in docrepo.store.list_basefiles_for("generate"):
        docrepo.generate(basefile)
    # Generate a table of contents of all available documents
    docrepo.toc()
    # Generate feeds of new and updated documents, in HTML and Atom flavors
    docrepo.news()

# Create a frontpage for the entire site
frontpage(docrepos, path="netstandards/exampledata/index.html")

# Start WSGI app at http://localhost:8000/ with navigation,
# document viewing, search and API
# runserver(docrepos, port=8000, documentroot="netstandards/exampledata")
Alternately, using the command line tools and the project framework:
$ ferenda-setup netstandards
$ cd netstandards
$ ./ferenda-build.py ferenda.sources.tech.RFC enable
$ ./ferenda-build.py ferenda.sources.tech.W3Standards enable
$ ./ferenda-build.py all all --downloadmax=50
# $ ./ferenda-build.py all runserver &
# $ open http://localhost:8000/
Note
(*) Actually, it only downloads the 50 most recent of each. Downloading, parsing, indexing and re-generating close to 7000 RFC documents takes several hours. In order to process all documents, remove the downloadmax configuration parameter/command line option, and be prepared to wait. You should also set up an external triple store (see Triple stores) and an external fulltext search engine (see Fulltext search engines).
Prerequisites¶
- Operating system
- Ferenda is tested and works on Unix, Mac OS and Windows.
- Python
- Version 2.6 or newer required, 3.4 recommended. The code base is primarily developed with Python 3, and is heavily dependent on the forward compatibility features introduced in Python 2.6. Python 3.0 and 3.1 are not supported.
- Third-party libraries
- beautifulsoup4, rdflib, html5lib, lxml, requests, whoosh, pyparsing, jsmin, six and their respective requirements. If you install ferenda using easy_install or pip they should be installed automatically. If you're working with a clone of the source repository you can install them with a simple pip install -r requirements.py3.txt (substitute with requirements.py2.txt if you're not yet using Python 3).
- Command-line tools
- For some functionality, certain executables must be present and in your $PATH:
  - PDFReader requires pdftotext and pdftohtml (from poppler, version 0.21 or newer).
    - The crop() method requires convert (from ImageMagick).
    - The convert_to_pdf parameter to read() requires the soffice binary from either OpenOffice or LibreOffice.
    - The ocr_lang parameter to read() requires tesseract (from tesseract-ocr), convert (see above) and tiffcp (from libtiff).
  - WordReader requires antiword to handle old .doc files.
  - TripleStore can perform some operations (bulk up- and download) much faster if curl is installed.
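As a quick way to see which of these optional helper programs are available on a given machine, a small shell loop (a convenience sketch, not part of ferenda itself) might look like this:

```shell
# Report any of the optional helper programs that are not on $PATH.
# The tool list mirrors the requirements above; adjust as needed.
for tool in pdftotext pdftohtml convert soffice tesseract tiffcp antiword curl; do
    command -v "$tool" >/dev/null 2>&1 || echo "missing: $tool"
done
```

Missing tools only disable the corresponding features (e.g. PDF or OCR handling); the core download/parse/generate cycle works without them.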
Once you have a large number of documents and metadata about those documents, you'll need an RDF triple store, either Sesame (at least version 2.7) or Fuseki (at least version 1.0). For document collections small enough to keep all metadata in memory you can get by with only rdflib, using either a SQLite or a Berkeley DB (aka Sleepycat/bsddb) backend. For further information, see Triple stores.
Similarly, once you have a large collection of text (either many short documents, or fewer long documents), you'll need a fulltext search engine to use the search feature (enabled by default). For small document collections the embedded whoosh library is used. Right now, ElasticSearch is the only supported external fulltext search engine.
As a rule of thumb, if your document collection contains over 100 000 RDF triples or 100 000 words, you should start thinking about setting up an external triple store or a fulltext search engine. See Fulltext search engines.
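Switching to external backends is mostly a matter of configuration. The sketch below reuses the config dict from the example above; the FUSEKI/ELASTICSEARCH values and endpoint URLs are illustrative assumptions — check the Triple stores and Fulltext search engines sections for the exact supported values on your version.

```python
# Sketch: the same kind of config dict as in the example above, but
# pointed at external services instead of the embedded SQLite/Whoosh
# backends. Values and URLs are illustrative assumptions -- see the
# Triple stores and Fulltext search engines sections for details.
config = {'datadir': 'netstandards/exampledata',
          'storetype': 'FUSEKI',                      # external triple store
          'storelocation': 'http://localhost:3030/',  # Fuseki endpoint
          'storerepository': 'netstandards',          # dataset/repository name
          'indextype': 'ELASTICSEARCH',               # external fulltext engine
          'indexlocation': 'http://localhost:9200/netstandards/'}
```

The rest of the code (download, parse, relate, generate) stays the same; only the storage and indexing backends change.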
Installing¶
Ferenda should preferably be installed with pip (in fact, it’s the only method tested):
pip install ferenda
You should definitely consider installing ferenda in a virtualenv.
Note
If you want to use the Sleepycat/bsddb backend for storing RDF data together with Python 3, you need to install the bsddb3 module. Even if you're using Python 2 on Mac OS X, you might need to install this module, as the built-in bsddb module often has problems on this platform. It's not automatically installed by easy_install/pip as it has requirements of its own and is not essential.
On Windows, we recommend using a binary distribution of lxml. Unfortunately, at the time of writing, no such official distribution exists for Python 3.3 or later. However, the unofficial distributions available at http://www.lfd.uci.edu/~gohlke/pythonlibs/#lxml have been tested with ferenda on Python 3.3 and later, and seem to work well. The binary distribution installs lxml into the system Python library path. To make lxml available for your virtualenv, use the --system-site-packages command line switch when creating the virtualenv.
Features¶
- Handles downloading, structural parsing and regeneration of large document collections.
- Contains libraries to make reading of plain text, MS Word and PDF documents (including scanned text) as easy as HTML.
- Uses established information standards like XHTML, XSLT, XML namespaces, RDF and SPARQL as much as possible.
- Leverages your favourite python libraries: requests, beautifulsoup, rdflib, lxml, pyparsing and whoosh.
- Handles errors in upstream sources by creating one-off patch files for individual documents.
- Easy to write reference/citation parsers and run them on document text.
- Documents in the same and other collections are automatically cross-referenced.
- Uses caches and dependency management to avoid performing the same work over and over.
- Once documents are downloaded and structured, you get a usable web site with REST API, Atom feeds and search for free.
- Web site generation can create a set of static HTML pages for offline use.
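To illustrate the reference/citation-parser idea (this is not Ferenda's actual citation API, which is built on pyparsing and covered later in the docs), a minimal standalone sketch that recognizes RFC references in plain text and rewrites them as links could look like this:

```python
import re

# Hypothetical stand-in for a reference parser: find "RFC 2616"-style
# citations and rewrite them as hyperlinks. Ferenda's real citation
# parsers emit RDF-aware markup; this only shows the general idea.
RFC_RE = re.compile(r"RFC\s+(\d+)")

def link_rfc_citations(text):
    return RFC_RE.sub(
        lambda m: '<a href="http://tools.ietf.org/html/rfc%s">RFC %s</a>'
                  % (m.group(1), m.group(1)),
        text)

print(link_rfc_citations("See RFC 2616 for details."))
```

In Ferenda proper, such parsers run over the parsed document text, and the resulting links are what make the automatic cross-referencing between documents possible.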
Next step¶
See First steps to set up a project and create your own simple document repository.