Setting up external databases

Ferenda stores data in three substantially different ways:

  • Documents are stored in the file system
  • RDF Metadata is stored in in a triple store
  • Document text is stored in a fulltext search engine.

There are many capable and performant triple stores and fulltext search engines available, and ferenda supports a few of them. The default choice for both are embedded solutions (using RDFLib + SQLite for a triple store and Whoosh for a fulltext search engine) so that you can get a small system going without installing and configuring additional server processess. However, these choices do not work well with medium to large datasets, so when you start feeling that indexing and searching is getting slow, you should run an external triplestore and an external fulltext search engine.

If you’re using the project framework, you set the configuration values storetype and indextype to new values. You’ll find that the ferenda-setup tool creates a ferenda.ini that specifies storetype and indextype, based on whether it can find Fuseki, Sesame and/or ElasticSearch running on their default ports on localhost. You still might have to do extra configuration, particularly if you’re using Sesame as a triple store.

If you setup any of the external databases after running ferenda-setup, or you want to use some other configuration than what ferenda-setup selected for you, you can still set the configuration values in ferenda.ini by editing the file as described below.

If you are running any of the external databases, but in a non-default location (including remote locations) you can set the environment variables FERENDA_TRIPLESTORE_LOCATION and/or FERENDA_FULLTEXTINDEX_LOCATION to the full URL before running ferenda-setup.

Triple stores

There are four choices.

RDFLib + SQLite

In ferenda.ini:

[__root__]
storetype = SQLITE
storelocation = data/ferenda.sqlite # single file
storerepository = <projectname>

This is the simplest way to get up and running, requiring no configuration or installs on any platform.

RDFLib + Sleepycat (aka bsddb)

In ferenda.ini:

[__root__]
storetype = SLEEPYCAT
storelocation = data/ferenda.db # directory
storerepository = <projectname>

This requires that bsddb (part of the standard library for python 2) or bsddb3 (separate package) is available and working (which can be a bit of pain on many platforms). Furthermore it’s less stable and slower than RDFLib + SQLite, so it can’t really be recommended. But since it’s the only persistant storage directly supported by RDFLib, it’s supported by Ferenda as well.

Sesame

In ferenda.ini:

[__root__]
storetype = SESAME
storelocation = http://localhost:8080/openrdf-sesame
storerepository = <projectname>

Sesame is a framework and a set of java web applications that normally runs within a Tomcat application server. If you’re comfortable with Tomcat and servlet containers you can get started with this quickly, see their installation instructions. You’ll need to install both the actual Sesame Server and the OpenRDF workbench.

After installing it and configuring ferenda.ini to use it, you’ll need to use the OpenRDF workbench app (at http://localhost:8080/openrdf-workbench by default) to create a new repository. The recommended settings are:

Type: Native Java store
ID: <projectname> # eg same as storerepository in ferenda.ini
Title: Ferenda repository for <projectname>
Triple indexes: spoc,posc,cspo,opsc,psoc

It’s much faster than the RDFLib-based stores and is fairly stable (although Ferenda’s usage patterns seem to sometimes make simple operations take a disproportionate amount of time).

Fuseki

In ferenda.ini:

[__root__]
storetype = SESAME
storelocation = http://localhost:3030
storerepository = ds

Fuseki is a simple java server that implements most SPARQL standards and can be run without any complicated setup. It can keep data purely in memory or store it on disk. The above configuration works with the default configuration of Fuseki - just download it and run fuseki-server

Fuseki seems to be the fastest triple store that Ferenda supports, at least with Ferendas usage patterns. Since it’s also the easiest to set up, it’s the recommended triple store once RDFLib + SQLite isn’t enough.

Fulltext search engines

There are two choices.

Whoosh

In ferenda.ini:

[__root__]
indextype = WHOOSH
indexlocation = data/whooshindex

Whoosh is an embedded python fulltext search engine, which requires no setup (it’s automatically installed when installing ferenda with pip or easy_install), works reasonably well with small to medium amounts of data, and performs quick searches. However, once the index grows beyond a few hundred MB, indexing of new material begins to slow down.

Elasticsearch

In ferenda.ini:

[__root__]
indextype = ELASTICSEARCH
indexlocation = http://localhost:9200/ferenda/

Elasticsearch is a distributed fulltext search engine in java which can run in a distributed fashion and which is accessed through a simple JSON/REST API. It’s easy to setup – just download it and run bin/elasticsearch as per the instructions. Ferenda’s support for Elasticsearch is new and not yet stable, but it should be able to handle much larger amounts of data.