Setting up external databases¶
Ferenda stores data in three substantially different ways:
- Documents are stored in the file system
- RDF Metadata is stored in in a triple store
- Document text is stored in a fulltext search engine.
There are many capable and performant triple stores and fulltext search engines available, and ferenda supports a few of them. The default choice for both are embedded solutions (using RDFLib + SQLite for a triple store and Whoosh for a fulltext search engine) so that you can get a small system going without installing and configuring additional server processess. However, these choices do not work well with medium to large datasets, so when you start feeling that indexing and searching is getting slow, you should run an external triplestore and an external fulltext search engine.
If you’re using the project framework, you set the configuration
indextype to new values. You’ll find that
ferenda-setup tool creates a
ferenda.ini that specifies
indextype, based on whether it can find Fuseki,
Sesame and/or ElasticSearch running on their default ports on
localhost. You still might have to do extra configuration,
particularly if you’re using Sesame as a triple store.
If you setup any of the external databases after running
ferenda-setup, or you want to use some other configuration than
ferenda-setup selected for you, you can still set the
configuration values in
ferenda.ini by editing the file as
If you are running any of the external databases, but in a non-default
location (including remote locations) you can set the environment
FERENDA_FULLTEXTINDEX_LOCATION to the full URL before running
There are four choices.
RDFLib + SQLite¶
[__root__] storetype = SQLITE storelocation = data/ferenda.sqlite # single file storerepository = <projectname>
This is the simplest way to get up and running, requiring no configuration or installs on any platform.
RDFLib + Sleepycat (aka
[__root__] storetype = SLEEPYCAT storelocation = data/ferenda.db # directory storerepository = <projectname>
This requires that
bsddb (part of the standard library for python 2) or
bsddb3 (separate package) is available and working (which can be a bit of pain on many platforms). Furthermore it’s less stable and slower than RDFLib + SQLite, so it can’t really be recommended. But since it’s the only persistant storage directly supported by RDFLib, it’s supported by Ferenda as well.
[__root__] storetype = SESAME storelocation = http://localhost:8080/openrdf-sesame storerepository = <projectname>
Sesame is a framework and a set of java web applications that normally runs within a Tomcat application server. If you’re comfortable with Tomcat and servlet containers you can get started with this quickly, see their installation instructions. You’ll need to install both the actual Sesame Server and the OpenRDF workbench.
After installing it and configuring
ferenda.ini to use it, you’ll need to use the OpenRDF workbench app (at
http://localhost:8080/openrdf-workbench by default) to create a new repository. The recommended settings are:
Type: Native Java store ID: <projectname> # eg same as storerepository in ferenda.ini Title: Ferenda repository for <projectname> Triple indexes: spoc,posc,cspo,opsc,psoc
It’s much faster than the RDFLib-based stores and is fairly stable (although Ferenda’s usage patterns seem to sometimes make simple operations take a disproportionate amount of time).
[__root__] storetype = SESAME storelocation = http://localhost:3030 storerepository = ds
Fuseki is a simple java server that implements most SPARQL standards and can be run without any complicated setup. It can keep data purely in memory or store it on disk. The above configuration works with the default configuration of Fuseki - just download it and run
Fuseki seems to be the fastest triple store that Ferenda supports, at least with Ferendas usage patterns. Since it’s also the easiest to set up, it’s the recommended triple store once RDFLib + SQLite isn’t enough.
Fulltext search engines¶
There are two choices.
[__root__] indextype = WHOOSH indexlocation = data/whooshindex
Whoosh is an embedded python fulltext search engine, which requires no setup (it’s automatically installed when installing ferenda with
easy_install), works reasonably well with small to medium amounts of data, and performs quick searches. However, once the index grows beyond a few hundred MB, indexing of new material begins to slow down.
[__root__] indextype = ELASTICSEARCH indexlocation = http://localhost:9200/ferenda/
Elasticsearch is a distributed fulltext search engine in java which can run in a distributed fashion and which is accessed through a simple JSON/REST API. It’s easy to setup – just download it and run
bin/elasticsearch as per the instructions. Ferenda’s support for Elasticsearch is new and not yet stable, but it should be able to handle much larger amounts of data.