Setting up external databases¶
Ferenda stores data in three substantially different ways:
- Documents are stored in the file system
- RDF metadata is stored in a triple store
- Document text is stored in a fulltext search engine.
There are many capable and performant triple stores and fulltext search engines available, and Ferenda supports a few of them. The default choice for both is an embedded solution (RDFLib + SQLite for the triple store and Whoosh for the fulltext search engine), so that you can get a small system going without installing and configuring additional server processes. However, these choices do not work well with medium to large datasets, so when indexing and searching start to feel slow, you should run an external triple store and an external fulltext search engine.
If you’re using the project framework, you set the configuration values storetype and indextype to new values. You’ll find that the ferenda-setup tool creates a ferenda.ini that specifies storetype and indextype, based on whether it can find Fuseki, Sesame and/or ElasticSearch running on their default ports on localhost. You still might have to do extra configuration, particularly if you’re using Sesame as a triple store.
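The detection step can be approximated with a plain TCP probe. This is a minimal sketch and not ferenda-setup's actual logic: the port numbers are taken from the default URLs used on this page (3030 for Fuseki, 8080 for Sesame, 9200 for Elasticsearch), and the helper names `port_open` and `detect_services` are made up for illustration.

```python
import socket

# Default ports, taken from the example URLs in this document.
DEFAULT_PORTS = {"FUSEKI": 3030, "SESAME": 8080, "ELASTICSEARCH": 9200}


def port_open(host, port, timeout=0.5):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolvable hosts.
        return False


def detect_services(host="localhost", ports=DEFAULT_PORTS):
    """Map each service name to whether its port is reachable."""
    return {name: port_open(host, port) for name, port in ports.items()}


if __name__ == "__main__":
    for name, running in detect_services().items():
        print(name, "reachable" if running else "not reachable")
```

An open port only shows that some process is listening there. It does not prove that the process is the expected database, so treat the result as a hint.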
If you set up any of the external databases after running ferenda-setup, or you want to use a different configuration than the one ferenda-setup selected for you, you can still set the configuration values in ferenda.ini by editing the file as described below.
If you are running any of the external databases in a non-default location (including remote locations), you can set the environment variables FERENDA_TRIPLESTORE_LOCATION and/or FERENDA_FULLTEXTINDEX_LOCATION to the full URL before running ferenda-setup.
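If you drive the setup from a script, the two variables can be placed in a copy of the environment and handed to the child process. The sketch below is an illustration under assumptions: the host names and the project name are hypothetical, and the URL shapes are modelled on the storelocation and indexlocation values shown on this page rather than on a documented format.

```python
import os


def setup_environment(triplestore_url=None, fulltext_url=None, base_env=None):
    """Return a copy of the environment with the location variables set."""
    env = dict(os.environ if base_env is None else base_env)
    if triplestore_url:
        env["FERENDA_TRIPLESTORE_LOCATION"] = triplestore_url
    if fulltext_url:
        env["FERENDA_FULLTEXTINDEX_LOCATION"] = fulltext_url
    return env


env = setup_environment(
    triplestore_url="http://db.example.org:3030",  # hypothetical host
    fulltext_url="http://search.example.org:9200/ferenda/",  # hypothetical host
)

# To run the setup tool with these values (the project name is hypothetical):
#   import subprocess
#   subprocess.run(["ferenda-setup", "myproject"], env=env, check=True)
```

Working on a copy leaves the environment of the calling process untouched, so the overrides apply only to that one ferenda-setup run.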
Triple stores¶
There are four choices.
RDFLib + SQLite¶
    [__root__]
    storetype = SQLITE
    storelocation = data/ferenda.sqlite # single file
    storerepository = <projectname>
This is the simplest way to get up and running, requiring no configuration or installs on any platform.
RDFLib + Sleepycat (aka bsddb)¶
    [__root__]
    storetype = SLEEPYCAT
    storelocation = data/ferenda.db # directory
    storerepository = <projectname>
This requires that bsddb (part of the standard library for Python 2) or bsddb3 (a separate package) is available and working, which can be a bit of a pain on many platforms. Furthermore, it’s less stable and slower than RDFLib + SQLite, so it can’t really be recommended. But since it’s the only persistent storage directly supported by RDFLib, it’s supported by Ferenda as well.
Sesame¶
    [__root__]
    storetype = SESAME
    storelocation = http://localhost:8080/openrdf-sesame
    storerepository = <projectname>
Sesame is a framework and a set of Java web applications that normally run within a Tomcat application server. If you’re comfortable with Tomcat and servlet containers, you can get started with this quickly; see their installation instructions. You’ll need to install both the actual Sesame Server and the OpenRDF workbench.
After installing it and configuring ferenda.ini to use it, you’ll need to use the OpenRDF workbench app (at http://localhost:8080/openrdf-workbench by default) to create a new repository. The recommended settings are:
    Type: Native Java store
    ID: <projectname> # the same as storerepository in ferenda.ini
    Title: Ferenda repository for <projectname>
    Triple indexes: spoc,posc,cspo,opsc,psoc
Sesame is much faster than the RDFLib-based stores and is fairly stable (although Ferenda’s usage patterns sometimes seem to make simple operations take a disproportionate amount of time).
Fuseki¶
    [__root__]
    storetype = FUSEKI
    storelocation = http://localhost:3030
    storerepository = ds
Fuseki is a simple Java server that implements most SPARQL standards and can be run without any complicated setup. It can keep data purely in memory or store it on disk. The above configuration works with the default configuration of Fuseki: just download it and run fuseki-server.
Fuseki seems to be the fastest triple store that Ferenda supports, at least with Ferenda’s usage patterns. Since it’s also the easiest to set up, it’s the recommended triple store once RDFLib + SQLite isn’t enough.
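Switching an existing project from the embedded store to an external one comes down to changing three values in the [__root__] section of ferenda.ini. The following is a minimal sketch using Python's standard configparser module. The function name `set_triplestore` is made up for illustration, and the storetype value FUSEKI in the usage line is assumed to be the value Ferenda expects for Fuseki.

```python
import configparser


def set_triplestore(ini_path, storetype, storelocation, storerepository):
    """Rewrite the triple store settings in the [__root__] section."""
    # interpolation=None keeps any '%' characters in existing values literal.
    config = configparser.ConfigParser(interpolation=None)
    config.read(ini_path)  # a missing file is silently treated as empty
    if not config.has_section("__root__"):
        config.add_section("__root__")
    root = config["__root__"]
    root["storetype"] = storetype
    root["storelocation"] = storelocation
    root["storerepository"] = storerepository
    # Caveat: configparser drops comments when it rewrites the file.
    with open(ini_path, "w") as fp:
        config.write(fp)


# Usage, with the values from the Fuseki configuration above:
#   set_triplestore("ferenda.ini", "FUSEKI", "http://localhost:3030", "ds")
```

Other settings in the file, such as indextype, are written back unchanged. If your ferenda.ini carries comments you want to keep, edit it by hand instead.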
Fulltext search engines¶
There are two choices.
Whoosh¶
    [__root__]
    indextype = WHOOSH
    indexlocation = data/whooshindex
Whoosh is an embedded Python fulltext search engine. It requires no setup (it’s automatically installed when installing Ferenda with pip or easy_install), works reasonably well with small to medium amounts of data, and performs quick searches. However, once the index grows beyond a few hundred MB, indexing of new material begins to slow down.
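To see whether your index is approaching that size, you can sum the file sizes in the index directory. This is a small sketch with made-up helper names. The 300 MB threshold is an arbitrary stand-in for "a few hundred MB", and the default path is the indexlocation from the configuration above.

```python
import os


def directory_size_mb(path):
    """Sum the sizes of all regular files below path, in megabytes."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if os.path.isfile(full):
                total += os.path.getsize(full)
    return total / (1024 * 1024)


def index_is_large(path="data/whooshindex", threshold_mb=300):
    """Return True if the index directory exceeds the (assumed) threshold."""
    return directory_size_mb(path) > threshold_mb
```

A directory that does not exist is reported as 0 MB, so `index_is_large` returns False before the first indexing run.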
Elasticsearch¶
    [__root__]
    indextype = ELASTICSEARCH
    indexlocation = http://localhost:9200/ferenda/
Elasticsearch is a fulltext search engine written in Java that can run in a distributed fashion and is accessed through a simple JSON/REST API. It’s easy to set up: just download it and run bin/elasticsearch as per the instructions. Ferenda’s support for Elasticsearch is new and not yet stable, but it should be able to handle much larger amounts of data than Whoosh.
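Because the API is plain HTTP and JSON, you can check that the server behind indexlocation is reachable with nothing but the standard library. The sketch below rests on assumptions: it reads the path component of indexlocation as the index name, and the helper names are made up for illustration. It relies only on Elasticsearch answering a GET on its root URL with a JSON document.

```python
import json
from urllib.parse import urlsplit
from urllib.request import urlopen


def split_indexlocation(indexlocation):
    """Split e.g. 'http://localhost:9200/ferenda/' into (server URL, index name)."""
    parts = urlsplit(indexlocation)
    server = f"{parts.scheme}://{parts.netloc}"
    index = parts.path.strip("/")
    return server, index


def server_info(indexlocation, timeout=2):
    """Fetch the JSON document that the server returns at its root URL."""
    server, _index = split_indexlocation(indexlocation)
    with urlopen(server + "/", timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage (requires a running server):
#   info = server_info("http://localhost:9200/ferenda/")
#   print(info.get("version"))
```

If no server is listening, `urlopen` raises `urllib.error.URLError`. Catch it if you want to fall back to Whoosh instead of failing.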