Grouping documents with facets
A collection of documents can be arranged in a set of groups, such as
by year of publication, by document author, or by keyword. With Ferenda,
each such method of grouping is described in the form of a
Facet. By providing a list of Facet objects in
its facets() method, your docrepo
can specify multiple ways of arranging the documents it’s
handling. These facets are used to construct a static Table of
contents for your site, as well as creating Atom feeds of all
documents and defining the fields available for querying when using
the REST API.
A facet object is initialized with a set of parameters that, taken together, define the method of grouping. These include the RDF predicate that contains the data used for grouping, the datatype to be used for that data, functions (or other callables) that sort the data into discrete groups, and other parameters that affect e.g. the sorting order or whether a particular facet is used in a particular context.
Facets are used in several different contexts (see below) but the general steps for applying them are similar. First, all the data that might be needed by the total set of facets is collected. This is normally done by querying the triple store for it. Each facet contains information about which RDF predicate holds the data it needs.
Once this set of data is retrieved, as a giant table with one row for each resource (document), each facet is used to create a set of groups and place each document in zero or more of these groups.
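The grouping step described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not Ferenda's own implementation: the rows, the "issued" key and the helper names year_of and group_rows are all made up for the example.

```python
from collections import defaultdict

# Hypothetical rows, one per document, roughly as they might come back
# from the triple store. The keys "uri" and "issued" are assumptions.
rows = [
    {"uri": "http://example.org/doc/1", "issued": "2013-04-01"},
    {"uri": "http://example.org/doc/2", "issued": "2014-09-15"},
    {"uri": "http://example.org/doc/3", "issued": "2013-11-30"},
    {"uri": "http://example.org/doc/4"},  # no "issued" value at all
]


def year_of(row, binding):
    """Return the group label (the year part of the date) for one row."""
    return row[binding][:4]


def group_rows(rows, binding, grouper):
    """Place every row that has a value for `binding` into a named group."""
    groups = defaultdict(list)
    for row in rows:
        if binding not in row:
            continue  # a document without the data lands in zero groups
        groups[grouper(row, binding)].append(row["uri"])
    return dict(groups)


groups = group_rows(rows, "issued", year_of)
```

Here the documents issued in 2013 share one group, the 2014 document gets its own, and the document without a date is left out of this facet entirely.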
Selectors and identificators
The grouping is primarily done through a selector function. The selector function receives three arguments:
- a dict with some basic information about one document (corresponding to one row),
- the name of the current facet (binding), and
- optionally some repo-dependent extra data in the form of an RDF graph.
It should return a single string, which should be a human-readable
label for a grouping. The selector is called once for every document
in the docrepo, and each document is sorted into one (or more, see
below) group identified by that string. As a simple example, a
selector may group documents into years of publication by finding the
date of the
dcterms:issued property and extracting the year part
of it. The string returned by the selector should be suitable for end-user
display.
Each facet also has a similar function called the identificator function. It receives the same arguments as the selector function, but should return a string that is well suited for e.g. a URI fragment, i.e. one that does not contain spaces or non-ASCII characters.
The Facet class has a number of classmethods that
can act as selectors and/or identificators.
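To make the difference between the two functions concrete, here is a hedged sketch of a selector/identificator pair. The function names, the "publisher" key and the slug rules are assumptions chosen for illustration; they are not taken from the Facet class.

```python
import re
import unicodedata


def publisher_selector(row, binding, extra=None):
    """Human-readable group label: the value itself, as shown to end users."""
    return row[binding]


def publisher_identificator(row, binding, extra=None):
    """URI-friendly counterpart: lowercase ASCII with no spaces."""
    label = publisher_selector(row, binding, extra)
    # Decompose accented characters, then drop anything that is not ASCII.
    ascii_label = (
        unicodedata.normalize("NFKD", label)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Collapse every run of other characters into a single hyphen.
    return re.sub(r"[^a-z0-9]+", "-", ascii_label.lower()).strip("-")


# Hypothetical row; the key name is an assumption for this sketch.
row = {"uri": "http://example.org/doc/1", "publisher": "Århus Universitet"}
label = publisher_selector(row, "publisher")
ident = publisher_identificator(row, "publisher")
```

Both functions take the same three arguments; only the shape of the returned string differs.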
Contexts where facets are used
Table of contents
Each docrepo will have its own set of Table of contents pages. The
TOC for a docrepo will contain one set of pages for each defined
facet for which use_for_toc is set to True.
Atom feeds
Each docrepo will have a set of feedsets, where each feedset is based
on a facet (only those that have the property
use_for_feed set to
True). The structure of each feedset will mirror the structure of
each set of TOC pages, and reuse the same selector and identificator
methods. It makes sense to have a separate feed for e.g. each publisher
or subject matter in a repository that comprises a reasonable number
of publishers and subject matters (using
dcterms:subject as the base for facets), but it does not make much
sense to e.g. have a feed for all documents published in 1975 (using
dcterms:published as the base for a facet). Therefore, the default
Furthermore, a “main” feedset with a single feed containing all documents is also constructed.
The feeds are always sorted by the updated property (most recently
updated first), taken from the corresponding
The fulltext index
The metadata that each facet uses is stored as a separate field in the fulltext index. A Facet can specify exactly how its field should be stored (i.e. whether the field should be boosted in any particular way). Note that the data stored in the fulltext index is not passed through the selector function; the original RDF data is stored as-is.
The ReST API
The ReST API uses all defined facets for all repos
simultaneously. This means that you can query for e.g. all documents
published in a certain year, and get results from all docrepos. This
requires that the defined facets don’t clash, e.g. that you don’t have
two facets based on
dcterms:publisher where one uses URI
references and the other uses literals.
Grouping a document in several groups
If a docrepo uses a facet that has
multiple_values set to
True, it’s possible for that facet to categorize the document in
more than one group (a typical use case is documents that have multiple
dcterms:subject keywords, or articles that have multiple authors).
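The effect of allowing multiple values can be sketched as follows. This is an illustration only: the row layout (a list of subjects per document), the helper name group_by_values, and the choice to keep just the first value when multiple values are disabled are all assumptions of the sketch, not documented Ferenda behaviour.

```python
from collections import defaultdict

# Hypothetical rows where one document may carry several subject keywords.
rows = [
    {"uri": "http://example.org/doc/1", "subject": ["tax", "trade"]},
    {"uri": "http://example.org/doc/2", "subject": ["tax"]},
]


def group_by_values(rows, binding, multiple_values):
    """Group documents by each value of `binding`.

    With multiple_values=True a document is placed in one group per value.
    Otherwise only its first value is used (an assumption of this sketch).
    """
    groups = defaultdict(list)
    for row in rows:
        values = row.get(binding, [])
        if not multiple_values:
            values = values[:1]
        for value in values:
            groups[value].append(row["uri"])
    return dict(groups)


many = group_by_values(rows, "subject", multiple_values=True)
single = group_by_values(rows, "subject", multiple_values=False)
```

With multiple values enabled, the first document shows up under both "tax" and "trade"; with it disabled, it appears only once.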
Combining facets from different docrepos
Facets that map to the same fulltextindex field must be equal. The
rules for equality: If the
rdftype and the
selector are equal, then the facets are equal. Note that
selector functions are only equal if they are the same function
object, ie it’s not just enough that they are two functions that work