Grouping documents with facets¶
A collection of documents can be arranged in a set of groups, such as
by year of publication, by document author, or by keyword. With Ferenda,
each such method of grouping is described in the form of a
Facet
. By providing a list of Facet objects in
its facets()
method, your docrepo
can specify multiple ways of arranging the documents it’s
handling. These facets are used to construct a static Table of
contents for your site, as well as creating Atom feeds of all
documents and defining the fields available for querying when using
the REST API.
A facet object is initialized with a set of parameters that, taken together, define the method of grouping. These include the RDF predicate that contains the data used for grouping, the datatype to be used for that data, functions (or other callables) that sorts the data into discrete groups, and other parameters that affect eg. the sorting order or if a particular facet is used in a particular context.
Applying facets¶
Facets are used in several different contexts (see below) but the general steps for applying them are similar. First, all the data that might be needed by the total set of facets is collected. This is normally done by querying the triple store for it. Each facet contains information about which RDF predicate
Once this set of data is retrieved, as a giant table with one row for each resource (document), each facet is used to create a set of groups and place each document in zero or more of these groups.
Selectors and identificators¶
The grouping is primarily done through a selector function. The selector function recieves three arguments:
- a dict with some basic information about one document (corresponding to one row),
- the name of the current facet (binding), and
- optionally some repo-dependent extra data in the form of an RDF graph.
It should return a single string, which should be a human-readable
label for a grouping. The selector is called once for every document
in the docrepo, and each document is sorted in one (or more, see
below) group identified by that string. As a simple example, a
selector may group documents into years of publication by finding the
date of the dcterms:issued
property and extracting the year part
of it. The string returned by the should be suitable for end-user
display.
Each facet also has a similar function called the identificator function. It recieves the same arguments as the selector function, but should return a string that is well suited for eg. a URI fragment, ie. not contain spaces or non-ascii characters.
The Facet
class has a number of classmethods that
can act as selectors and/or identificators.
Contexts where facets are used¶
Table of contents¶
Each docrepo will have their own set of Table of contents pages. The
TOC for a docrepo will contain one set of pages for each defined
facet, unless use_for_toc
is set to False
.
Atom feeds¶
Each docrepo will have a set of feedsets, where each feedset is based
on a facet (only those that has the property use_for_feed
set to
True
). The structure of each feedset will mirror the structure of
each set of TOC pages, and re-uses the same selector and identificator
methods. It makes sense to have a separate feed for eg. each publisher
or subject matter in a repository that comprises a reasonable amount
of publishers and subject matters (using dcterms:publisher
or
dcterms:subject
as the base for facets), but it does not make much
sense to eg. have a feed for all documents published in 1975 (using
dcterms:published
as the base for a facet). Therefore, the default
value for use_for_feed
is False
.
Furthermore, a “main” feedset with a single feed containing all documents is also constructed.
The feeds are always sorted by the updated property (most recent
updated first), taken from the corresponding
DocumentEntry
object.
The fulltext index¶
The metadata that each facet uses is stored as a separate field in the fulltext index. Facet can specify exactly how a particular facet should be stored (ie if the field should be boosted in any particular way). Note that the data stored in the fulltext index is not passed through the selector function, the original RDF data is stored as-is.
The ReST API¶
The ReST API uses all defined facets for all repos
simultaneously. This means that you can query eg. all documents
published in a certain year, and get results from all docrepos. This
requires that the defined facets don’t clash, eg. that you don’t have
two facets based on dcterms:publisher
where one uses URI
references and the other uses.
Grouping a document in several groups¶
If a docrepo uses a facet that has multiple_values
set to
True
, it’s possible for that facet to categorize the document in
more than one group (a typical usecase is documents that have multiple
dcterms:subject
keywords, or articles that have multiple
dcterms:creator
authors).
Combining facets from different docrepos¶
Facets that map to the same fulltextindex field must be equal. The
rules for equality: If the rdftype
and the dimension_type
and
dimension_label
and selector
is equal, then the facets are
equal. selector
functions are only equal if they are the same function
object, ie it’s not just enough that they are two functions that work
identically.