The `DocumentRepository` class¶

class ferenda.DocumentRepository(config=None, **kwargs)[source]¶

Base class for downloading, parsing and generating HTML versions of a repository of documents.

Start building your application by subclassing this class, and then override methods in order to customize the downloading, parsing and generation behaviour.

Parameters:	**kwargs – Any named argument overrides any similarly-named configuration file parameter.

Example:

>>> class MyRepo(DocumentRepository):
...     alias="myrepo"
...
>>> d = MyRepo(datadir="/tmp/ferenda")
>>> d.store.downloaded_path("mybasefile").replace(os.sep,'/')
'/tmp/ferenda/myrepo/downloaded/mybasefile.html'

Note

This class has a ridiculous number of properties and methods that you can override to control most of Ferenda's behaviour in all stages. For basic usage, you need only a fraction of them. Please don’t be intimidated/horrified.

downloaded_suffix = '.html'¶: File suffix for the main document format. Determines the suffix of downloaded files.

storage_policy = 'file'¶: Some repositories have documents in several formats, documents split amongst several files, or embedded resources. If storage_policy is set to dir, then each document gets its own directory (the default filename being index + suffix); otherwise each document is stored as a file in a directory with other files. Affects ferenda.DocumentStore.path() (and therefore all other *_path methods).
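
As a rough illustration of the difference between the two policies, the following sketch computes where a document might end up on disk. It is not Ferenda's implementation: `sketch_path` and its parameters are hypothetical names, and the layout is inferred from the description above and from the class example.

```python
import posixpath


def sketch_path(datadir, alias, maindir, basefile, suffix, storage_policy="file"):
    """Return an illustrative on-disk path for a document (hypothetical helper)."""
    base = posixpath.join(datadir, alias, maindir, basefile)
    if storage_policy == "dir":
        # Each document gets its own directory; the main file is index + suffix.
        return posixpath.join(base, "index" + suffix)
    # Otherwise the document is a single file stored next to other documents.
    return base + suffix


file_path = sketch_path("/tmp/ferenda", "myrepo", "downloaded", "mybasefile", ".html")
dir_path = sketch_path("/tmp/ferenda", "myrepo", "downloaded", "mybasefile", ".html", "dir")
```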

alias = 'base'¶: A short name for the class, used by the command line ferenda-build.py tool. Also determines where to store downloaded, parsed and generated files. When you subclass DocumentRepository you must override this.

namespaces = ['rdf', 'rdfs', 'xsd', 'xsi', 'dcterms', 'skos', 'foaf', 'xhv', 'owl', 'prov', 'bibo']¶

The namespaces that are included in the XHTML and RDF files generated by parse(). This can be a list of strings, in which case the strings are assumed to be well-known prefixes to established namespaces, or a list of (prefix, namespace) tuples. All well-known prefixes are available in ferenda.util.ns.

If you specify a namespace for a well-known ontology/vocabulary, that ontology will be available as a Graph from the ontologies property.

required_predicates = [rdflib.term.URIRef('http://www.w3.org/1999/02/22-rdf-syntax-ns#type')]¶: A list of RDF predicates that should be present in the outdata. If any of these are missing from the result of parse(), a warning is logged. You can add to this list as a form of simple validation of your parsed data.

start_url = 'http://example.org/'¶: The main entry page for the remote web store of documents. May be a list of documents, a search form or whatever. If it’s something more complicated than a simple list of documents, you need to override download() in order to tell which documents are to be downloaded.

document_url_template = 'http://example.org/docs/%(basefile)s.html'¶: A string template for creating URLs for individual documents on the remote web server. Directly used by remote_url() and indirectly by download_single().

document_url_regex = 'http://example.org/docs/(?P<basefile>\\w+).html'¶: A regex that matches URLs for individual documents – the reverse of what document_url_template is used for. Used by download() to find suitable links if basefile_regex doesn’t match. Must define the named group basefile using the (?P<basefile>...) syntax

basefile_regex = '^ID: ?(?P<basefile>[\\w\\d\\:\\/]+)$'¶: A regex for matching document names in link text, as used by download(). Must define a named group basefile, just like document_url_regex.
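
The relationship between these three attributes can be sketched with the standard library alone. The snippet below only uses the default values quoted above together with `re` and `%` formatting; it illustrates the direction each attribute works in and is not how download() itself is implemented.

```python
import re

# Default values copied from the attributes documented above.
document_url_template = "http://example.org/docs/%(basefile)s.html"
document_url_regex = r"http://example.org/docs/(?P<basefile>\w+).html"
basefile_regex = r"^ID: ?(?P<basefile>[\w\d\:\/]+)$"

# Template: basefile -> URL
url = document_url_template % {"basefile": "abc123"}

# URL regex: URL -> basefile (the reverse direction)
basefile_from_url = re.match(document_url_regex, url).group("basefile")

# Link-text regex: link text -> basefile
basefile_from_text = re.match(basefile_regex, "ID: 123/a").group("basefile")
```

Note that the default document_url_regex uses `\w+`, so it would not match a basefile containing a slash, whereas the default basefile_regex does.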

rdf_type = rdflib.term.URIRef('http://xmlns.com/foaf/0.1/Document')¶: The RDF type of the documents you are handling (expressed as a rdflib.term.URIRef object).

Note

If your repo produces documents of several different types, you can define this as a list (or other iterable) of URIRef objects. faceted_data() will only find documents that are of any of these types.

source_encoding = 'utf-8'¶: The character set that the source HTML documents use (if applicable).

lang = 'en'¶: The language which the source documents are assumed to be written in (unless otherwise specified), and the language which output document should use.

parse_content_selector = 'body'¶: CSS selector used to select the main part of the document content by the default parse() implementation.

parse_filter_selectors = ['script']¶: CSS selectors used to filter/remove certain parts of the document content by the default parse() implementation.

xslt_template = 'res/xsl/generic.xsl'¶: A template used by generate() to transform the XML file into browser-ready HTML. If your document type is complex, you might want to override this (and write your own XSLT transform). You should include base.xslt in that template, though.

sparql_annotations = 'res/sparql/annotations.rq'¶: A template SPARQL CONSTRUCT query for document annotations.

documentstore_class¶: alias of ferenda.documentstore.DocumentStore

ontologies¶

Provides a Graph loaded with the ontologies/vocabularies that this docrepo uses (as determined by the namespaces property).

If you’re using your own vocabularies, you can place them (in Turtle format) as res/vocab/[prefix].ttl to have them loaded into the graph.

Note

Some system-like vocabularies (rdf, rdfs and owl) are never loaded into the graph.

commondata¶: Provides a Graph containing any extra data that is common to documents in this docrepo – this can be information about the different entities that publish the documents, the printed series in which they’re published, and so on. The data is taken from res/extra/[repoalias].ttl.

config¶: The LayeredConfig object that contains the current configuration for this docrepo instance. You can read or write individual properties of this object, or replace it with a new LayeredConfig object entirely.

lookup_resource(label, predicate=rdflib.term.URIRef('http://xmlns.com/foaf/0.1/name'), cutoff=0.8, warn=True)[source]¶

Given a textual identifier (i.e. the name for something), look up the canonical URI for that thing in the RDF graph containing extra data (i.e. the graph that commondata provides). The graph should have a foaf:name statement about the URI with the sought label as the object.

Since data is imperfect, the textual label may be spelled or expressed differently in different contexts. This method therefore performs fuzzy matching (using difflib.get_close_matches()); the cutoff parameter determines exactly how fuzzy this matching is.

If no resource matches the given label, a KeyError is raised.

Parameters:

label (str) – The textual label to look up
predicate (rdflib.term.URIRef) – The RDF predicate to use when looking for the label
cutoff (float) – How fuzzy the matching may be (1 = must match exactly, 0 = anything goes)
warn (bool) – Whether to log a warning when an inexact match is performed

Returns:	The matching resource
Return type:	rdflib.term.URIRef
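
The fuzzy matching step can be sketched as follows. This is a simplified illustration, not the library's code: a plain dict stands in for the commondata graph, and `sketch_lookup` is a hypothetical name. Only `difflib.get_close_matches()` and the KeyError behaviour are taken from the description above.

```python
import difflib


def sketch_lookup(label, labels_to_uris, cutoff=0.8):
    """Fuzzy-match label against known labels and return the associated URI.

    labels_to_uris is a plain dict standing in for the commondata graph
    (an assumption made for illustration).
    """
    matches = difflib.get_close_matches(label, list(labels_to_uris), n=1, cutoff=cutoff)
    if not matches:
        raise KeyError(label)
    return labels_to_uris[matches[0]]


known = {
    "Department of Justice": "http://example.org/org/justice",
    "Ministry of Finance": "http://example.org/org/finance",
}
# A misspelled label still resolves because it is close enough to a known one.
uri = sketch_lookup("Departmant of Justice", known)
```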

get_default_options()[source]¶

Returns the class's default configuration properties. These can be overridden by a configuration file, or by named arguments to __init__(). See Configuration for a list of standard configuration properties (your subclass is free to define and use additional configuration properties).

Returns:	default configuration properties
Return type:	dict
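
The override order described above (defaults, then configuration file, then named arguments) can be pictured with a `collections.ChainMap`. This is only a stand-in for the real configuration object, and the keys and values below are illustrative assumptions.

```python
from collections import ChainMap

defaults = {"datadir": "data", "loglevel": "INFO"}  # illustrative class defaults
from_file = {"loglevel": "DEBUG"}                   # illustrative config file values
from_kwargs = {"datadir": "/tmp/ferenda"}           # named arguments to __init__()

# Earlier mappings win: kwargs override the file, which overrides the defaults.
effective = ChainMap(from_kwargs, from_file, defaults)
```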

classmethod setup(action, config)[source]¶: Runs before any of the *_all methods start executing. It just calls the appropriate setup method, i.e. if action is parse, then this method calls parse_all_setup (if defined) with the config object as its single parameter.

classmethod teardown(action, config)[source]¶: Runs after any of the *_all methods have finished executing. It just calls the appropriate teardown method, i.e. if action is parse, then this method calls parse_all_teardown (if defined) with the config object as its single parameter.
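
The dispatch performed by setup() and teardown() can be sketched with `getattr`. `SketchRepo` below is a stand-in class written for this illustration, not ferenda.DocumentRepository, and the real implementation may differ in detail.

```python
class SketchRepo:
    """Stand-in class; not ferenda.DocumentRepository."""

    calls = []

    @classmethod
    def setup(cls, action, config):
        # Look up e.g. parse_all_setup for action == "parse"; skip if undefined.
        method = getattr(cls, action + "_all_setup", None)
        if method is not None:
            return method(config)

    @classmethod
    def parse_all_setup(cls, config):
        cls.calls.append(("parse_all_setup", config))


SketchRepo.setup("parse", {"datadir": "data"})     # dispatches to parse_all_setup
SketchRepo.setup("download", {"datadir": "data"})  # no download_all_setup: no-op
```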

get_archive_version(basefile)[source]¶

Get a version identifier for the current version of the document identified by basefile.

The default implementation simply increments the most recent archived version identifier, starting at “1”. If versions in your docrepo are normally identified in some other way (such as SCM revision numbers, dates or similar) you should override this method to return those identifiers.

Parameters:	basefile (str) – The basefile of the document to archive
Returns:	The version identifier for the current version of the document.
Return type:	str
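
A minimal sketch of the default numbering scheme, assuming the already archived versions are available as a list of strings (how they are discovered on disk is left out, and `sketch_next_version` is a hypothetical name):

```python
def sketch_next_version(archived_versions):
    """Return the next version identifier given the existing ones."""
    if not archived_versions:
        return "1"  # first archived version
    # Compare numerically so that "10" sorts after "2".
    return str(max(int(v) for v in archived_versions) + 1)
```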

qualified_class_name()[source]¶

The qualified class name of this class

Returns:	class name (e.g. `ferenda.DocumentRepository`)
Return type:	str

canonical_uri(basefile)[source]¶

The canonical URI for the document identified by basefile.

Returns:	The canonical URI
Return type:	str

dataset_uri(param=None, value=None)[source]¶

Returns the URI that identifies the dataset that this docrepository provides. The default implementation is based on the url config parameter and the alias attribute of the class, e.g. http://localhost:8000/dataset/base.

Parameters:

param – An optional parameter name representing a way of creating a subset of the dataset (e.g. all documents whose title starts with a particular letter)
value – A value for param (e.g. “a”)

>>> d = DocumentRepository()
>>> d.alias
'base'
>>> d.config.url = "http://example.org/"
>>> d.dataset_uri()
'http://example.org/dataset/base'
>>> d.dataset_uri("title","a")
'http://example.org/dataset/base?title=a'

basefile_from_uri(uri)[source]¶

The reverse of canonical_uri(). Returns None if the uri doesn’t map to a basefile in this repo.

>>> d = DocumentRepository()
>>> d.alias
'base'
>>> d.config.url = "http://example.org/"
>>> d.basefile_from_uri("http://example.org/res/base/123/a")
'123/a'
>>> d.basefile_from_uri("http://example.org/res/base/123/a#S1")
'123/a'
>>> d.basefile_from_uri("http://example.org/res/other/123/a") # None
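
The mapping shown in the doctest can be approximated with plain string handling. The sketch below is an assumption-laden illustration: the `res/` path segment is inferred from the example URIs above, and `sketch_basefile_from_uri` is a hypothetical name.

```python
def sketch_basefile_from_uri(uri, base_url="http://example.org/", alias="base"):
    """Map a canonical URI back to a basefile, or None if it is not ours."""
    prefix = base_url + "res/" + alias + "/"
    if not uri.startswith(prefix):
        return None
    # Drop any fragment identifier such as "#S1".
    return uri[len(prefix):].split("#", 1)[0]
```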

dataset_params_from_uri(uri)[source]¶

Given a parametrized dataset URI, return the parameter and value used (or an empty tuple, if it is a dataset URI handled by this repo, but without any parameters).

>>> d = DocumentRepository()
>>> d.alias
'base'
>>> d.config.url = "http://example.org/"
>>> d.dataset_params_from_uri("http://example.org/dataset/base?title=a")
('title', 'a')
>>> d.dataset_params_from_uri("http://example.org/dataset/base")
()
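
For illustration, building and parsing such URIs can be sketched with `urllib.parse`. The helper names are hypothetical, and unlike the real method this sketch does not check whether the URI belongs to the repo at all.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit


def sketch_dataset_uri(base_url, alias, param=None, value=None):
    """Build a dataset URI, optionally restricted by one parameter."""
    uri = base_url + "dataset/" + alias
    if param is not None and value is not None:
        uri += "?" + urlencode({param: value})
    return uri


def sketch_params_from_uri(uri):
    """Return the first (param, value) pair of the query string, or ()."""
    pairs = parse_qsl(urlsplit(uri).query)
    return pairs[0] if pairs else ()
```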

download(basefile=None)[source]¶

Downloads all documents from a remote web service.

The default generic implementation assumes that all documents are linked from a single page (which has the url of start_url), that they all have URLs matching the document_url_regex or that the link text is always equal to basefile (as determined by basefile_regex). If these assumptions don’t hold, you need to override this method.

If you do override it, your download method should read and set the lastdownload parameter to either the datetime of the last download or any other module-specific string (id number or similar).

You should also read the refresh parameter. If it is True (the default), then you should call download_single() for every basefile you encounter, even though they may already exist in some form on disk. download_single() will normally be using conditional GET to see if there is a newer version available.

See Writing your own download implementation for more details.

Returns:	True if any document was downloaded, False otherwise.
Return type:	bool

download_get_basefiles(source)[source]¶: Given source (an iterator that provides (element, attribute, link, pos) tuples, like lxml.etree.iterlinks()), generate tuples (basefile, link) for all document links found in source.

download_single(basefile, url=None)[source]¶

Downloads the document from the web. Unless explicitly specified, the URL to download is determined by document_url_template combined with basefile; the location on disk is determined by the function downloaded_path().

If the document exists on disk, but the version on the web is unchanged (determined using a conditional GET), the file on disk is left unchanged (i.e. the timestamp is not modified).

Parameters:

basefile (str) – The basefile of the document to download
url (str) – The URL to download (optional)

Returns:	`True` if the document was downloaded and stored on disk, `False` if the file on disk was not updated.
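
A conditional GET asks the server to send the document only if it has changed; an unchanged document is answered with 304 Not Modified and no body. The sketch below shows one way the request headers could be built from the file already on disk. It assumes the `If-Modified-Since` header is used (an ETag-based check is equally possible), and `sketch_conditional_headers` is a hypothetical helper, not part of Ferenda.

```python
import os
from email.utils import formatdate


def sketch_conditional_headers(path):
    """Build request headers for a conditional GET (hypothetical helper)."""
    if not os.path.exists(path):
        return {}  # nothing on disk yet: perform a plain GET
    # HTTP dates are expressed in GMT, e.g. "Thu, 01 Jan 1970 00:00:00 GMT".
    return {"If-Modified-Since": formatdate(os.path.getmtime(path), usegmt=True)}
```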

download_if_needed(url, basefile, archive=True, filename=None, sleep=1)[source]¶

Downloads a remote resource to a local file. If a different version is already in place, archive that old version.

Parameters:

url (str) – The url to download
basefile (str) – The basefile of the document to download
archive (bool) – Whether to archive existing older versions of the document, or just delete the previously downloaded file.
filename (str) – The filename to download to. If not provided, the filename is derived from the supplied basefile

Returns:	True if the local file was updated (and archived), False otherwise.
Return type:	bool

download_is_different(existing, new)[source]¶: Returns True if the new file is semantically different from the existing file.

remote_url(basefile)[source]¶

Get the URL of the source document at its remote location, unless the source document is fetched by other means or if it cannot be computed from basefile only. The default implementation uses document_url_template to calculate the url.

Example:

>>> d = DocumentRepository()
>>> d.remote_url("123/a")
'http://example.org/docs/123/a.html'
>>> d.document_url_template = "http://mysite.org/archive/%(basefile)s/"
>>> d.remote_url("123/a")
'http://mysite.org/archive/123/a/'

Parameters:	basefile (str) – The basefile of the source document
Returns:	The remote url where the document can be fetched, or `None`.
Return type:	str

generic_url(basefile, maindir, suffix)[source]¶

Analogous to ferenda.DocumentStore.path(), calculate the full local url for the given basefile and stage of processing.

Parameters:

basefile (str) – The basefile for which to calculate the local url
maindir (str) – The processing stage directory (normally `downloaded`, `parsed`, or `generated`)
suffix (str) – The file extension including period (i.e. `.txt`, not `txt`)

Returns:	The local url
Return type:	str

downloaded_url(basefile)[source]¶

Get the full local url for the downloaded file for the given basefile.

Parameters:	basefile (str) – The basefile for which to calculate the local url
Returns:	The local url
Return type:	str

>>> d = DocumentRepository()
>>> d.downloaded_url("123/a")
'http://localhost:8000/base/downloaded/123/a.html'

classmethod parse_all_setup(config)[source]¶: Runs any action needed prior to parsing all documents in a docrepo. The default implementation does nothing.

Note

This is a classmethod for now (and that’s why a config object is passed as an argument), but might change to an instance method.

classmethod parse_all_teardown(config)[source]¶: Runs any cleanup action needed after parsing all documents in a docrepo. The default implementation does nothing.

Note

Like parse_all_setup() this might change to an instance method.

parseneeded(basefile)[source]¶: Returns True iff there is a need to parse the given basefile. If the resulting parsed file exists and is newer than the downloaded file, there is typically no reason to parse the file.
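
The timestamp comparison described above can be sketched like this. It is a simplified illustration that takes explicit file paths instead of a basefile; `sketch_parseneeded` is a hypothetical name and the real method may take further settings into account.

```python
import os


def sketch_parseneeded(downloaded_path, parsed_path):
    """Return True if the parsed file is missing or not newer than its source."""
    if not os.path.exists(parsed_path):
        return True
    # Re-parse only when the parsed file is not newer than the downloaded file.
    return os.path.getmtime(parsed_path) <= os.path.getmtime(downloaded_path)
```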

parse(doc)[source]¶

Parse downloaded documents into structured XML and RDF.

It will also save the same RDF statements in a separate RDF/XML file.

You will need to provide your own parsing logic, but often it’s easier to just override parse_from_soup (assuming your indata is in an HTML format parseable by BeautifulSoup) and let the base class read and write the files.

If your data is not in an HTML format, or BeautifulSoup is not an appropriate parser to use, override this method.

Parameters:	doc (ferenda.Document) – The document object to fill in.

parse_entry_update(doc)[source]¶

parse_entry_title(doc)[source]¶

soup_from_basefile(basefile, encoding='utf-8', parser='lxml')[source]¶

Load the downloaded document for basefile into a BeautifulSoup object

Parameters:

basefile (str) – The basefile for the downloaded document to parse
encoding (str) – The encoding of the downloaded document

Returns:	The parsed document as a `BeautifulSoup` object

Note

Helper function. You probably don’t need to override it.

parse_metadata_from_soup(soup, doc)[source]¶

Given a BeautifulSoup document, retrieve all document-level metadata from it and put it into the given doc object’s meta property.

Note

The default implementation sets rdf:type, dcterms:title, dcterms:identifier and prov:wasGeneratedBy properties in doc.meta, as well as setting the language of the document in doc.lang.

Parameters:

soup – A parsed document, as a `BeautifulSoup` object
doc (ferenda.Document) – Our document

Returns:	None

parse_document_from_soup(soup, doc)[source]¶

Given a BeautifulSoup document, convert it into the provided doc object’s body property as suitable ferenda.elements objects.

Note

The default implementation respects parse_content_selector and parse_filter_selectors.

Parameters:

soup – A parsed document as a `BeautifulSoup` object
doc (ferenda.Document) – Our document

Returns:	None

patch_if_needed(basefile, text)[source]¶

Given basefile and the entire text of the downloaded or intermediate document, checks whether a patch file exists under self.config.patchdir and, if so, applies it. Returns (patchedtext, patchdescription) if a patch was applied, (text, None) otherwise.

Parameters:

basefile (str) – The basefile of the text
text (bytes) – The text to be patched

make_document(basefile=None)[source]¶

Create a Document object with basic initialized fields.

Note

Helper method used by the makedocument() decorator.

Parameters:	basefile (str) – The basefile for the document
Return type:	ferenda.Document

make_graph()[source]¶

Initialize an rdflib Graph object with proper namespace prefix bindings (as determined by namespaces).

Return type:	rdflib.Graph

create_external_resources(doc)[source]¶

Optionally create external files that go together with the parsed file (stylesheets, images, etc).

The default implementation does nothing.

Parameters:	doc (ferenda.Document) – The document

render_xhtml(doc, outfile=None)[source]¶

Renders the parsed object structure as an XHTML file with RDFa attributes (also returns the same XHTML as a string).

Parameters:

doc (ferenda.Document) – The document to render
outfile (str) – The file name for the XHTML document

Returns:	The XHTML document
Return type:	str

render_xhtml_tree(doc)[source]¶

Renders the parsed object structure as an lxml.etree._Element object.

Parameters:	doc (ferenda.Document) – The document to render
Returns:	The XHTML document as an lxml structure
Return type:	lxml.etree._Element

parsed_url(basefile)[source]¶

Get the full local url for the parsed file for the given basefile.

Parameters:	basefile (str) – The basefile for which to calculate the local url
Returns:	The local url
Return type:	str

distilled_url(basefile)[source]¶

Get the full local url for the distilled RDF/XML file for the given basefile.

Parameters:	basefile (str) – The basefile for which to calculate the local url
Returns:	The local url
Return type:	str

classmethod relate_all_setup(config)[source]¶

Runs any cleanup action needed prior to relating all documents in a docrepo. The default implementation clears the corresponding context (see dataset_uri()) in the triple store.

Note

Like parse_all_setup() this might change to an instance method.

Returns False if no relation needs to be done (as determined by the timestamp on the dump nt file).

classmethod relate_all_teardown(config)[source]¶: Runs any cleanup action needed after relating all documents in a docrepo. The default implementation dumps all RDF data loaded into the triplestore into one giant N-Triples file.

Note

Like parse_all_setup() this might change to an instance method.

relate(basefile, otherrepos=[])[source]¶: Runs various indexing operations for the document represented by basefile: insert RDF statements into a triple store, add this document to the dependency list of all documents that it refers to, and put the text of the document into a fulltext index.

relate_triples(basefile, removesubjects=False)[source]¶

Insert the (previously distilled) RDF statements into the triple store.

Parameters:

basefile (str) – The basefile for the document containing the RDF statements.
removesubjects (bool) – Whether to remove all identified subjects from the triplestore beforehand (to clear the previous version of this basefile’s metadata). FIXME: not yet used

Returns:	None

relate_dependencies(basefile, repos=[])[source]¶: For each document that the basefile document refers to, attempt to find this document in the current or any other docrepo, and add the parsed document path to that document's dependency file.

add_dependency(basefile, dependencyfile)[source]¶: Add the dependencyfile to basefile's dependency file. Returns True if anything new was added, False otherwise.
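
The "returns True only if something new was added" behaviour can be sketched as below. The on-disk format (one path per line) and the helper name `sketch_add_dependency` are assumptions made for this illustration; the sketch also takes the dependency file's path directly instead of deriving it from a basefile.

```python
import os


def sketch_add_dependency(depsfile, dependencyfile):
    """Record dependencyfile in depsfile, one path per line (assumed format)."""
    existing = []
    if os.path.exists(depsfile):
        with open(depsfile) as fp:
            existing = fp.read().splitlines()
    if dependencyfile in existing:
        return False  # already recorded, nothing new added
    with open(depsfile, "a") as fp:
        fp.write(dependencyfile + "\n")
    return True
```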

relate_fulltext(basefile, repos=None)[source]¶

Index the text of the document into the fulltext index. Also indexes all metadata that facets() indicates should be indexed.

Parameters:	basefile (str) – The basefile for the document to be indexed.
Returns:	None

facets()[source]¶

Provides a list of Facet objects that specify how documents in your docrepo should be grouped.

Override this if you want to specify your own way of grouping data in your docrepo.

faceted_data()[source]¶: Provides a list of dicts, each containing a row of information about a single document in the repository. The exact fields provided are controlled by the list of Facet objects returned by facets().

Note

The same document can occur multiple times if any of its facets have multiple_values set, once for each different value that the facet has.

facet_query(context)[source]¶

Constructs a SPARQL SELECT query that fetches all information needed to create faceted data.

Parameters:	context (str) – The context (named graph) to which to limit the query.
Returns:	The SPARQL query
Return type:	str

Example:

>>> d = DocumentRepository()
>>> expected = """PREFIX dcterms: <http://purl.org/dc/terms/>
... PREFIX foaf: <http://xmlns.com/foaf/0.1/>
... PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
... 
... SELECT DISTINCT ?uri ?rdf_type ?dcterms_title ?dcterms_publisher ?dcterms_identifier ?dcterms_issued
... FROM <http://example.org/ctx/base>
... WHERE {
...     ?uri rdf:type foaf:Document .
...     OPTIONAL { ?uri rdf:type ?rdf_type . }
...     OPTIONAL { ?uri dcterms:title ?dcterms_title . }
...     OPTIONAL { ?uri dcterms:publisher ?dcterms_publisher . }
...     OPTIONAL { ?uri dcterms:identifier ?dcterms_identifier . }
...     OPTIONAL { ?uri dcterms:issued ?dcterms_issued . }
... 
... }"""
>>> d.facet_query("http://example.org/ctx/base") == expected
True

facet_select(query)[source]¶

Select all data from the triple store needed to create faceted data.

Parameters:	context (str) – The context (named graph) to restrict the query to. If None, search entire triplestore.
Returns:	The results of the query, as python objects
Return type:	set of dicts

classmethod generate_all_setup(config)[source]¶: Runs any action needed prior to generating all documents in a docrepo. The default implementation does nothing.

Note

Like parse_all_setup() this might change to an instance method.

classmethod generate_all_teardown(config)[source]¶: Runs any cleanup action needed after generating all documents in a docrepo. The default implementation does nothing.

Note

Like parse_all_setup() this might change to an instance method.

generate(basefile, otherrepos=[])[source]¶

Generate a browser-ready HTML file from structured XML and RDF.

Uses the XML and RDF files constructed by ferenda.DocumentRepository.parse().

The generation is done by XSLT, and normally you won’t need to override this, but you might want to provide your own xslt file and set ferenda.DocumentRepository.xslt_template to the name of that file.

If you want to generate your browser-ready HTML by any other means than XSLT, you should override this method.

Parameters:	basefile (str) – The basefile for which to generate HTML
Returns:	None

get_url_transform_func(repos, basedir)[source]¶: Returns a function that, when called with a URI, transforms that URI to another suitable reference. This can be used to e.g. map between canonical URIs and local URIs. The function is run on all URIs in a post-processing step after generate() runs. The default implementation maps URIs to local file paths, and is only run if config.staticsite is True.

prep_annotation_file(basefile)[source]¶

Helper function used by generate() – prepares an RDF/XML file containing statements that in some way annotate the information found in the document that generate handles, like URI/title of other documents that refer to this one.

Parameters:	basefile (str) – The basefile for which to collect annotating statements.
Returns:	The full path to the prepared RDF/XML file
Return type:	str

construct_annotations(uri)[source]¶: Construct an RDF graph containing metadata by running the query provided by construct_sparql_query().

construct_sparql_query(uri)[source]¶: Construct a SPARQL query that will select metadata relating to uri in some way, using the query template specified by sparql_annotations.

graph_to_annotation_file(graph)[source]¶

Converts an RDFLib graph into an XML file with the same statements, ordered using the Grit format (https://code.google.com/p/oort/wiki/Grit) for easier XSLT inclusion.

Parameters:	graph (rdflib.graph.Graph) – The graph to convert
Returns:	A serialized XML document with the RDF statements
Return type:	str

annotation_file_to_graph(annotation_file)[source]¶

Converts an annotation file (using the Grit format) back into an RDFLib graph.

Parameters:	annotation_file (str) – The filename of a serialized XML document with RDF statements
Returns:	The RDF statements as a regular graph
Return type:	rdflib.Graph

generated_url(basefile)[source]¶

Get the full local url for the generated file for the given basefile.

Parameters:	basefile (str) – The basefile for which to calculate the local url
Returns:	The local url
Return type:	str

toc(otherrepos=[])[source]¶

Creates a set of pages that together act as a table of contents for all documents in the repository. For smaller repositories a single page might be enough, but for repositories with a few hundred documents or more, there will usually be one page for all documents starting with A, one for those starting with B, and so on. There might be different ways of browsing/drilling down, e.g. by title, publication year, keyword and so on.

The default implementation calls faceted_data() to get all data from the triple store, facets() to find out the facets for ordering, toc_pagesets() to calculate the total set of TOC html files, toc_select_for_pages() to create a list of documents for each TOC html file, and finally toc_generate_pages() to create the HTML files. The default implementation assumes that documents have a title (in the form of a dcterms:title property) and a publication date (in the form of a dcterms:issued property).

You can override any of these methods to customize any part of the toc generation process. Often overriding facets() to specify other document properties will be sufficient.

toc_pagesets(data, facets)[source]¶

Calculate the set of needed TOC pages based on the result rows

Parameters:

data – list of dicts, each dict containing metadata about a single document
facets – list of Facet objects

Returns:	A set of Pageset objects
Return type:	list

Example:

>>> d = DocumentRepository()
>>> from rdflib.namespace import DCTERMS
>>> rows = [{'uri':'http://ex.org/1','dcterms_title':'Abc','dcterms_issued':'2009-04-02'},
...         {'uri':'http://ex.org/2','dcterms_title':'Abcd','dcterms_issued':'2010-06-30'},
...         {'uri':'http://ex.org/3','dcterms_title':'Dfg','dcterms_issued':'2010-08-01'}]
>>> facets = [Facet(DCTERMS.title), Facet(DCTERMS.issued)]
>>> pagesets=d.toc_pagesets(rows,facets)
>>> pagesets[0].label
'Sorted by title'
>>> pagesets[0].pages[0]
<TocPage binding=dcterms_title linktext=a title=Documents starting with "a" value=a>
>>> pagesets[0].pages[0].linktext
'a'
>>> pagesets[0].pages[0].title
'Documents starting with "a"'
>>> pagesets[0].pages[0].binding
'dcterms_title'
>>> pagesets[0].pages[0].value
'a'
>>> pagesets[1].label
'Sorted by publication year'
>>> pagesets[1].pages[0]
<TocPage binding=dcterms_issued linktext=2009 title=Documents published in 2009 value=2009>

toc_select_for_pages(data, pagesets, facets)[source]¶

Go through all data rows (each row representing a document) and, for each toc page, select those documents that are to appear in a particular page.

Example:

>>> d = DocumentRepository()
>>> rows = [{'uri':'http://ex.org/1','dcterms_title':'Abc','dcterms_issued':'2009-04-02'},
...         {'uri':'http://ex.org/2','dcterms_title':'Abcd','dcterms_issued':'2010-06-30'},
...         {'uri':'http://ex.org/3','dcterms_title':'Dfg','dcterms_issued':'2010-08-01'}]
>>> from rdflib.namespace import DCTERMS
>>> facets = [Facet(DCTERMS.title), Facet(DCTERMS.issued)]
>>> pagesets=d.toc_pagesets(rows,facets)
>>> expected={('dcterms_title','a'):[[Link('Abc',uri='http://ex.org/1')],
...                                  [Link('Abcd',uri='http://ex.org/2')]],
...           ('dcterms_title','d'):[[Link('Dfg',uri='http://ex.org/3')]],
...           ('dcterms_issued','2009'):[[Link('Abc',uri='http://ex.org/1')]],
...           ('dcterms_issued','2010'):[[Link('Abcd',uri='http://ex.org/2')],
...                                      [Link('Dfg',uri='http://ex.org/3')]]}
>>> d.toc_select_for_pages(rows, pagesets, facets) == expected
True

Parameters:

data – List of dicts as returned by `faceted_data()`
pagesets – Result from `toc_pagesets()`
facets – Result from `facets()`

Returns:	mapping between toc basefile and documentlist for that basefile
Return type:	dict
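
The grouping idea behind this step can be sketched without any Ferenda classes. The snippet below hard-codes the two default groupings (first letter of the title, publication year) and collects plain URIs instead of Link elements; `sketch_select_for_pages` is a hypothetical name and the real method is driven by the Facet objects instead.

```python
from collections import defaultdict


def sketch_select_for_pages(rows):
    """Group document URIs by (binding, value) page key (hypothetical helper)."""
    pages = defaultdict(list)
    for row in rows:
        # One page per lower-cased first letter of the title ...
        pages[("dcterms_title", row["dcterms_title"][0].lower())].append(row["uri"])
        # ... and one page per publication year.
        pages[("dcterms_issued", row["dcterms_issued"][:4])].append(row["uri"])
    return dict(pages)


rows = [
    {"uri": "http://ex.org/1", "dcterms_title": "Abc", "dcterms_issued": "2009-04-02"},
    {"uri": "http://ex.org/2", "dcterms_title": "Abcd", "dcterms_issued": "2010-06-30"},
    {"uri": "http://ex.org/3", "dcterms_title": "Dfg", "dcterms_issued": "2010-08-01"},
]
pages = sketch_select_for_pages(rows)
```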

toc_item(binding, row)[source]¶: Returns a formatted version of row, using Element objects

toc_generate_pages(pagecontent, pagesets, otherrepos=[])[source]¶

Creates a set of TOC pages by calling toc_generate_page().

Parameters:

pagecontent – Result from `toc_select_for_pages()`
pagesets – Result from `toc_pagesets()`
otherrepos – A list of document repository instances

toc_generate_first_page(pagecontent, pagesets, otherrepos=[])[source]¶: Generate the main page of TOC pages.

toc_generate_page(binding, value, documentlist, pagesets, effective_basefile=None, otherrepos=[])[source]¶

Generate a single TOC page.

Parameters:

binding – The binding used (e.g. ‘title’ or ‘issued’)
value – The value for the used binding (e.g. ‘a’ or ‘2013’)
documentlist – Result from toc_select_for_pages()
pagesets – Result from toc_pagesets()
effective_basefile – Place the resulting page somewhere else than toc/*binding*/*value*.html
otherrepos – A list of document repository instances

news(otherrepos=[])[source]¶: Create a set of Atom feeds and corresponding HTML pages for new/updated documents in different categories in the repository.

news_facet_entries(keyfunc=None, reverse=True)[source]¶

Returns a set of entries, decorated with information from faceted_data(), used for feed generation.

Parameters:

keyfunc (callable) – Function that, given a dict, returns an element from that dict, used for sorting entries.
reverse – The direction of the sorting

Returns:	entries, each represented as a dict
Return type:	list

news_feedsets(data, facets)[source]¶

Calculate the set of needed feedsets based on facets and instance values in the data

Parameters:

data – list of dicts, each dict containing metadata about a single document
facets – list of Facet objects

Returns:	A list of Feedset objects

news_select_for_feeds(data, feedsets, facets)[source]¶

Go through all data rows (each row representing a document) and, for each newsfeed, select those document entries that are to appear in that feed

Parameters:

data – List of dicts as returned by `news_facet_entries()`
feedsets – List of feedset objects, the result from `news_feedsets()`
facets – Result from `facets()`

Returns:	mapping between a (binding, value) tuple and entries for that tuple

news_item(binding, entry)[source]¶

Returns a modified version of the news entry for use in a specific feed.

You can override this if you e.g. want to customize the title or summary of each entry in a particular feed. The default implementation does not change the entry in any way.

Parameters:

binding (str) – identifier for the feed being constructed, derived from a facet object.
entry (ferenda.DocumentEntry) – The entry object to modify

Returns:	The modified entry
Return type:	ferenda.DocumentEntry

news_entries()[source]¶: Return a generator of all available (and published) DocumentEntry objects.

news_generate_feeds(feedsets, generate_html=True)[source]¶

Creates a set of Atom feeds (and optionally HTML equivalents) by calling news_write_atom() for each feed in feedsets.

Parameters:

feedsets (list) – the result of `news_feedsets()`
generate_html (bool) – Whether to generate HTML equivalents of the atom feeds

news_write_atom(entries, title, slug, archivesize=100)[source]¶

Given a list of Atom entry-like objects, including links to RDF and PDF files (if applicable), create a rinfo-compatible Atom feed, optionally splitting into archives.

Parameters:

entries (list) – `DocumentEntry` objects
title (str) – feed title
slug (str) – used for constructing the path where the Atom files are stored and the URL where it’s published.
archivesize (int) – The number of entries in each archive file. The main file might contain up to 2 x this number.

frontpage_content(primary=False)[source]¶

If the module wants to provide any particular content on the frontpage, it can do so by returning an XHTML fragment (in text form) here.

Parameters:	primary (bool) – Whether the caller wants the module to take primary responsibility for the frontpage content. If `False`, the caller only expects a smaller amount of content (like a smaller presentation of the repository and the document it contains).
Returns:	the XHTML fragment
Return type:	str

If primary is true, the caller wants the module to take primary responsibility for the frontpage content. If primary is false, the caller only expects a smaller amount of content (like a smaller presentation of the repository and the document it contains).

status(basefile=None, samplesize=3)[source]¶: Prints out some basic status information about this repository.

get_status()[source]¶

Returns basic data about the state of this repository, used by status(). Returns a dict of dicts, one per state (‘download’, ‘parse’ and ‘generated’), each containing lists under the ‘exists’ and ‘todo’ keys.

Returns:	Status information
Return type:	dict

tabs()[source]¶

Get the navigation menu segment(s) provided by this docrepo.

Returns a list of tuples, where each tuple will be rendered as a tab in the main UI. First element of the tuple is the link text, and the second is the link destination. Normally, a module will only return a single tab.

Returns:	(link text, link destination) tuples
Return type:	list

Example:

>>> d = DocumentRepository()
>>> d.tabs()
[('base', 'http://localhost:8000/dataset/base')]

footer()[source]¶

Get a list of resources provided by this repo for publication in the site footer.

Works like tabs(), but normally returns an empty list. The repo ferenda.sources.general.Static is an exception.

http_handle(environ)[source]¶: Used by the WSGI support to indicate if this repo can provide a response to a particular request. If so, returns a tuple (fp, length, memtype), where fp is an open file of the document to be returned.
