The DocumentRepository class

- class ferenda.DocumentRepository(config=None, **kwargs) [source]
Base class for handling a repository of documents.
Handles downloading, parsing and generation of HTML versions of documents. Start building your application by subclassing this class, and then override methods in order to customize the downloading, parsing and generation behaviour.
Parameters: **kwargs – Any named argument overrides any similarly-named Configuration file parameter. Example:
    >>> class MyRepo(DocumentRepository):
    ...     alias = "myrepo"
    ...
    >>> d = MyRepo(datadir="/tmp/ferenda")
    >>> d.store.downloaded_path("mybasefile").replace(os.sep, '/')
    '/tmp/ferenda/myrepo/downloaded/mybasefile.html'
Note
This class has a ridiculous number of properties and methods that you can override to control most of Ferenda's behaviour in all stages. For basic usage, you only need a fraction of them. Please don't be intimidated/horrified.
- alias = 'base'
A short name for the class, used by the command line ferenda-build.py tool. Also determines where to store downloaded, parsed and generated files. When you subclass DocumentRepository you must override this.
- storage_policy = 'file'
Some repositories have documents in several formats, documents split amongst several files, or embedded resources. If storage_policy is set to 'dir', then each document gets its own directory (the default filename being index + suffix), otherwise each doc gets stored as a file in a directory with other files. Affects ferenda.DocumentStore.path() (and therefore all other *_path methods).
- namespaces = ['rdf', 'rdfs', 'xsd', 'xsi', 'dcterms', 'skos', 'foaf', 'xhv', 'owl', 'prov', 'bibo']
The namespaces that are included in the XHTML and RDF files generated by parse(). This can be a list of strings, in which case the strings are assumed to be well-known prefixes to established namespaces, or a list of (prefix, namespace) tuples. All well-known prefixes are available in ferenda.util.ns.
If you specify a namespace for a well-known ontology/vocabulary, that ontology will be available as a Graph from the ontologies property.
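For example, a subclass that needs a custom vocabulary alongside the well-known prefixes can declare it as a (prefix, namespace) tuple. This is only a sketch; the "ex" prefix and its URI are made up for illustration:

    from ferenda import DocumentRepository

    class MyRepo(DocumentRepository):
        alias = "myrepo"
        # well-known prefixes can be given as plain strings; custom
        # vocabularies as (prefix, namespace) tuples
        namespaces = ['rdf', 'rdfs', 'dcterms', 'foaf', 'prov',
                      ('ex', 'http://example.org/vocab#')]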
- collate_locale = None
The locale to be used for sorting (collating). This affects TOCs, see Defining facets for grouping and sorting.
- loadpath = None
If defined (by default it's None), this should be a list of directories that take precedence over the loadpath given by the current config.
- lang = 'en'
The language (expressed as a two-letter ISO 639-1 code) which the source documents are assumed to be written in (unless otherwise specified), and the language which the output documents should use.
- start_url = 'http://example.org/'
The main entry page for the remote web store of documents. May be a list of documents, a search form or whatever. If it's something more complicated than a simple list of documents, you need to override download() in order to tell which documents are to be downloaded.
- document_url_template = 'http://example.org/docs/%(basefile)s.html'
A string template for creating URLs for individual documents on the remote web server. Directly used by remote_url() and indirectly by download_single().
- document_url_regex = 'http://example.org/docs/(?P<basefile>\\w+).html'
A regex that matches URLs for individual documents – the reverse of what document_url_template is used for. Used by download() to find suitable links if basefile_regex doesn't match. Must define the named group basefile using the (?P<basefile>...) syntax.
- basefile_regex = '^ID: ?(?P<basefile>[\\w\\d\\:\\/]+)$'
A regex for matching document names in link text, as used by download(). Must define a named group basefile, just like document_url_template.
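Together, these attributes are often all that a simple repo needs for the default download() to work. A sketch with made-up URLs and patterns:

    from ferenda import DocumentRepository

    class ExampleRepo(DocumentRepository):
        alias = "example"
        # the page that links to all documents
        start_url = "http://example.org/docs/index.html"
        # how a document URL is built from a basefile, and how a basefile
        # is recovered from a document URL
        document_url_template = "http://example.org/docs/%(basefile)s.html"
        document_url_regex = r"http://example.org/docs/(?P<basefile>\w+)\.html"
        # fallback: recognize basefiles in link text such as "ID: 123"
        basefile_regex = r"^ID: ?(?P<basefile>\w+)$"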
- downloaded_suffix = '.html'
File suffix for the main document format. Determines the suffix of downloaded files.
- download_archive = True
If True (the default), any attempt by download_single to download a basefile that already exists will cause the old version to be archived. See Archiving.
- download_iterlinks = True
If True (the default), download_get_basefiles() will be called with an iterator that returns (element, attribute, link, pos) tuples (like lxml.etree.iterlinks() does). Otherwise, it will be called with the downloaded index page as a string.
- download_accept_404 = False
If True (default: False), any 404 HTTP error encountered during download will NOT raise an error. Instead, the download process will just move on to the next identified basefile.
- download_accept_406 = False

- download_accept_400 = False
- download_reverseorder = False
If True (default: False), download_get_basefiles will process received basefiles in reverse order.
- source_encoding = 'utf-8'
The character set that the source documents use (if applicable).
- rdf_type = rdflib.term.URIRef('http://xmlns.com/foaf/0.1/Document')
The RDF type of the documents you are handling (expressed as a rdflib.term.URIRef object).
Note
If your repo produces documents of several different types, you can define this as a list (or other iterable) of URIRef objects. faceted_data() will only find documents that are any of the types.
- required_predicates = [rdflib.term.URIRef('http://www.w3.org/1999/02/22-rdf-syntax-ns#type')]
A list of RDF predicates that should be present in the output data. If any of these are missing from the result of parse(), a warning is logged. You can add to this list as a form of simple validation of your parsed data.
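For example, a repo that wants a warning whenever a parsed document lacks a title could extend the list. A sketch; which predicates to require is of course repo-specific:

    from rdflib import RDF
    from rdflib.namespace import DCTERMS

    from ferenda import DocumentRepository

    class MyRepo(DocumentRepository):
        # warn if parse() produces data without rdf:type or dcterms:title
        required_predicates = [RDF.type, DCTERMS.title]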
- max_resources = 1000
The maximum number of sub-resources (as defined by having a specific URI) that documents in this repo can have. This is checked in a validation step at the end of parse. If set to None, no validation of the number of resources is done.
- parse_content_selector = 'body'
CSS selector used to select the main part of the document content by the default parse() implementation.

- parse_filter_selectors = ['script']
CSS selectors used to filter/remove certain parts of the document content by the default parse() implementation.
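If the interesting content of your downloaded HTML lives in a specific element, both selectors can be adjusted. A sketch with hypothetical selectors:

    from ferenda import DocumentRepository

    class MyRepo(DocumentRepository):
        # use only the <div id="content"> element as the document body ...
        parse_content_selector = "div#content"
        # ... and drop scripts, stylesheets and navigation markup from it
        parse_filter_selectors = ["script", "style", "div.navigation"]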
- xslt_template = 'xsl/generic.xsl'
A template used by generate() to transform the XML file into browser-ready HTML. If your document type is complex, you might want to override this (and write your own XSLT transform). You should include base.xslt in that template, though.
- sparql_annotations = 'sparql/annotations.rq'
A template SPARQL CONSTRUCT query for document annotations.

- sparql_expect_results = True
If True (the default) and the sparql_annotations_query doesn't return any results, issue a warning.
- documentstore_class
alias of ferenda.documentstore.DocumentStore

- requesthandler_class
alias of ferenda.requesthandler.RequestHandler
- ontologies
Provides a Graph loaded with the ontologies/vocabularies that this docrepo uses (as determined by the namespaces property).
If you're using your own vocabularies, you can place them (in Turtle format) as vocab/[prefix].ttl somewhere in your resource loadpath to have them loaded into the graph.
Note
Some system-like vocabularies (rdf, rdfs and owl) are never loaded into the graph.
- commondata
Provides a Graph containing any extra data that is common to documents in this docrepo – this can be information about the different entities that publish the documents, the printed series in which they're published, and so on. The data is taken from extra/[repoalias].ttl.
- config
The LayeredConfig object that contains the current configuration for this docrepo instance. You can read or write individual properties of this object, or replace it with a new LayeredConfig object entirely.
- lookup_resource(label, predicate=rdflib.term.URIRef('http://xmlns.com/foaf/0.1/name'), cutoff=0.8, warn=True) [source]
Given a textual identifier (i.e. the name of something), look up the canonical URI for that thing in the RDF graph containing extra data (i.e. the graph that commondata provides). The graph should have a foaf:name statement about the URI, with the sought label as the object.
Since data is imperfect, the textual label may be spelled or expressed differently in different contexts. This method therefore performs fuzzy matching (using difflib.get_close_matches()); the cutoff parameter determines exactly how fuzzy this matching is.
If no resource matches the given label, a KeyError is raised.
Parameters: - label (str) – The textual label to look up
- predicate (rdflib.term.URIRef) – The RDF predicate to use when looking for the label
- cutoff (float) – How fuzzy the matching may be (1 = must match exactly, 0 = anything goes)
- warn (bool) – Whether to log a warning when an inexact match is performed
Returns: The matching resource
Return type:
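A typical call, assuming a subclass whose commondata contains a resource with foaf:name "Example Publisher" (the class name and the misspelling below are made up for illustration):

    d = MyRepo()
    # an inexact but close label still matches; a warning is logged for it
    publisher_uri = d.lookup_resource("Example Publsher")
    # raises KeyError if nothing in commondata comes close enough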
- classmethod get_default_options() [source]
Returns the class's default configuration properties. These can be overridden by a configuration file, or by named arguments to __init__(). See Configuration for a list of standard configuration properties (your subclass is free to define and use additional configuration properties).
Returns: default configuration properties
Return type: dict
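A subclass that needs an extra configuration property typically extends the inherited dict. A sketch, where the "apikey" option is hypothetical:

    from ferenda import DocumentRepository

    class MyRepo(DocumentRepository):
        @classmethod
        def get_default_options(cls):
            opts = super(MyRepo, cls).get_default_options()
            # a repo-specific option that can now be set in the
            # configuration file or on the command line
            opts['apikey'] = ''
            return opts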
- classmethod setup(action, config, *args, **kwargs) [source]
Runs before any of the *_all methods starts executing. It just calls the appropriate setup method, i.e. if action is parse, then this method calls parse_all_setup (if defined) with the config object as single parameter.
- classmethod teardown(action, config, *args, **kwargs) [source]
Runs after any of the *_all methods has finished executing. It just calls the appropriate teardown method, i.e. if action is parse, then this method calls parse_all_teardown (if defined) with the config object as single parameter.
- get_archive_version(basefile) [source]
Get a version identifier for the current version of the document identified by basefile.
The default implementation simply increments the most recent archived version identifier, starting at "1". If versions in your docrepo are normally identified in some other way (such as SCM revision numbers, dates or similar) you should override this method to return those identifiers.
Parameters: basefile (str) – The basefile of the document to archive
Returns: The version identifier for the current version of the document.
Return type: str
- qualified_class_name() [source]
The qualified class name of this class
Returns: class name (e.g. ferenda.DocumentRepository)
Return type: str
- canonical_uri(basefile) [source]
The canonical URI for the document identified by basefile.
Returns: The canonical URI
Return type: str
- dataset_uri(param=None, value=None, feed=False) [source]
Returns the URI that identifies the dataset that this document repository provides. The default implementation is based on the url config parameter and the alias attribute of the class, cf. http://localhost:8000/dataset/base.
Parameters: - param – An optional parameter name representing a way of creating a subset of the dataset (e.g. all documents whose title starts with a particular letter)
- value – A value for param (e.g. "a")
    >>> d = DocumentRepository()
    >>> d.alias
    'base'
    >>> d.config.url = "http://example.org/"
    >>> d.dataset_uri()
    'http://example.org/dataset/base'
    >>> d.dataset_uri("title","a")
    'http://example.org/dataset/base?title=a'
    >>> d.dataset_uri(feed=True)
    'http://example.org/dataset/base/feed'
    >>> d.dataset_uri("title", "a", feed=True)
    'http://example.org/dataset/base/feed?title=a'
    >>> d.dataset_uri("title", "a", feed=".atom")
    'http://example.org/dataset/base/feed.atom?title=a'
- basefile_from_uri(uri) [source]
The reverse of canonical_uri(). Returns None if the uri doesn't map to a basefile in this repo.
    >>> d = DocumentRepository()
    >>> d.alias
    'base'
    >>> d.config.url = "http://example.org/"
    >>> d.basefile_from_uri("http://example.org/res/base/123/a")
    '123/a'
    >>> d.basefile_from_uri("http://example.org/res/base/123/a#S1")
    '123/a'
    >>> d.basefile_from_uri("http://example.org/res/other/123/a") # None
- download(basefile=None, reporter=None) [source]
Downloads all documents from a remote web service.
The default generic implementation assumes that all documents are linked from a single page (which has the url of start_url), that they all have URLs matching the document_url_regex, or that the link text is always equal to basefile (as determined by basefile_regex). If these assumptions don't hold, you need to override this method.
If you do override it, your download method should read and set the lastdownload parameter to either the datetime of the last download or any other module-specific string (id number or similar).
You should also read the refresh parameter. If it is True (the default), then you should call download_single() for every basefile you encounter, even though they may already exist in some form on disk. download_single() will normally be using conditional GET to see if there is a newer version available.
See Writing your own download implementation for more details.
Returns: True if any document was downloaded, False otherwise.
Return type: bool
- download_get_basefiles(source) [source]
Given source (an iterator that provides (element, attribute, link, pos) tuples, like lxml.etree.iterlinks()), generate tuples (basefile, link) for all document links found in source.
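An override is usually written as a generator. A sketch, assuming download_iterlinks is True and that document URLs end in "/doc/<number>" (a made-up pattern):

    import re

    from ferenda import DocumentRepository

    class MyRepo(DocumentRepository):
        def download_get_basefiles(self, source):
            for element, attribute, link, pos in source:
                m = re.search(r"/doc/(?P<basefile>\d+)$", link)
                if m:
                    yield m.group("basefile"), link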
- download_single(basefile, url=None, orig_url=None) [source]
Downloads the document from the web (unless explicitly specified, the URL to download is determined by document_url_template combined with basefile, and the location on disk is determined by the function downloaded_path()).
If the document exists on disk, but the version on the web is unchanged (determined using a conditional GET), the file on disk is left unchanged (i.e. the timestamp is not modified).
Parameters: - basefile (string) – The basefile of the document to download
- url (str) – The URL to download (optional)
- orig_url (str) – The URL to store in the documententry file (might be a landing page containing the actual document URL)
Returns: True if the document was downloaded and stored on disk, False if the file on disk was not updated.
- download_if_needed(url, basefile, archive=True, filename=None, sleep=1, extraheaders=None) [source]
Downloads a remote resource to a local file. If a different version is already in place, archive that old version.
Parameters: - url (str) – The url to download
- basefile (str) – The basefile of the document to download
- archive (bool) – Whether to archive existing older versions of the document, or just delete the previously downloaded file
- filename (str) – The filename to download to. If not provided, the filename is derived from the supplied basefile
Returns: True if the local file was updated (and archived), False otherwise.
Return type:
- download_is_different(existing, new) [source]
Returns True if the new file is semantically different from the existing file.
- remote_url(basefile) [source]
Get the URL of the source document at its remote location, unless the source document is fetched by other means or if it cannot be computed from basefile only. The default implementation uses document_url_template to calculate the url.
Example:
    >>> d = DocumentRepository()
    >>> d.remote_url("123/a")
    'http://example.org/docs/123/a.html'
    >>> d.document_url_template = "http://mysite.org/archive/%(basefile)s/"
    >>> d.remote_url("123/a")
    'http://mysite.org/archive/123/a/'
Parameters: basefile (str) – The basefile of the source document
Returns: The remote url where the document can be fetched, or None.
Return type: str
- generic_url(basefile, maindir, suffix) [source]
Analogous to ferenda.DocumentStore.path(), calculate the full local url for the given basefile and stage of processing.
Parameters:
Returns: The local url
Return type:
- downloaded_url(basefile) [source]
Get the full local url for the downloaded file for the given basefile.
Parameters: basefile (str) – The basefile for which to calculate the local url
Returns: The local url
Return type: str
    >>> d = DocumentRepository()
    >>> d.downloaded_url("123/a")
    'http://localhost:8000/base/downloaded/123/a.html'
- classmethod parse_all_setup(config, *args, **kwargs) [source]
Runs any action needed prior to parsing all documents in a docrepo. The default implementation does nothing.
Note
This is a classmethod for now (and that's why a config object is passed as an argument), but might change to an instance method.
- classmethod parse_all_teardown(config, *args, **kwargs) [source]
Runs any cleanup action needed after parsing all documents in a docrepo. The default implementation does nothing.
Note
Like parse_all_setup() this might change to an instance method.
- parse(doc, needed=True) [source]
Parse downloaded documents into structured XML and RDF.
It will also save the same RDF statements in a separate RDF/XML file.
You will need to provide your own parsing logic, but often it's easier to just override parse_{metadata, document}_from_soup (assuming your indata is in an HTML format parseable by BeautifulSoup) and let the base class read and write the files.
If your data is not in an HTML format, or BeautifulSoup is not an appropriate parser to use, override this method.
Parameters: doc (ferenda.Document) – The document object to fill in.
- parse_entry_id(doc) [source]
Construct an id (URI) for the document, to be stored in its DocumentEntry JSON file.
Normally, this is identical to the main document URI as specified in doc.uri.

- parse_entry_title(doc) [source]
Construct a useful title for the document, like its dcterms:title, to be stored in its DocumentEntry JSON file.

- parse_entry_summary(doc) [source]
Construct a useful summary for the document, like its dcterms:abstract, to be stored in its DocumentEntry JSON file.
- soup_from_basefile(basefile, encoding='utf-8', parser='lxml') [source]
Load the downloaded document for basefile into a BeautifulSoup object
Parameters:
Returns: The parsed document as a BeautifulSoup object
Note
Helper function. You probably don't need to override it.
- parse_metadata_from_soup(soup, doc) [source]
Given a BeautifulSoup document, retrieve all document-level metadata from it and put it into the given doc object's meta property.
Note
The default implementation sets rdf:type, dcterms:title, dcterms:identifier and prov:wasGeneratedBy properties in doc.meta, as well as setting the language of the document in doc.lang.
Parameters: - soup – A parsed document, as BeautifulSoup object
- doc (ferenda.Document) – Our document
Returns: None
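A sketch of an override that lets the default implementation run first and then adds one more statement, assuming the downloaded HTML carries the publication date in a <time> element (a made-up convention):

    from rdflib import URIRef, Literal
    from rdflib.namespace import DCTERMS, XSD

    from ferenda import DocumentRepository

    class MyRepo(DocumentRepository):
        def parse_metadata_from_soup(self, soup, doc):
            # the default implementation sets rdf:type, dcterms:title etc.
            super(MyRepo, self).parse_metadata_from_soup(soup, doc)
            published = soup.find("time")
            if published and published.get("datetime"):
                doc.meta.add((URIRef(doc.uri), DCTERMS.issued,
                              Literal(published["datetime"], datatype=XSD.date)))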
- parse_document_from_soup(soup, doc) [source]
Given a BeautifulSoup document, convert it into the provided doc object's body property as suitable ferenda.elements objects.
Note
The default implementation respects parse_content_selector and parse_filter_selectors.
Parameters: - soup – A parsed document as a BeautifulSoup object
- doc (ferenda.Document) – Our document
Returns: None
- patch_if_needed(basefile, text) [source]
Given basefile and the entire text of the downloaded or intermediate document, check whether a patch file exists under self.config.patchdir, and if so, apply it. Returns (patchedtext, patchdescription) if a patch was applied, (text, None) otherwise.
Parameters:
- make_document(basefile=None) [source]
Create a Document object with basic initialized fields.
Note
Helper method used by the makedocument() decorator.
Parameters: basefile (str) – The basefile for the document
Return type: ferenda.Document
- make_graph() [source]
Initialize an rdflib Graph object with proper namespace prefix bindings (as determined by namespaces).
Return type: rdflib.Graph
- create_external_resources(doc) [source]
Optionally create external files that go together with the parsed file (stylesheets, images, etc).
The default implementation does nothing.
Parameters: doc (ferenda.Document) – The document
- render_xhtml(doc, outfile=None) [source]
Renders the parsed object structure as an XHTML file with RDFa attributes (also returns the same XHTML as a string).
Parameters: - doc (ferenda.Document) – The document to render
- outfile (str) – The file name for the XHTML document
Returns: The XHTML document
Return type:
- render_xhtml_tree(doc) [source]
Renders the parsed object structure as a lxml.etree._Element object.
Parameters: doc (ferenda.Document) – The document to render
Returns: The XHTML document as a lxml structure
Return type: lxml.etree._Element
- parsed_url(basefile) [source]
Get the full local url for the parsed file for the given basefile.
Parameters: basefile (str) – The basefile for which to calculate the local url
Returns: The local url
Return type: str

- distilled_url(basefile) [source]
Get the full local url for the distilled RDF/XML file for the given basefile.
Parameters: basefile (str) – The basefile for which to calculate the local url
Returns: The local url
Return type: str
- classmethod relate_all_setup(config, *args, **kwargs) [source]
Runs any cleanup action needed prior to relating all documents in a docrepo. The default implementation clears the corresponding context (see dataset_uri()) in the triple store.
Note
Like parse_all_setup() this might change to an instance method.
Returns False if no relation needs to be done (as determined by the timestamp on the dumped N-Triples file).
- classmethod relate_all_teardown(config, *args, **kwargs) [source]
Runs any cleanup action needed after relating all documents in a docrepo. The default implementation dumps all RDF data loaded into the triplestore into one giant N-Triples file.
Note
Like parse_all_setup() this might change to an instance method.
- relate(basefile, otherrepos=[], needed=RelateNeeded(fulltext=True, dependencies=True, triples=True)) [source]
Runs various indexing operations for the document.
This includes inserting RDF statements into a triple store, adding this document to the dependency list of all documents that it refers to, and putting the text of the document into a fulltext index.
- relate_triples(basefile, removesubjects=False) [source]
Insert the (previously distilled) RDF statements into the triple store.
Parameters:
Returns: None
- relate_dependencies(basefile, repos=[]) [source]
For each document that the basefile document refers to, attempt to find this document in the current or any other docrepo, and add the parsed document path to that document's dependency file.

- add_dependency(basefile, dependencyfile) [source]
Add the dependencyfile to basefile's dependency file. Returns True if anything new was added, False otherwise.
- relate_fulltext(basefile, repos=None) [source]
Index the text of the document into the fulltext index. Also indexes all metadata that facets() indicate should be indexed.
Parameters: basefile (str) – The basefile for the document to be indexed.
Returns: None
- facets() [source]
Provides a list of Facet objects that specify how documents in your docrepo should be grouped.
Override this if you want to specify your own way of grouping data in your docrepo.
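An override might look like the following sketch, which groups documents by publisher in addition to type, title and issued date (which facets make sense is of course repo-specific):

    from rdflib import RDF
    from rdflib.namespace import DCTERMS

    from ferenda import DocumentRepository, Facet

    class MyRepo(DocumentRepository):
        def facets(self):
            return [Facet(RDF.type),
                    Facet(DCTERMS.title),
                    Facet(DCTERMS.issued),
                    Facet(DCTERMS.publisher)]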
- faceted_data() [source]
Provides a list of dicts, each containing a row of information about a single document in the repository. The exact fields provided are controlled by the list of Facet objects returned by facets().
Note
The same document can occur multiple times if any of its facets have multiple_values set, once for each distinct value that that facet has.
- facet_query(context) [source]
Constructs a SPARQL SELECT query that fetches all information needed to create faceted data.
Parameters: context (str) – The context (named graph) to which to limit the query.
Returns: The SPARQL query
Return type: str
Example:
    >>> d = DocumentRepository()
    >>> expected = """PREFIX dcterms: <http://purl.org/dc/terms/>
    ... PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    ... PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    ...
    ... SELECT DISTINCT ?uri ?rdf_type ?dcterms_title ?dcterms_publisher ?dcterms_identifier ?dcterms_issued
    ... FROM <http://example.org/ctx/base>
    ... WHERE {
    ...     ?uri rdf:type foaf:Document .
    ...     OPTIONAL { ?uri rdf:type ?rdf_type . }
    ...     OPTIONAL { ?uri dcterms:title ?dcterms_title . }
    ...     OPTIONAL { ?uri dcterms:publisher ?dcterms_publisher . }
    ...     OPTIONAL { ?uri dcterms:identifier ?dcterms_identifier . }
    ...     OPTIONAL { ?uri dcterms:issued ?dcterms_issued . }
    ...
    ... }"""
    >>> d.facet_query("http://example.org/ctx/base") == expected
    True
- facet_select(query) [source]
Select all data from the triple store needed to create faceted data.
Parameters: context (str) – The context (named graph) to restrict the query to. If None, search the entire triplestore.
Returns: The results of the query, as python objects
Return type: set of dicts
- classmethod generate_all_setup(config, *args, **kwargs) [source]
Runs any action needed prior to generating all documents in a docrepo. The default implementation does nothing.
Note
Like parse_all_setup() this might change to an instance method.

- classmethod generate_all_teardown(config, *args, **kwargs) [source]
Runs any cleanup action needed after generating all documents in a docrepo. The default implementation does nothing.
Note
Like parse_all_setup() this might change to an instance method.
- generate(basefile, otherrepos=[], needed=True) [source]
Generate a browser-ready HTML file from structured XML and RDF.
Uses the XML and RDF files constructed by ferenda.DocumentRepository.parse().
The generation is done by XSLT, and normally you won't need to override this, but you might want to provide your own xslt file and set ferenda.DocumentRepository.xslt_template to the name of that file.
If you want to generate your browser-ready HTML by any other means than XSLT, you should override this method.
Parameters: basefile (str) – The basefile for which to generate HTML
Returns: None
- get_url_transform_func(repos=None, basedir=None, develurl=None, remove_missing=False) [source]
Returns a function that, when called with a URI, transforms that URI to another suitable reference. This can be used to e.g. map between canonical URIs and local URIs. The function is run on all URIs in a post-processing step after generate() runs. The default implementation maps URIs to local file paths, and is only run if config.staticsite is True.
- prep_annotation_file(basefile) [source]
Helper function used by generate() – prepares an RDF/XML file containing statements that in some way annotate the information found in the document that generate handles, like the URI/title of other documents that refer to this one.
Parameters: basefile (str) – The basefile for which to collect annotating statements.
Returns: The full path to the prepared RDF/XML file
Return type: str
- construct_annotations(uri) [source]
Construct an RDF graph containing metadata by running the query provided by construct_sparql_query().

- construct_sparql_query(uri) [source]
Construct a SPARQL query that will select metadata relating to uri in some way, using the query template specified by sparql_annotations.
- graph_to_annotation_file(graph) [source]
Converts an RDFLib graph into an XML file with the same statements, ordered using the Grit format (https://code.google.com/p/oort/wiki/Grit) for easier XSLT inclusion.
Parameters: graph (rdflib.graph.Graph) – The graph to convert
Returns: A serialized XML document with the RDF statements
Return type: str

- annotation_file_to_graph(annotation_file) [source]
Converts an annotation file (using the Grit format) back into an RDFLib graph.
Parameters: annotation_file (str) – The filename of a serialized XML document with RDF statements
Returns: The RDF statements as a regular graph
Return type: rdflib.Graph
- generated_url(basefile) [source]
Get the full local url for the generated file for the given basefile.
Parameters: basefile (str) – The basefile for which to calculate the local url
Returns: The local url
Return type: str
- transformlinks(basefile, otherrepos=[]) [source]
Transform links in generated HTML files.
If the develurl or staticsite settings are used, this function makes sure links are transformed to appropriate local links.
- toc(otherrepos=[]) [source]
Creates a set of pages that together act as a table of contents for all documents in the repository. For smaller repositories a single page might be enough, but for repositories with a few hundred documents or more, there will usually be one page for all documents starting with A, one for those starting with B, and so on. There might be different ways of browsing/drilling down, e.g. by title, publication year, keyword and so on.
The default implementation calls faceted_data() to get all data from the triple store, facets() to find out the facets for ordering, toc_pagesets() to calculate the total set of TOC html files, toc_select_for_pages() to create a list of documents for each TOC html file, and finally toc_generate_pages() to create the HTML files. The default implementation assumes that documents have a title (in the form of a dcterms:title property) and a publication date (in the form of a dcterms:issued property).
You can override any of these methods to customize any part of the toc generation process. Often overriding facets() to specify other document properties will be sufficient.
- toc_pagesets(data, facets) [source]
Calculate the set of needed TOC pages based on the result rows
Parameters: - data – list of dicts, each dict containing metadata about a single document
- facets – list of Facet objects
Returns: A set of Pageset objects
Return type:
Example:
    >>> d = DocumentRepository()
    >>> rows = [{'uri':'http://ex.org/1','dcterms_title':'Abc','dcterms_issued':'2009-04-02'},
    ...         {'uri':'http://ex.org/2','dcterms_title':'Abcd','dcterms_issued':'2010-06-30'},
    ...         {'uri':'http://ex.org/3','dcterms_title':'Dfg','dcterms_issued':'2010-08-01'}]
    >>> from rdflib.namespace import DCTERMS
    >>> facets = [Facet(DCTERMS.title), Facet(DCTERMS.issued)]
    >>> pagesets = d.toc_pagesets(rows, facets)
    >>> pagesets[0].label
    'Sorted by title'
    >>> pagesets[0].pages[0]
    <TocPage binding=dcterms_title linktext=a title=Documents starting with "a" value=a>
    >>> pagesets[0].pages[0].linktext
    'a'
    >>> pagesets[0].pages[0].title
    'Documents starting with "a"'
    >>> pagesets[0].pages[0].binding
    'dcterms_title'
    >>> pagesets[0].pages[0].value
    'a'
    >>> pagesets[1].label
    'Sorted by publication year'
    >>> pagesets[1].pages[0]
    <TocPage binding=dcterms_issued linktext=2009 title=Documents published in 2009 value=2009>
- toc_select_for_pages(data, pagesets, facets) [source]
Go through all data rows (each row representing a document) and, for each toc page, select those documents that are to appear on a particular page.
Example:
    >>> d = DocumentRepository()
    >>> rows = [{'uri':'http://ex.org/1','dcterms_title':'Abc','dcterms_issued':'2009-04-02'},
    ...         {'uri':'http://ex.org/2','dcterms_title':'Abcd','dcterms_issued':'2010-06-30'},
    ...         {'uri':'http://ex.org/3','dcterms_title':'Dfg','dcterms_issued':'2010-08-01'}]
    >>> from rdflib.namespace import DCTERMS
    >>> facets = [Facet(DCTERMS.title), Facet(DCTERMS.issued)]
    >>> pagesets = d.toc_pagesets(rows, facets)
    >>> expected = {('dcterms_title','a'): [[Link('Abc',uri='http://ex.org/1')],
    ...                                     [Link('Abcd',uri='http://ex.org/2')]],
    ...             ('dcterms_title','d'): [[Link('Dfg',uri='http://ex.org/3')]],
    ...             ('dcterms_issued','2009'): [[Link('Abc',uri='http://ex.org/1')]],
    ...             ('dcterms_issued','2010'): [[Link('Abcd',uri='http://ex.org/2')],
    ...                                         [Link('Dfg',uri='http://ex.org/3')]]}
    >>> d.toc_select_for_pages(rows, pagesets, facets) == expected
    True
Parameters: - data – List of dicts as returned by toc_select()
- pagesets – Result from toc_pagesets()
- facets – Result from facets()
Returns: mapping between toc basefile and documentlist for that basefile
Return type:
- toc_generate_pages(pagecontent, pagesets, otherrepos=[]) [source]
Creates a set of TOC pages by calling toc_generate_page().
Parameters: - pagecontent – Result from toc_select_for_pages()
- pagesets – Result from toc_pagesets()
- otherrepos – A list of document repository instances
- toc_generate_first_page(pagecontent, pagesets, otherrepos=[]) [source]
Generate the main page of TOC pages.
- toc_generate_page(binding, value, documentlist, pagesets, effective_basefile=None, title=None, otherrepos=[]) [source]
Generate a single TOC page.
Parameters: - binding – The binding used (e.g. 'title' or 'issued')
- value – The value for the used binding (e.g. 'a' or '2013')
- documentlist – Result from toc_select_for_pages()
- pagesets – Result from toc_pagesets()
- effective_basefile – Place the resulting page somewhere else than toc/*binding*/*value*.html
- otherrepos – A list of document repository instances
- news_sortkey = 'updated'

- news(otherrepos=[]) [source]
Create a set of Atom feeds and corresponding HTML pages for new/updated documents in different categories in the repository.
- news_facet_entries(keyfunc=None, reverse=True) [source]
Returns a set of entries, decorated with information from faceted_data(), used for feed generation.
Parameters: - keyfunc (callable) – Function that given a dict, returns an element from that dict, used for sorting entries
- reverse – The direction of the sorting
Returns: entries, each represented as a dict
Return type:
- news_feedsets_main_label = 'All documents'

- news_feedsets(data, facets) [source]
Calculate the set of needed feedsets based on facets and instance values in the data
Parameters: - data – list of dicts, each dict containing metadata about a single document
- facets – list of Facet objects
Returns: A list of Feedset objects
- news_entrysort_key() [source]
Return a function that can act as a keyfunc in a sorted() call to sort your entries in whatever way is suitable. The keyfunc takes three values (row, binding, resource_graph).
Only really used for the main feedset? The other feedsets, based on facets, use that facet's keyfunc.
- news_select_for_feeds(data, feedsets, facets) [source]
Go through all data rows (each row representing a document) and, for each newsfeed, select those document entries that are to appear in that feed.
Parameters: - data – List of dicts as returned by news_facet_entries()
- feedsets – List of feedset objects, the result from news_feedsets()
- facets – Result from facets()
Returns: mapping between a (binding, value) tuple and the entries for that tuple
- news_item(binding, entry) [source]
Returns a modified version of the news entry for use in a specific feed.
You can override this if you e.g. want to customize the title or summary of each entry in a particular feed. The default implementation does not change the entry in any way.
Parameters: - binding (str) – identifier for the feed being constructed, derived from a facet object
- entry (ferenda.DocumentEntry) – The entry object to modify
Returns: The modified entry
Return type:
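A sketch of an override that tweaks the entry title in one particular feed (the binding value checked here is hypothetical):

    from ferenda import DocumentRepository

    class MyRepo(DocumentRepository):
        def news_item(self, binding, entry):
            # prefix titles in the publisher-based feeds, leave others alone
            if binding == "dcterms_publisher":
                entry.title = "[%s] %s" % (binding, entry.title)
            return entry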
- news_generate_feeds(feedsets, generate_html=True) [source]
Creates a set of Atom feeds (and optionally HTML equivalents) by calling news_write_atom() for each feed in feedsets.
Parameters: - feedsets (list) – the result of news_feedsets()
- generate_html (bool) – Whether to generate HTML equivalents of the atom feeds
- news_write_atom(entries, title, slug, archivesize=100) [source]
Given a list of Atom entry-like objects, including links to RDF and PDF files (if applicable), create a rinfo-compatible Atom feed, optionally splitting into archives.
Parameters: - entries (list) – DocumentEntry objects
- title (str) – feed title
- slug (str) – used for constructing the path where the Atom files are stored and the URL where it's published
- archivesize (int) – The amount of entries in each archive file. The main file might contain up to 2 x this amount.
- frontpage_content(primary=False) [source]
If the module wants to provide any particular content on the frontpage, it can do so by returning an XHTML fragment (in text form) here.
Parameters: primary (bool) – Whether the caller wants the module to take primary responsibility for the frontpage content. If False, the caller only expects a smaller amount of content (like a smaller presentation of the repository and the documents it contains).
Returns: the XHTML fragment
Return type: str
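A minimal sketch of an override; the returned markup is just an example:

    from ferenda import DocumentRepository

    class MyRepo(DocumentRepository):
        def frontpage_content(self, primary=False):
            if primary:
                return ("<h2>My documents</h2>"
                        "<p>A longer presentation of this repository.</p>")
            return "<p>A short presentation of this repository.</p>"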
- get_status() [source]
Returns basic data about the state of this repository, used by status(). Returns a dict of dicts, one per state ('download', 'parse' and 'generated'), each containing lists under the 'exists' and 'todo' keys.
Returns: Status information
Return type: dict
- tabs() [source]
Get the navigation menu segment(s) provided by this docrepo.
Returns a list of tuples, where each tuple will be rendered as a tab in the main UI. The first element of the tuple is the link text, and the second is the link destination. Normally, a module will only return a single tab.
Returns: (link text, link destination) tuples
Return type: list
Example:
    >>> d = DocumentRepository()
    >>> d.tabs()
    [('base', 'http://localhost:8000/dataset/base')]
- footer() [source]
Get a list of resources provided by this repo for publication in the site footer.
Works like tabs(), but normally returns an empty list. The repo ferenda.sources.general.Static is an exception.
Returns: (link text, link destination) tuples
Return type: list