Parsing and representing document metadata

Every document has a number of properties, such as its title, authors, publication date, type and much more. These properties are called metadata. Ferenda does not prescribe a fixed set of metadata properties for any particular document type. Instead, it encourages you to describe the document using RDF and any suitable vocabulary (or vocabularies). If you are new to RDF, a good starting point is the RDF Primer document.

Each document has a meta property which initially is an empty RDFLib Graph object. As part of the parse() method, you should fill this graph with triples (metadata statements) about the document.

Document URI

In order to create these metadata statements, you should first create a suitable URI for your document. Preferably, this should be a URI based on the URL where your web site will be published. For example, if you plan on publishing it on http://mynetstandards.org/, a URI for RFC 4711 might be http://mynetstandards.org/res/rfc/4711 (i.e. based on the base URL, the docrepo alias, and the basefile). By changing the url variable in your project configuration file, you can set the base URL from which all document URIs are derived. If you wish to have more control over the exact way URIs are constructed, you can override canonical_uri().
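As a rough illustration of how the three parts combine, the sketch below joins a base URL, a docrepo alias and a basefile into the example URI given above. The function build_document_uri is a hypothetical helper written for this example; it is not part of Ferenda, and the real canonical_uri() may work differently.

```python
def build_document_uri(base_url, alias, basefile):
    """Join base URL, docrepo alias and basefile into a document URI.

    Hypothetical helper for illustration only; not part of Ferenda.
    """
    # Drop any trailing slash from the base URL before joining the parts
    return "%s/res/%s/%s" % (base_url.rstrip("/"), alias, basefile)


uri = build_document_uri("http://mynetstandards.org/", "rfc", "4711")
print(uri)
```

Changing only the first argument moves every derived URI to a new site, which mirrors the effect the text describes for the url configuration variable.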

Note

In some cases, there will be another canonical URI for the document you’re describing, used by other people in other contexts. In these cases, you should specify that the metadata you’re publishing is about the exact same object by adding a triple with the predicate owl:sameAs and that other canonical URI as its value.

The URI for any document is available as a uri property.

Adding metadata using the RDFLib API

With this, you can create metadata for your document using the RDFLib Graph API.

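    # Using the plain RDFLib Graph API
    def parse_metadata_from_soup(self, soup, doc):
        from datetime import datetime
        from rdflib import Literal, Namespace, RDF, URIRef
        title = "My Document title"
        authors = ["Fred Bloggs", "Joe Shmoe"]
        identifier = "Docno 2013:4711"
        pubdate = datetime(2013, 1, 6, 10, 8, 0)
        # Set up the namespaces used below
        DCTERMS = Namespace("http://purl.org/dc/terms/")
        PROV = Namespace("http://www.w3.org/ns/prov#")
        # Every triple uses the document URI as its subject
        subject = URIRef(doc.uri)
        doc.meta.add((subject, RDF.type, self.rdf_type))
        doc.meta.add((subject, PROV.wasGeneratedBy,
                      Literal(self.qualified_class_name())))
        doc.meta.add((subject, DCTERMS.title, Literal(title, lang=doc.lang)))
        doc.meta.add((subject, DCTERMS.identifier, Literal(identifier)))
        doc.meta.add((subject, DCTERMS.issued, Literal(pubdate)))
        for author in authors:
            doc.meta.add((subject, DCTERMS.author, Literal(author)))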

A simpler way of adding metadata

The default RDFLib graph API is somewhat cumbersome for adding triples to a metadata graph. Ferenda has a convenience wrapper, Describer (itself a subclass of rdflib.extras.describer.Describer), that makes this somewhat easier. The ns class property also contains a number of references to popular vocabularies. The above can be made more succinct like this:

    # Simpler way
    def parse_metadata_from_soup(self, soup, doc):
        from ferenda import Describer
        from datetime import datetime
        title = "My Document title"
        authors = ["Fred Bloggs", "Joe Shmoe"]
        identifier = "Docno 2013:4711"
        pubdate = datetime(2013, 1, 6, 10, 8, 0)
        # The Describer remembers the graph and the subject URI, so each
        # call below only needs a predicate and a value
        d = Describer(doc.meta, doc.uri)
        d.rdftype(self.rdf_type)
        d.value(self.ns['prov'].wasGeneratedBy, self.qualified_class_name())
        d.value(self.ns['dcterms'].title, title, lang=doc.lang)
        d.value(self.ns['dcterms'].identifier, identifier)
        d.value(self.ns['dcterms'].issued, pubdate)
        for author in authors:
            d.value(self.ns['dcterms'].author, author)

Note

parse_metadata_from_soup() doesn’t return anything. It only modifies the doc object passed to it.

Vocabularies

Each RDF vocabulary is defined by a URI, and all terms (types and properties) of that vocabulary are typically derived directly from it. The vocabulary URI therefore acts as a namespace. Like namespaces in XML, a shorter prefix is often assigned to the namespace so that one can use rdf:type rather than http://www.w3.org/1999/02/22-rdf-syntax-ns#type. The DocumentRepository object keeps a dictionary of common (prefix, namespace) pairs in the class property ns. Your code can modify this dictionary in order to add vocabulary terms relevant for your documents.
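The relationship between prefixes and namespaces can be sketched with a plain dictionary. This is an illustration under simplifying assumptions: the dictionary below uses plain strings, whereas the ns property in Ferenda holds namespace objects that support attribute access, and the expand function is a hypothetical helper that is not part of Ferenda.

```python
# A few well-known (prefix, namespace) pairs, stored as plain strings
ns = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dcterms": "http://purl.org/dc/terms/",
}


def expand(prefixed_name, namespaces):
    """Turn a prefixed name such as rdf:type into a full URI.

    Hypothetical helper for illustration only.
    """
    prefix, term = prefixed_name.split(":", 1)
    return namespaces[prefix] + term


# Adding another vocabulary is just a matter of adding a new entry
ns["bibo"] = "http://purl.org/ontology/bibo/"

print(expand("rdf:type", ns))
print(expand("bibo:DocumentPart", ns))
```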

Serialization of metadata

The render_xhtml() method serializes all information in doc.body and doc.meta to an XHTML+RDFa file (the exact location is given by parsed_path()). The metadata specified by doc.meta ends up in the <head> section of this XHTML file.

For convenience, the actual RDF statements are also distilled to a separate RDF/XML file found alongside this file (the location is given by distilled_path()).

Metadata about parts of the document

Just like the main Document object, individual parts of the document (represented as ferenda.elements objects) can have uri and meta properties. Unlike the main Document object, these properties are not initialized beforehand. But if you do create these properties, they are used to serialize metadata into RDFa properties for each part:

    def parse_document_from_soup(self, soup, doc):
        from ferenda.elements import Page
        from ferenda import Describer
        part = Page(["This is a part of a document"],
                    ordinal=42,
                    uri="http://example.org/doc#42",
                    meta=self.make_graph())
        d = Describer(part.meta, part.uri)
        d.rdftype(self.ns['bibo'].DocumentPart)
        # the dcterms:identifier for a document part is often whatever
        # would be the preferred way to cite that part in another
        # document
        d.value(self.ns['dcterms'].identifier, "Doc:4711, p 42")

This results in the following document fragment:

<div xmlns="http://www.w3.org/1999/xhtml"
     about="http://example.org/doc#42"
     typeof="bibo:DocumentPart"
     class="page">
  <span property="dcterms:identifier"
        content="Doc:4711, p 42"
        xml:lang=""/>
     This is a part of a document
</div>