In some cases, a document collection may be available from multiple sources, with varying degrees of completeness and/or quality. For example, in a collection of US patents, some patents may be available in structured XML with good metadata through an easy-to-use API, some in tag-soup style HTML with no metadata, requiring screen scraping, and some in the form of TIFF files that you scanned yourself. The implementation of both download() and parse() will differ wildly for these sources. You’ll have something like this:
from ferenda import DocumentRepository, CompositeRepository
from ferenda.decorators import managedparsing


class XMLPatents(DocumentRepository):
    alias = "patxml"

    def download(self, basefile=None):
        download_from_api()

    @managedparsing
    def parse(self, doc):
        return self.transform_patent_xml_to_xhtml(doc)


class HTMLPatents(DocumentRepository):
    alias = "pathtml"

    def download(self, basefile=None):
        screenscrape()

    @managedparsing
    def parse(self, doc):
        return self.analyze_tagsoup(doc)


class ScannedPatents(DocumentRepository):
    alias = "patscan"

    # Assume that we, when we scanned the documents, placed them in their
    # correct place under data/patscan/downloaded
    def download(self, basefile=None):
        pass

    @managedparsing
    def parse(self, doc):
        x = self.ocr_and_structure(doc)
        return True
But since the results of all three parse() implementations are XHTML1.1+RDFa documents (possibly with varying degrees of data fidelity), the implementation of generate() will be substantially the same. Furthermore, you probably want to present a unified document collection to the end user, presenting documents derived from structured XML if they’re available, documents derived from tagsoup HTML if an XML version wasn’t available, and finally documents derived from your scanned documents if nothing else is available.
The class CompositeRepository makes this possible. You specify a number of subordinate docrepo classes using the subrepos class property.
class CompositePatents(CompositeRepository):
    alias = "pat"

    # Specify the classes in order of preference for parsed documents.
    # Only if XMLPatents does not have a specific patent will HTMLPatents
    # get the chance to provide it through its parse method
    subrepos = XMLPatents, HTMLPatents, ScannedPatents

    def generate(self, basefile, otherrepos=[]):
        # Optional code to transform parsed XHTML1.1+RDFa documents
        # into browser-ready HTML5, regardless of whether these are
        # derived from structured XML, tagsoup HTML or scanned
        # TIFFs. If your parse() method can make these parsed
        # documents sufficiently alike and generic, you might not need
        # to implement this method at all.
        self.do_the_work(basefile)
The CompositeRepository docrepo then acts as a proxy for all of your specialized repositories:
$ ./ferenda-build.py patents.CompositePatents enable

# calls download() for all subrepos
$ ./ferenda-build.py pat download

# selects the best subrepo that has patent 5,723,765, calls parse()
# for that, then copies the result to pat/parsed/5723765 (or links)
$ ./ferenda-build.py pat parse 5723765

# uses the pat/parsed/5723765 data. From here on, we're just like any
# other docrepo.
$ ./ferenda-build.py pat generate 5723765
Note that patents.XMLPatents and the other subrepos are never registered in ferenda.ini. They’re just called behind the scenes by patents.CompositePatents.
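The order-of-preference behaviour can be illustrated with a minimal sketch that does not use Ferenda at all. The Source class, its has() and parse() methods, the function parse_from_best_source() and the patent numbers other than 5723765 are all hypothetical, invented for this illustration; they only model the idea that the first subrepo in the list that has a document gets to provide it.

```python
class Source:
    """Hypothetical stand-in for a subrepo: knows which basefiles it holds."""

    def __init__(self, alias, basefiles):
        self.alias = alias
        self.basefiles = set(basefiles)

    def has(self, basefile):
        return basefile in self.basefiles

    def parse(self, basefile):
        # A real subrepo would produce an XHTML1.1+RDFa document here.
        return "<parsed %s from %s>" % (basefile, self.alias)


def parse_from_best_source(sources, basefile):
    """Try each source in order of preference; the first one that has
    the document provides the parsed result."""
    for source in sources:
        if source.has(basefile):
            return source.alias, source.parse(basefile)
    raise LookupError("no source has %s" % basefile)


# Ordered like subrepos = XMLPatents, HTMLPatents, ScannedPatents
sources = [
    Source("patxml", ["5723765"]),
    Source("pathtml", ["5723765", "4000001"]),
    Source("patscan", ["4000001", "3000001"]),
]

for basefile in ("5723765", "4000001", "3000001"):
    alias, parsed = parse_from_best_source(sources, basefile)
    print(basefile, "provided by", alias)
```

In this sketch, a patent that exists both as XML and as HTML is always taken from the XML source, and the scanned source is only consulted when neither of the others has the document.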
It is not uncommon that source documents in a document repository contain formatting irregularities, sensitive information that must be redacted, or just outright errors. In some cases, your parse implementation can detect and correct these things, but in other cases, the irregularities are so uncommon or unique that this is not possible to do in a general way.
As an alternative, you can patch the source document (or its intermediate representation) before the main part of your parsing logic.
The method patch_if_needed() automates most of this work for you. It expects a basefile and the corresponding source document as a string, looks in a patch directory for a corresponding patch file, and applies it if found.
By default, the patch directory is alongside the data directory. The patch file for document foo in repository bar should be placed in patches/bar/foo.patch. An optional description of the patch (as a plaintext, UTF-8 encoded file) can be placed in patches/bar/foo.desc.
patch_if_needed() returns a tuple (text, description). If there was no available patch, text is identical to the text passed in and description is None. If there was a patch available and it applied cleanly, text is the patched text and description is a description of the patch (or “(No patch description available)”). If there was a patch, but it didn’t apply cleanly, a PatchError is raised.
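The lookup and return contract described above can be sketched as follows. This is not Ferenda's implementation: patch_if_needed_sketch(), its parameters and the local PatchError class are stand-ins written for this example, and the actual patching is delegated to a caller-supplied apply_patch function (assumed to raise ValueError when a patch does not apply cleanly).

```python
from pathlib import Path


class PatchError(Exception):
    """Local stand-in for the exception named in the text."""


def patch_if_needed_sketch(patchdir, alias, basefile, text, apply_patch):
    """Return (text, description) following the contract described above.

    apply_patch is a caller-supplied function (patch, text) -> text that
    raises ValueError if the patch does not apply cleanly.
    """
    patchfile = Path(patchdir) / alias / (basefile + ".patch")
    if not patchfile.exists():
        # No patch available: the text is passed through unchanged.
        return text, None

    descfile = Path(patchdir) / alias / (basefile + ".desc")
    if descfile.exists():
        description = descfile.read_text(encoding="utf-8").strip()
    else:
        description = "(No patch description available)"

    try:
        patched = apply_patch(patchfile.read_text(encoding="utf-8"), text)
    except ValueError as e:
        raise PatchError("patch for %s did not apply cleanly" % basefile) from e
    return patched, description
```

A caller would typically unpack the result, for example text, description = patch_if_needed_sketch(...), and record the description somewhere in the parsed document so that readers can see that the source was altered.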
There is a mkpatch command in the Devel class which aims to automate the creation of patch files. It does not work at the moment.
Ferenda contains a general docrepo class that fetches data from a separate MediaWiki server and stores this as annotations/descriptions related to the documents in your main docrepos. This makes it possible to present a source document and commentary on it (including annotations about individual sections) side-by-side.
Ferenda also contains a general docrepo class that lists all keywords used by documents in your main docrepos (by default, it looks for all dcterms:subject properties used in any document) and generates a document for each of them. These documents have no content of their own, but act as hub pages that list all documents that use a certain keyword in one place.
When used together with the MediaWiki module above, this makes it possible to write an editorial description of each keyword, which is then presented alongside the list of documents that use that keyword.
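The hub-page idea amounts to inverting a "document has keywords" relation into a "keyword has documents" one. The following sketch shows that inversion with plain Python data structures; the subjects mapping, the keyword strings and build_keyword_index() are made up for this illustration and say nothing about how Ferenda stores or queries dcterms:subject values.

```python
from collections import defaultdict

# Hypothetical input: basefile -> keywords found as dcterms:subject values
subjects = {
    "rfc2616": ["HTTP", "Web"],
    "rfc7230": ["HTTP"],
    "rfc5322": ["Email"],
}


def build_keyword_index(subjects):
    """Invert a document -> keywords mapping into keyword -> documents."""
    index = defaultdict(list)
    for basefile, keywords in sorted(subjects.items()):
        for keyword in keywords:
            index[keyword].append(basefile)
    return dict(index)


index = build_keyword_index(subjects)
for keyword in sorted(index):
    # Each entry corresponds to one hub page listing the documents.
    print(keyword, "->", ", ".join(index[keyword]))
```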
Custom common data
In many cases, you want to describe documents using references to other things that are not documents, but which should be named using URIs rather than plain text literals. This includes things like companies, publishing entities, print series and abstract things like the topic/keyword of a document. You can define an RDF graph containing more information about each such thing that you know of beforehand. For example, if we want to model that some RFCs are published in the Internet Architecture Board (IAB) stream, we can define the following small graph:
<http://localhost:8000/ext/iab> a foaf:Organization;
    foaf:name "Internet Architecture Board (IAB)";
    skos:altLabel "IAB";
    foaf:homepage <https://www.iab.org/> .
If this is placed in res/extra/[alias].ttl, eg res/extra/rfc.ttl, the graph is made available as commondata, and is also provided as the third resource_graph argument to any selector/key functions of your Facet objects.
Some parts of ferenda, notably The ReST API for querying, can make use of ontologies that your docrepo uses. This is so far only used to provide human-readable descriptions of predicates used (as determined by rdfs:label or rdfs:comment). Ferenda will try to find an ontology for any namespace you use in namespaces, and directly supports many common vocabularies (bibo, dc, dcterms, foaf, prov, rdf, rdfs, schema and skos). If you have defined your own custom ontology, place it (in Turtle format) as res/vocab/[alias].ttl, eg. res/vocab/rfc.ttl to make Ferenda read it.
It’s common to use ferenda with document collections with tens of thousands of documents. If a single document takes a second to parse, it means the entire document collection will take three hours or more, which is not ideal for quick turnaround. Ferenda, and in particular the ferenda-build.py tool, can run tasks in parallel to speed things up.
Multiprocessing on a single machine
The simplest way of speeding up processing is to use the processes parameter, eg:
./ferenda-build.py rfc parse --all --processes=4
This will create 4 processes (started by a fifth control process), each processing individual documents as instructed by the control process. As a rule of thumb, you should create as many processes as you have CPU cores.
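To get a feel for the numbers, here is a back-of-the-envelope sketch. The function estimated_hours() is a made-up helper, and it assumes that the work divides perfectly evenly across processes, which real runs will only approximate. The figure of 10,000 documents is the low end of "tens of thousands".

```python
import os


def estimated_hours(documents, seconds_per_document, processes=1):
    """Wall-clock estimate, assuming work divides evenly across processes."""
    return documents * seconds_per_document / processes / 3600.0


# 10,000 documents at one second each, first serially, then with 4 processes.
print(round(estimated_hours(10000, 1.0), 2))               # 2.78
print(round(estimated_hours(10000, 1.0, processes=4), 2))  # 0.69

# Rule of thumb from the text: one process per CPU core.
processes = os.cpu_count() or 1
print("./ferenda-build.py rfc parse --all --processes=%d" % processes)
```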
A more complex, but also more scalable way, is to set up a bunch of computers acting as processing clients, together with a main (control) system. Each of these clients must have access to the same code and data directory as the main system (ie they should all mount the same network file system). On each client, you then run (assuming that your main system has the IP address 192.168.1.42, and that this particular client has 4 CPU cores):
./ferenda-build.py all buildclient --serverhost=192.168.1.42 --processes=4
On the main system, you first start a message queue with:
./ferenda-build.py all buildqueue
Then you can run ferenda-build.py as normal but with the buildqueue parameter, eg:
./ferenda-build.py rfc parse --all --buildqueue
This will put each file to be processed in the message queue, where all clients will pick up these jobs and process them.
The clients and the message queue can be kept running indefinitely (although the clients will need to be restarted when you change the code that they’re running).
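The division of labour between the main system and the clients follows the usual job-queue pattern, which can be modelled in a few lines. This sketch is only a model: it uses threads inside one process as stand-ins for client machines, and run_jobs(), the client names and the work function are invented for the example. It makes no claim about how Ferenda's queue is implemented.

```python
import queue
import threading


def run_jobs(basefiles, n_clients, work):
    """Put every basefile on a shared queue and let n_clients workers drain it."""
    jobs = queue.Queue()
    results = {}
    lock = threading.Lock()

    def client(name):
        while True:
            try:
                basefile = jobs.get_nowait()
            except queue.Empty:
                return  # nothing left to do
            outcome = work(basefile)
            with lock:
                results[basefile] = (name, outcome)

    # All jobs are queued before any client starts, so an empty queue
    # means that the work is finished.
    for basefile in basefiles:
        jobs.put(basefile)
    threads = [threading.Thread(target=client, args=("client-%d" % i,))
               for i in range(n_clients)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


results = run_jobs(["rfc%d" % n for n in range(1, 21)], n_clients=3,
                   work=lambda basefile: basefile.upper())
print(len(results))  # 20
```

Each job is handled by exactly one client, and which client gets which job is not determined in advance. This is why all clients need access to the same code and data directory.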
If you’re not running ferenda on Windows, you can skip the separate message queue process. Just start your clients like above, then start ferenda-build.py on your main system with the buildserver parameter, eg:
./ferenda-build.py rfc parse --all --buildserver
This sets up a temporary in-subprocess message queue that your clients will connect to as soon as it’s up.
This in-subprocess queue does not work on Windows. On that platform you’ll need to run the message queue separately, as described initially.